DAR: Deontic Reasoning with Agentic Harnesses

Guangyao Dou, William Jurayj, Nils Holzenberger, Benjamin Van Durme

Published Jun 5, 2026

Editorial review: 6.8

Relevance: 0.500

Freshness: 0.000

Why It Matters

What makes this one worth your time

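Legal rulesets are long and cross-referenced, so LLMs can fail to locate the rule a given reasoning step needs. This work targets that problem, which could make LLMs more useful for rule-based legal tasks such as computing tax liability or deciding an immigration appeal.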

DAR lets models consult legal rules interactively, which can improve deontic reasoning, though the gains are not uniform across models.

Summary

The paper introduces Deontic Agentic Reasoning (DAR), a framework for enhancing deontic reasoning in language models by allowing them to interact with legal statutes on demand, evaluated through various harnesses on DeonticBench.

Key contributions

Introduction of the DAR framework for deontic reasoning.
Evaluation of DAR under multiple agentic harnesses on DeonticBench.
Identification of performance variability across different model strengths.

Notable insights

Agentic harnesses can improve reasoning performance but may lead to degradation in numerical tasks for weaker models.
The interaction with statutes on demand is a novel approach that could change how LLMs handle complex rule-based reasoning.

Possible limitations

Not stated in the abstract.

Abstract

arXiv:2606.05009v1 (cross-listed)

Deontic reasoning is the task of answering questions by applying explicit rules and policies to case-specific facts, for example computing tax liability under a statute or determining the outcome of an immigration appeal. A key technical challenge for LLM-based deontic reasoning is that the relevant ruleset can be long and cross-referenced, so models may still fail to locate the rules needed for a particular reasoning step. We introduce Deontic Agentic Reasoning (DAR), an agentic reasoning setup in which the model interacts with the statutes on demand. We evaluate DAR under multiple harnesses on hard subsets of DeonticBench. Across these settings, we find that agentic harnesses can push the frontier on deontic reasoning tasks, but improvements are not uniform: weaker models often degrade on numerical tasks while consuming far more tokens.
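To make "interacting with the statutes on demand" concrete, here is a minimal sketch of what a statute-lookup tool over a cross-referenced ruleset might look like. It is not the paper's harness: the ruleset, the section IDs, and the names `STATUTES`, `lookup`, `cross_references`, and `gather` are all hypothetical, and a simple breadth-first traversal stands in for the model's own decisions about which section to request next.

```python
import re

# Hypothetical toy ruleset; section IDs and wording are invented for illustration.
STATUTES = {
    "1": "Tax is 10% of taxable income as defined in section 2.",
    "2": "Taxable income is gross income minus the deduction in section 3.",
    "3": "The standard deduction is 1000.",
}


def lookup(section_id):
    """Tool call: return the text of one section, or a not-found message."""
    return STATUTES.get(section_id, f"Not found: {section_id}")


def cross_references(text):
    """Return the section IDs that a piece of statute text refers to."""
    return re.findall(r"section (\w+)", text)


def gather(start_id):
    """Fetch a section and everything it transitively cites, each at most once.

    In an agentic setup the model would choose each lookup itself; this
    breadth-first loop is only a stand-in for that behaviour.
    """
    seen = {}
    queue = [start_id]
    while queue:
        section_id = queue.pop(0)
        if section_id in seen:
            continue
        seen[section_id] = lookup(section_id)
        queue.extend(cross_references(seen[section_id]))
    return seen
```

Calling `gather("1")` retrieves sections 1, 2, and 3 in that order, so only the rules reachable from the starting section enter the context instead of the whole ruleset. Each extra lookup also adds a round trip, which is one plausible source of the higher token usage the abstract reports.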