Doing What They Say, Not What They Reason: Locating the Faithfulness Gap in LLM Agents

Yufeng Wang

Published Jul 8, 2026

Why It Matters

Understanding and improving the faithfulness of LLM agents in decision-making is crucial for their reliable application in complex simulations and real-world scenarios.

The paper examines the faithfulness gap in LLM agents' decision-making processes, highlighting distinct error patterns and potential improvements.

Summary

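The paper investigates whether large language model (LLM) agents act on their stated reasoning, using a controlled Texas Poker simulator that supplies a verifiable reference action for every decision. It splits the faithfulness gap into two steps: reasoning-to-conclusion (does the stated decision follow from the agent's own reasoning?) and conclusion-to-action (does the agent execute what it states?). Conclusion-to-action is reliable, with inconsistency of 0.7% to 1.4% once the conclusion is read from an explicit tag; reasoning-to-conclusion is where fidelity breaks down. Errors there split roughly evenly between bad inputs, borderline cases, and rule misapplication, in model-dependent proportions. When an agent misapplies its own stated rule, it almost always errs in the risk-averse direction, and instructing it to apply the rule mechanically halves the misapplication rate.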
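The two-step split can be illustrated with a minimal Python sketch. It is not the paper's code: the `Decision` record, its field names, and the `gaps` helper are hypothetical, and it assumes that the action implied by the agent's own restated rule has already been computed for each decision.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Decision:
    # All three fields are illustrative assumptions, not the paper's schema.
    rule_implied: str  # action the agent's own restated rule implies from its own inputs
    stated: str        # conclusion the agent states
    executed: str      # action actually taken in the simulator


def gaps(d: Decision) -> List[str]:
    """Return which of the two faithfulness steps fail for one decision."""
    found = []
    if d.stated != d.rule_implied:
        # The stated decision does not follow from the agent's own reasoning.
        found.append("reasoning-to-conclusion")
    if d.executed != d.stated:
        # The agent does not execute what it states.
        found.append("conclusion-to-action")
    return found


faithful = Decision(rule_implied="call", stated="call", executed="call")
hedged = Decision(rule_implied="call", stated="fold", executed="fold")
print(gaps(faithful), gaps(hedged))
```

Checking the two steps independently keeps a decision that fails both from being counted under only one of them.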

Key contributions

Identification of distinct steps in the faithfulness gap: reasoning-to-conclusion and conclusion-to-action.
Empirical analysis of error patterns in LLM agents' decision-making processes.
Demonstration of model-dependent error composition and risk-averse tendencies.

Notable insights

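The mix of reasoning-to-conclusion errors is model-dependent: rule misapplication accounts for a third of Haiku's interpretable errors but only 8% of DeepSeek's. When an agent does misapply its own rule, it almost always errs in the risk-averse direction (99.5% for Haiku).
Instructing agents to apply rules mechanically halves the misapplication rate (13.9% to 6.8% of decisions) and raises adherence by eight points.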
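A directional-bias probe of this kind could be sketched as follows. This is an illustration under stated assumptions, not the paper's metric: the action set and the fold < call < raise aggressiveness ordering in `AGGRESSION` are assumed, and `risk_averse_share` is a hypothetical helper name.

```python
from typing import Iterable, Optional, Tuple

# Assumed ordering: a lower value means a more cautious action.
AGGRESSION = {"fold": 0, "call": 1, "raise": 2}


def risk_averse_share(pairs: Iterable[Tuple[str, str]]) -> Optional[float]:
    """pairs holds (rule_implied_action, stated_action) for each decision.

    Only misapplications (the two actions differ) are counted. Returns the
    fraction of them where the stated action is more cautious than the action
    the agent's own rule implies, or None if there are no misapplications.
    """
    misapplied = [(rule, stated) for rule, stated in pairs if rule != stated]
    if not misapplied:
        return None
    averse = sum(AGGRESSION[stated] < AGGRESSION[rule] for rule, stated in misapplied)
    return averse / len(misapplied)


sample = [("raise", "call"), ("call", "fold"), ("fold", "call"), ("call", "call")]
share = risk_averse_share(sample)  # 2 of the 3 misapplications are more cautious
```

A share near 1.0 would correspond to the kind of one-sided override reported for Haiku, while a share near 0.5 would suggest misapplications with no preferred direction.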

Possible limitations

Not stated in the abstract

Abstract

arXiv:2606.00476v2

Do LLM agents act on the reasoning they state? This question of process fidelity is central to LLM-based social simulation, yet hard to measure where no reference for correct behavior exists. We study it in a controlled setting, a Texas Poker simulator with a verifiable reference action for every decision, by splitting the faithfulness gap into two steps: reasoning-to-conclusion (does the stated decision follow from the agent's own reasoning?) and conclusion-to-action (does the agent execute what it states?). The two steps behave very differently. Conclusion-to-action is reliable: inconsistency is 0.7% for Claude Haiku 4.5 and 1.4% for DeepSeek-Reasoner once the conclusion is read from an explicit tag, whereas free-text conclusion extraction reports 22-26%. Reasoning-to-conclusion is where fidelity frays, but not through a single dominant failure. In a step-level diagnostic, the agent's errors split roughly evenly between bad inputs, borderline cases, and rule misapplication (deriving a conclusion that contradicts the agent's own restated rule from inputs it estimated correctly). This composition is model-dependent: rule misapplication accounts for a third of Haiku's interpretable errors but only 8% of DeepSeek's. The one robust signal is directional: when an agent does misapply its own stated rule, it almost always (99.5% for Haiku) errs in the risk-averse direction. The override is partly hedging behavior, not a capability limit: instructing the agent to apply the rule mechanically halves the misapplication rate (13.9% to 6.8% of decisions) and raises adherence by eight points. Process-fidelity evaluation should therefore elicit machine-checkable conclusions and probe for directional biases rather than assume a single upstream failure mode, lest it conflate measurement noise with model behavior.
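To make the point about machine-checkable conclusions concrete, here is a minimal Python sketch of reading a conclusion from an explicit tag and computing a conclusion-to-action inconsistency rate. The abstract does not give the tag format, so the `<decision>` tag, the fold/call/raise action set, and both function names are assumptions for illustration only.

```python
import re
from typing import Iterable, Optional, Tuple

# Hypothetical tag format; the paper's actual tag is not given in the abstract.
TAG = re.compile(r"<decision>\s*(fold|call|raise)\s*</decision>", re.IGNORECASE)


def stated_conclusion(response: str) -> Optional[str]:
    """Return the last tagged conclusion in a response, or None if untagged."""
    matches = TAG.findall(response)
    return matches[-1].lower() if matches else None


def inconsistency_rate(records: Iterable[Tuple[str, str]]) -> float:
    """records holds (response_text, executed_action) pairs.

    Responses without a tag are skipped rather than guessed at, so an
    extraction failure is never counted as an agent inconsistency.
    """
    checked = 0
    mismatched = 0
    for response, executed in records:
        stated = stated_conclusion(response)
        if stated is None:
            continue
        checked += 1
        if stated != executed.lower():
            mismatched += 1
    return mismatched / checked if checked else 0.0


records = [
    ("Folding is tempting here, but the odds favor it. <decision>call</decision>", "call"),
    ("The pot odds are poor. <decision>fold</decision>", "call"),
    ("I think I should probably raise.", "raise"),  # untagged, so skipped
]
rate = inconsistency_rate(records)  # 1 mismatch out of 2 tagged responses
```

The first record shows why free-text extraction can overstate inconsistency: the reasoning mentions folding, but the tagged conclusion is a call, and only the tag is compared with the executed action.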