Thinking Fast, Thinking Wrong: Intuitiveness Modulates LLM Counterfactual Reasoning in Policy Evaluation
Yanjie He
Why It Matters
What makes this one worth your time
LLMs are increasingly applied to real-world policy questions, where intuitive priors can lead to misreadings of empirical evidence. This paper shows that how intuitive a finding is strongly shapes the accuracy of LLM counterfactual reasoning in policy evaluation.
Summary
This paper evaluates four frontier LLMs on counterfactual reasoning across 40 empirical policy evaluation cases. It finds that the intuitiveness of a finding strongly affects accuracy, and that models appear to hold the relevant knowledge yet fail to reason with it in counter-intuitive scenarios.
Key contributions
- Constructs a benchmark of 40 peer-reviewed policy evaluation cases classified by intuitiveness, and shows that case-level variance exceeds that of model choice or prompting strategy (ICC = 0.671).
Notable insights
- LLMs may hold the relevant knowledge but struggle to apply it when findings contradict common intuitions: citation-based familiarity is unrelated to accuracy (p = 0.84).
Possible limitations
- The study's findings may not generalize across all domains of policy evaluation or different types of reasoning tasks.
Abstract
arXiv:2604.10511v2
Large language models (LLMs) are increasingly used for causal and counterfactual reasoning, yet their reliability in real-world policy evaluation remains underexplored. We construct a benchmark of 40 empirical policy evaluation cases drawn from economics and social science, each grounded in peer-reviewed evidence and classified by intuitiveness -- whether the empirical finding aligns with (obvious), is unclear relative to (ambiguous), or contradicts (counter-intuitive) common prior expectations. We evaluate four frontier LLMs across five prompting strategies with 8,000 experimental trials and analyze the results using mixed-effects logistic regression. Our findings reveal three key results: (1) a chain-of-thought (CoT) paradox, where chain-of-thought prompting dramatically improves performance on obvious cases but this benefit is substantially attenuated on counter-intuitive ones (interaction OR = 0.278, $p < 0.001$); (2) intuitiveness as the dominant factor, with case-level variance exceeding that of model choice or prompting strategy (ICC = 0.671); and (3) a knowledge-reasoning dissociation, where citation-based familiarity is unrelated to accuracy ($p = 0.84$), suggesting models possess relevant knowledge but fail to reason with it when findings contradict intuition. We frame these results through the lens of dual-process theory (System 1 vs. System 2) and argue that current LLMs' "slow thinking" achieves only partial inhibition of intuitive priors -- producing the form of deliberative reasoning without fully delivering its substance.
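To make the reported statistics easier to read, here is a minimal Python sketch of how an interaction odds ratio and an ICC are commonly interpreted in a logistic mixed model. It is not the authors' analysis code. Three things are assumptions that the abstract does not state: the ICC uses the standard latent-variable definition (residual variance of pi^2/3), the design is fully crossed, and `HYPOTHETICAL_COT_OR_OBVIOUS` is a made-up placeholder for the CoT effect on obvious cases.

```python
import math

# Values taken from the abstract.
INTERACTION_OR = 0.278   # CoT x counter-intuitive interaction odds ratio
REPORTED_ICC = 0.671     # case-level intraclass correlation
N_CASES, N_MODELS, N_PROMPTS, N_TRIALS = 40, 4, 5, 8000

# ASSUMPTION: latent-variable ICC for a logistic model, whose residual
# variance on the latent scale is fixed at pi^2 / 3.
LOGISTIC_RESIDUAL_VAR = math.pi ** 2 / 3

# HYPOTHETICAL: illustrative CoT odds ratio on obvious cases (not from the paper).
HYPOTHETICAL_COT_OR_OBVIOUS = 4.0


def icc_from_variance(case_var: float) -> float:
    """Share of latent-scale variance attributable to the case random effect."""
    return case_var / (case_var + LOGISTIC_RESIDUAL_VAR)


def variance_from_icc(icc: float) -> float:
    """Invert the ICC formula to recover the implied random-intercept variance."""
    return icc * LOGISTIC_RESIDUAL_VAR / (1.0 - icc)


def cot_odds_ratio(base_or: float, interaction_or: float, counter_intuitive: bool) -> float:
    """CoT effect on the odds scale; the interaction multiplies the base effect."""
    return base_or * interaction_or if counter_intuitive else base_or


# ASSUMPTION: fully crossed design, so each case/model/prompt cell is repeated equally.
trials_per_cell = N_TRIALS // (N_CASES * N_MODELS * N_PROMPTS)

if __name__ == "__main__":
    print("trials per cell:", trials_per_cell)
    print("implied case variance:", round(variance_from_icc(REPORTED_ICC), 3))
    print("CoT OR, obvious:", cot_odds_ratio(HYPOTHETICAL_COT_OR_OBVIOUS, INTERACTION_OR, False))
    print("CoT OR, counter-intuitive:",
          round(cot_odds_ratio(HYPOTHETICAL_COT_OR_OBVIOUS, INTERACTION_OR, True), 3))
```

Under these assumptions, an interaction odds ratio below 1 means the multiplicative benefit of CoT shrinks on counter-intuitive cases relative to obvious ones; whether any benefit remains depends on the base effect, which the abstract does not report.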