
Thinking Fast, Thinking Wrong: Intuitiveness Modulates LLM Counterfactual Reasoning in Policy Evaluation

Yanjie He

Score: 8.500
LLM: n/a
Embedding: 0.469
Recency: n/a

Feedback

Why It Matters

Understanding how LLMs handle counterfactual reasoning is crucial for their application in real-world policy contexts, where intuitive biases can lead to misinterpretations of complex data.

Contributions

  • Establishes a benchmark for evaluating LLMs in policy contexts and identifies the significant role of intuitiveness in their reasoning performance.

Insights

  • LLMs may have knowledge but struggle to apply it effectively when findings contradict common intuitions.

Limitations

  • The study's findings may not generalize across all domains of policy evaluation or different types of reasoning tasks.

Tags

  • benchmark
  • evaluation
  • llm
  • reasoning

Abstract

arXiv:2604.10511v1. Large language models (LLMs) are increasingly used for causal and counterfactual reasoning, yet their reliability in real-world policy evaluation remains underexplored. We construct a benchmark of 40 empirical policy evaluation cases drawn from economics and social science, each grounded in peer-reviewed evidence and classified by intuitiveness -- whether the empirical finding aligns with (obvious), is unclear relative to (ambiguous), or contradicts (counter-intuitive) common prior expectations. We evaluate four frontier LLMs across five prompting strategies with 2,400 experimental trials and analyze the results using mixed-effects logistic regression. Our findings reveal three key results: (1) a chain-of-thought (CoT) paradox, where chain-of-thought prompting dramatically improves performance on obvious cases but this benefit is nearly eliminated on counter-intuitive ones (interaction OR = 0.053, $p < 0.001$); (2) intuitiveness as the dominant factor, explaining more variance than model choice or prompting strategy (ICC = 0.537); and (3) a knowledge-reasoning dissociation, where citation-based familiarity is unrelated to accuracy ($p = 0.53$), suggesting models possess relevant knowledge but fail to reason with it when findings contradict intuition. We frame these results through the lens of dual-process theory (System 1 vs. System 2) and argue that current LLMs' "slow thinking" may be little more than "slow talking" -- they produce the form of deliberative reasoning without the substance.
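To make the reported interaction concrete, the sketch below fits a logistic regression of per-trial correctness on intuitiveness, prompting strategy, and their interaction. It is a minimal illustration under stated assumptions, not the paper's actual pipeline: the column names (`correct`, `intuitiveness`, `prompt`) and the synthetic data are hypothetical, and the model omits the random effects (e.g., per-case intercepts) that the paper's mixed-effects specification implies.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical trial-level data: one row per experimental trial.
# Column names and values are assumptions for illustration only.
rng = np.random.default_rng(0)
n = 2400
df = pd.DataFrame({
    "correct": rng.integers(0, 2, size=n),          # 1 = model answered correctly
    "intuitiveness": rng.choice(
        ["obvious", "ambiguous", "counter-intuitive"], size=n),  # case classification
    "prompt": rng.choice(["direct", "cot"], size=n),             # prompting strategy
})

# Fixed-effects logistic regression with an intuitiveness x prompt interaction.
# The paper's analysis uses mixed-effects logistic regression (random effects
# over cases/models); this simplified sketch leaves those out.
fit = smf.logit(
    "correct ~ C(intuitiveness, Treatment(reference='obvious')) * C(prompt)",
    data=df,
).fit(disp=False)

# Exponentiated coefficients are odds ratios; the interaction terms correspond
# to quantities like the reported CoT-by-counter-intuitive OR.
print(np.exp(fit.params))
```

In the full mixed-effects version, the case-level ICC quoted in the abstract would typically be computed on the latent scale as sigma^2_case / (sigma^2_case + pi^2/3), where sigma^2_case is the variance of the random case intercepts.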