
Thinking Fast, Thinking Wrong: Intuitiveness Modulates LLM Counterfactual Reasoning in Policy Evaluation

Yanjie He

Score: 8.500
LLM: n/a
Embedding: 0.469
Recency: n/a

Feedback

Why It Matters

Understanding how LLMs handle counterfactual reasoning is crucial for their application in real-world policy contexts, where intuitive biases can lead to misinterpretations of complex data.

Contributions

  • Establishes a benchmark for evaluating LLMs in policy contexts and identifies the significant role of intuitiveness in their reasoning performance.

Insights

  • LLMs may have knowledge but struggle to apply it effectively when findings contradict common intuitions.

Limitations

  • The study's findings may not generalize across all domains of policy evaluation or different types of reasoning tasks.

Tags

  • benchmark
  • evaluation
  • llm
  • reasoning

Abstract

arXiv:2604.10511v1. Large language models (LLMs) are increasingly used for causal and counterfactual reasoning, yet their reliability in real-world policy evaluation remains underexplored. We construct a benchmark of 40 empirical policy evaluation cases drawn from economics and social science, each grounded in peer-reviewed evidence and classified by intuitiveness -- whether the empirical finding aligns with (obvious), is unclear relative to (ambiguous), or contradicts (counter-intuitive) common prior expectations. We evaluate four frontier LLMs across five prompting strategies with 2,400 experimental trials and analyze the results using mixed-effects logistic regression. Our findings reveal three key results: (1) a chain-of-thought (CoT) paradox, where chain-of-thought prompting dramatically improves performance on obvious cases but this benefit is nearly eliminated on counter-intuitive ones (interaction OR = 0.053, $p < 0.001$); (2) intuitiveness as the dominant factor, explaining more variance than model choice or prompting strategy (ICC = 0.537); and (3) a knowledge-reasoning dissociation, where citation-based familiarity is unrelated to accuracy ($p = 0.53$), suggesting models possess relevant knowledge but fail to reason with it when findings contradict intuition. We frame these results through the lens of dual-process theory (System 1 vs. System 2) and argue that current LLMs' "slow thinking" may be little more than "slow talking" -- they produce the form of deliberative reasoning without the substance.
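To make the reported interaction concrete, the sketch below fits a logistic regression of per-trial correctness on intuitiveness, prompting strategy, and their interaction. It is a minimal illustration under stated assumptions, not the paper's actual pipeline: the column names (`correct`, `intuitiveness`, `prompt`) and the synthetic data are hypothetical, and the model omits the random effects (e.g., per-case intercepts) that the paper's mixed-effects specification implies.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical trial-level data: one row per experimental trial.
# Column names and values are assumptions for illustration only.
rng = np.random.default_rng(0)
n = 2400
df = pd.DataFrame({
    "correct": rng.integers(0, 2, size=n),          # 1 = model answered correctly
    "intuitiveness": rng.choice(
        ["obvious", "ambiguous", "counter-intuitive"], size=n),  # case classification
    "prompt": rng.choice(["direct", "cot"], size=n),             # prompting strategy
})

# Fixed-effects logistic regression with an intuitiveness x prompt interaction.
# The paper's analysis uses mixed-effects logistic regression (random effects
# over cases/models); this simplified sketch leaves those out.
fit = smf.logit(
    "correct ~ C(intuitiveness, Treatment(reference='obvious')) * C(prompt)",
    data=df,
).fit(disp=False)

# Exponentiated coefficients are odds ratios; the interaction terms correspond
# to quantities like the reported CoT-by-counter-intuitive OR.
print(np.exp(fit.params))
```

In the full mixed-effects version, the case-level ICC quoted in the abstract would typically be computed on the latent scale as sigma^2_case / (sigma^2_case + pi^2/3), where sigma^2_case is the variance of the random case intercepts.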