Superficial Beliefs in LLM Decision-Making
Gabriel Freedman, Francesca Toni
Why It Matters
If the reasons an LLM gives for a choice only partly match what actually drives that choice, its explanations cannot be taken at face value. That gap bears directly on the reliability and interpretability of LLMs deployed in real-world applications.
The paper explores the concept of 'superficial belief' in LLM decision-making, where models show structured behavior without fully articulated beliefs.
Summary
The paper investigates whether large language models (LLMs) make decisions based on systematic structures or merely imitate rationales. Through synthetic binary decision settings, it examines the alignment between the attributes LLMs claim to prioritize and those inferred from their behavior. The study finds that while LLMs' behavior is not arbitrary, their explicit reasons do not fully align with the inferred decision drivers, suggesting a concept of 'superficial belief' in LLM decision-making.
Key contributions
- Analysis of LLM decision-making through synthetic binary decision settings.
- Comparison of self-reported and behaviorally inferred decision drivers in LLMs.
- Introduction of the concept of 'superficial belief' in LLM decision-making.
Notable insights
- LLMs exhibit structured decision-making behavior that can be predicted, yet their explicit reasoning does not fully align with inferred decision drivers.
- Under the 'superficial belief' interpretation, LLMs behave as if guided by probabilistic local priorities over attributes, while having only limited verbal access to the attributes that drive their decisions.
Possible limitations
- Not stated in the abstract
Abstract
arXiv:2606.11016v1
We ask whether large language models (LLMs) merely imitate rationales when choosing between two options, or whether their choices reflect a systematic underlying decision structure. Using synthetic binary decision settings in which models choose between profiles defined by graded attributes, we compare the attribute a model says mattered most with the attribute that best explains its choice under a behavioural model fit to prior decisions. The behavioural model predicts held-out choices well, showing that model behaviour is systematically related to the visible attributes rather than being random. However, direct self-reports and a separate score-based judge recover the behaviourally inferred driver only partially. The resulting picture is neither one of arbitrary behaviour nor one of fully articulated belief: outputs are structured enough to support prediction, but explicit reasons track the recovered driver only imperfectly. This qualitative pattern persists across prompt-order and sampling perturbations, alternative behavioural models, targeted occlusion analyses, and structurally varied decision settings. We interpret this as evidence for 'superficial belief' in LLM decision-making: models behave as if guided by probabilistic local priorities over attributes, while having only limited verbal access to the attributes that drive their decisions.
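Illustrative sketch
The comparison described in the abstract (fit a behavioural model to prior choices, check it on held-out choices, then compare the inferred driver with the self-reported one) can be sketched roughly as follows. This is not the authors' implementation. The abstract does not specify the behavioural model, so a plain logistic regression on attribute differences is assumed here, and every name in the block (simulate_choices, fit_behavioural_model, inferred_driver, agreement_rate, the 1 to 5 attribute scale, the hidden weights) is hypothetical. The simulated chooser merely stands in for an LLM so the example runs on its own.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def simulate_choices(rng, n, hidden_w):
    """Stand-in for querying an LLM (assumption, not the paper's setup).

    Each decision compares profiles A and B with attributes graded 1..5.
    Returns (diff, choice) pairs, where diff = A - B per attribute and
    choice is 1 if A was picked, sampled from a logistic rule.
    """
    data = []
    for _ in range(n):
        a = [rng.randint(1, 5) for _ in hidden_w]
        b = [rng.randint(1, 5) for _ in hidden_w]
        diff = [x - y for x, y in zip(a, b)]
        p_a = sigmoid(sum(w * d for w, d in zip(hidden_w, diff)))
        data.append((diff, 1 if rng.random() < p_a else 0))
    return data


def fit_behavioural_model(diffs, choices, lr=0.1, epochs=300):
    """Logistic regression by batch gradient descent (assumed model)."""
    n_attr = len(diffs[0])
    w = [0.0] * n_attr
    for _ in range(epochs):
        grad = [0.0] * n_attr
        for x, y in zip(diffs, choices):
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
            for i in range(n_attr):
                grad[i] += (p - y) * x[i]
        w = [wi - lr * g / len(diffs) for wi, g in zip(w, grad)]
    return w


def held_out_accuracy(w, data):
    """Fraction of held-out choices predicted by the fitted weights."""
    hits = 0
    for diff, choice in data:
        predicted = 1 if sum(wi * xi for wi, xi in zip(w, diff)) >= 0 else 0
        hits += predicted == choice
    return hits / len(data)


def inferred_driver(w, diff, chose_a):
    """Index of the attribute whose weighted difference most supports
    the choice that was actually made (one possible definition)."""
    sign = 1 if chose_a else -1
    contributions = [sign * wi * xi for wi, xi in zip(w, diff)]
    return max(range(len(w)), key=lambda i: contributions[i])


def agreement_rate(inferred, reported):
    """Share of decisions where the self-reported attribute index
    matches the behaviourally inferred one."""
    return sum(a == b for a, b in zip(inferred, reported)) / len(inferred)


if __name__ == "__main__":
    rng = random.Random(0)
    hidden = [2.0, 0.5, 0.0]  # hypothetical priorities of the stand-in chooser
    train = simulate_choices(rng, 400, hidden)
    test = simulate_choices(rng, 200, hidden)
    w = fit_behavioural_model([d for d, _ in train], [c for _, c in train])
    print("fitted weights:", [round(x, 2) for x in w])
    print("held-out accuracy:", held_out_accuracy(w, test))
```

In a real replication, the (diff, choice) pairs would come from the model's recorded decisions, and the reported list passed to agreement_rate would come from asking the model which attribute mattered most. A high held-out accuracy combined with a low agreement rate would correspond to the pattern the abstract calls 'superficial belief'.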