An LLM-Native Psychometric Instrument Does Not Predict LLM Behavior: Evidence Across 25 Models
Juan Manuel Contreras
Why It Matters
What makes this one worth your time
Understanding the limitations of LLM self-reports is crucial for improving model alignment and reliability in real-world applications.
A new psychometric instrument built specifically for LLMs shows that the traits models report about themselves do not predict how human raters score their actual behavior.
Summary
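The paper presents what the authors describe as the first psychometric instrument whose constructs are derived bottom-up from LLM behavioral affordances, rather than borrowed from human trait inventories, and tests whether its self-report scores predict behavior. The authors administered 300 items (240 direct Likert and 60 scenario-based) to 25 LLMs across 17 model families, 30 times each. Exploratory factor analysis yielded a replicable 5-factor structure: Responsiveness, Deference, Boldness, Guardedness, and Verbosity. Predictive validity failed, however. Across 2,500 open-ended behavioral samples, human raters and an LLM judge ensemble agreed with each other (mean r = .51), but self-report tracked neither (mean r = -.01 with humans, .13 with judges). On Responsiveness, self-report correlated with LLM judges but not with humans, which points to variance shared between self-report items and LLM judges that human observers do not pick up.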
Key contributions
- Development of what the authors describe as the first psychometric instrument whose constructs are derived bottom-up from LLM behavioral affordances.
- Identification of a 5-factor structure through exploratory factor analysis.
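- Empirical evidence that LLM self-reports do not predict observed behavior: mean self-report correlation of -.01 with human ratings and .13 with LLM judge ratings, with no factor-level self-report-to-human confidence interval excluding zero.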
Notable insights
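- On Responsiveness, self-report correlated with LLM judges (r = .53) but not with humans (r = .04), even though humans and judges agreed (r = .59). Self-report items and LLM judges therefore share variance that human observers do not, a confound that within-ensemble reliability checks cannot detect.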
- The study introduces a bottom-up approach to psychometric instrument design for LLMs, focusing on behavioral affordances.
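The Responsiveness pattern above (judges agree with humans and with self-report, while self-report and humans are unrelated) can be reproduced with a toy construction. The sketch below is only an illustration under stated assumptions: the vectors `human_signal` and `shared_style` are hypothetical, hand-picked to be uncorrelated, and are not data from the paper. It uses Pearson correlation, implemented with the standard library.

```python
import math

def pearson_r(x, y):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)

# Hypothetical toy components (NOT the paper's data), chosen to be uncorrelated:
human_signal = [1.0, -1.0, 1.0, -1.0]   # what human raters respond to
shared_style = [1.0, 1.0, -1.0, -1.0]   # what only LLM-side measures respond to

human = human_signal                     # human ratings track the behavior
self_report = shared_style               # self-report tracks the shared style
# Assumption: the judge score mixes both components in equal parts.
judge = [h + s for h, s in zip(human_signal, shared_style)]

r_human_judge = pearson_r(human, judge)        # substantial agreement
r_self_judge = pearson_r(self_report, judge)   # substantial agreement
r_self_human = pearson_r(self_report, human)   # no relationship

print(round(r_human_judge, 3), round(r_self_judge, 3), round(r_self_human, 3))
```

In this toy setup, checking only human-judge agreement would not reveal that part of the judge's score comes from a component humans never see. Only the three-way comparison exposes it.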
Possible limitations
- Not stated in the abstract
Abstract
arXiv:2606.09843v1 (cross-listed). Large language models (LLMs) produce stable self-reports on personality inventories, but these self-reports do not predict observed behavior. Whether this gap reflects a mismatch between LLMs and human trait constructs, or a deeper property of LLM self-report itself, has been unresolved. We constructed the first psychometric instrument whose constructs are derived bottom-up from LLM behavioral affordances via exploratory factor analysis (EFA). We administered 300 items (240 direct Likert + 60 scenario-based) spanning 12 candidate behavioral dimensions to 25 LLMs across 17 model families, each item administered 30 times. EFA yielded a 5-factor structure -- Responsiveness, Deference, Boldness, Guardedness, and Verbosity -- with excellent split-half replicability (all Tucker $\phi \geq .957$) and internal consistency (all $\alpha \geq .930$). To test predictive validity, we collected 2,500 open-ended behavioral samples rated by 151 human raters and a three-judge LLM ensemble. Human and judge ratings agreed ($\bar{r} = .51$), but neither tracked self-report: self-report--human $\bar{r} = -.01$, self-report--judge $\bar{r} = .13$, with no factor-level self-report--human CI excluding zero. On Responsiveness, self-report correlated with LLM judges ($r = .53$) but not humans ($r = .04$), even though humans and judges agreed ($r = .59$) -- indicating self-report items and LLM judges share variance that human observers do not, a confound invisible to within-ensemble reliability checks. We release the instrument as a diagnostic probe for alignment-shaped self-description and a concrete risk factor for LLM-as-judge pipelines.
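For readers unfamiliar with the two reliability statistics the abstract reports, the following is a minimal sketch of their textbook definitions: Tucker's congruence coefficient $\phi$ between two factor-loading vectors, and Cronbach's $\alpha$ for a respondents-by-items score matrix. The function names and toy inputs are assumptions made for illustration. This is not the authors' analysis code, and the abstract does not specify their exact computation.

```python
import math
from statistics import variance

def tucker_phi(loadings_a, loadings_b):
    """Tucker's congruence coefficient between two loading vectors.

    phi = sum(a*b) / sqrt(sum(a^2) * sum(b^2)); loadings are not mean-centered.
    """
    num = sum(a * b for a, b in zip(loadings_a, loadings_b))
    den = math.sqrt(sum(a * a for a in loadings_a) * sum(b * b for b in loadings_b))
    return num / den

def cronbach_alpha(rows):
    """Cronbach's alpha for a list of rows, one row of item scores per respondent.

    alpha = k/(k-1) * (1 - sum(item variances) / variance(total scores)).
    """
    k = len(rows[0])
    item_vars = [variance([row[j] for row in rows]) for j in range(k)]
    total_var = variance([sum(row) for row in rows])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)

# Hypothetical toy inputs, for illustration only.
half_1 = [0.80, 0.70, 0.10]   # loadings of one factor in split half 1
half_2 = [0.78, 0.72, 0.12]   # loadings of the same factor in split half 2
scores = [[1, 2, 1], [3, 3, 4], [5, 4, 5], [2, 2, 3]]  # 4 respondents x 3 items

print(tucker_phi(half_1, half_2))
print(cronbach_alpha(scores))
```

A $\phi$ near 1 means the two halves recover nearly proportional loading patterns, while a high $\alpha$ means the items on a factor covary strongly. Neither statistic says anything about whether the scores predict behavior, which is the separate question the predictive-validity test addresses.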