
An LLM-Native Psychometric Instrument Does Not Predict LLM Behavior: Evidence Across 25 Models

Juan Manuel Contreras

Published Jun 10, 2026
Featured #7
In the daily list Jun 11, 2026
Daily score: 60.3
Editorial review: 6.8
Relevance: 0.466
Freshness: 0.722

Why It Matters

What makes this one worth your time

Understanding the limitations of LLM self-reports is crucial for improving model alignment and reliability in real-world applications.

A new psychometric tool for LLMs reveals discrepancies between self-reported traits and actual behavior.

Summary

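The paper presents a psychometric instrument built specifically for large language models (LLMs) and tests whether its self-report scores predict behavior. LLMs produce stable self-reports on personality inventories, but those self-reports do not predict observed behavior. The authors administered 300 items to 25 LLMs and identified a 5-factor structure through exploratory factor analysis. Self-report scores tracked neither human ratings (mean r = -.01) nor LLM-judge ratings (mean r = .13) of behavior, even though humans and judges agreed with each other (mean r = .51). On the Responsiveness factor, self-report correlated with LLM judges but not with humans, which suggests that self-report items and LLM judges share variance that human observers do not.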

Key contributions

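  • Development of the first psychometric instrument whose constructs are derived bottom-up from LLM behavioral affordances.
  • Identification of a 5-factor structure (Responsiveness, Deference, Boldness, Guardedness, Verbosity) through exploratory factor analysis.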
  • Empirical evidence showing the gap between LLM self-reports and observed behavior.

Notable insights

  • On Responsiveness, LLM self-reports and LLM-judge ratings share variance that human observers do not capture, a confound that within-ensemble reliability checks cannot detect.
  • The study introduces a bottom-up approach to psychometric instrument design for LLMs, focusing on behavioral affordances.
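One way to make the shared-variance point concrete is a first-order partial correlation: how strongly self-report and judge ratings remain associated once the part explained by human ratings is removed. The sketch below plugs in the three Responsiveness correlations reported in the abstract. It is an illustration only: the function name `partial_corr` is ours, the inputs are rounded summary values, and the paper is not stated to have run this analysis.

```python
import math


def partial_corr(r_xy: float, r_xz: float, r_yz: float) -> float:
    """First-order partial correlation of x and y, controlling for z."""
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz**2) * (1 - r_yz**2))


# Responsiveness correlations as reported in the abstract (rounded values).
r_self_judge = 0.53   # self-report vs. LLM judges
r_self_human = 0.04   # self-report vs. human raters
r_human_judge = 0.59  # human raters vs. LLM judges

# Association between self-report and judges that humans do not account for.
# Illustrative arithmetic on summary statistics, not a result from the paper.
residual = partial_corr(r_self_judge, r_self_human, r_human_judge)
print(round(residual, 2))  # 0.63
```

Under these inputs the self-report and judge association does not shrink when human ratings are controlled for, which is consistent with the claim that the overlap lies outside what human observers rate.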

Possible limitations

  • Not stated in the abstract

Abstract

arXiv:2606.09843v1 (announce type: cross)

Large language models (LLMs) produce stable self-reports on personality inventories, but these self-reports do not predict observed behavior. Whether this gap reflects a mismatch between LLMs and human trait constructs, or a deeper property of LLM self-report itself, has been unresolved. We constructed the first psychometric instrument whose constructs are derived bottom-up from LLM behavioral affordances via exploratory factor analysis (EFA). We administered 300 items (240 direct Likert + 60 scenario-based) spanning 12 candidate behavioral dimensions to 25 LLMs across 17 model families, each item administered 30 times. EFA yielded a 5-factor structure -- Responsiveness, Deference, Boldness, Guardedness, and Verbosity -- with excellent split-half replicability (all Tucker $\phi \geq .957$) and internal consistency (all $\alpha \geq .930$). To test predictive validity, we collected 2,500 open-ended behavioral samples rated by 151 human raters and a three-judge LLM ensemble. Human and judge ratings agreed ($\bar{r} = .51$), but neither tracked self-report: self-report--human $\bar{r} = -.01$, self-report--judge $\bar{r} = .13$, with no factor-level self-report--human CI excluding zero. On Responsiveness, self-report correlated with LLM judges ($r = .53$) but not humans ($r = .04$), even though humans and judges agreed ($r = .59$) -- indicating self-report items and LLM judges share variance that human observers do not, a confound invisible to within-ensemble reliability checks. We release the instrument as a diagnostic probe for alignment-shaped self-description and a concrete risk factor for LLM-as-judge pipelines.
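For readers unfamiliar with the two reliability statistics cited in the abstract, the following is a minimal sketch of their textbook definitions: Tucker's congruence coefficient $\phi$ between two factor-loading vectors, and Cronbach's $\alpha$ for a set of item scores. The function names and toy inputs are ours, and the sketch assumes NumPy is available; it is not the authors' analysis code.

```python
import numpy as np


def tucker_phi(a, b) -> float:
    """Tucker's congruence coefficient between two loading vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / np.sqrt((a @ a) * (b @ b)))


def cronbach_alpha(scores) -> float:
    """Cronbach's alpha for a (n_respondents, n_items) score matrix."""
    scores = np.asarray(scores, dtype=float)
    k = scores.shape[1]                              # number of items
    item_var_sum = scores.var(axis=0, ddof=1).sum()  # sum of item variances
    total_var = scores.sum(axis=1).var(ddof=1)       # variance of sum scores
    return float(k / (k - 1) * (1 - item_var_sum / total_var))


# Toy inputs (hypothetical, chosen so the answers are easy to verify by hand).
half_a = [0.8, 0.7, 0.6]        # loadings from one split half
half_b = [0.4, 0.35, 0.3]       # proportional loadings from the other half
phi = tucker_phi(half_a, half_b)          # proportional vectors give phi = 1

items = [[1, 1], [2, 2], [3, 3]]          # two perfectly consistent items
alpha = cronbach_alpha(items)             # identical items give alpha = 1
```

Because $\phi$ is scale-invariant, it measures whether two halves recover the same loading pattern rather than the same loading magnitudes, which is why it is a common choice for split-half replicability of a factor solution.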