An LLM-Native Psychometric Instrument Reveals a Self-Report--Behavior Gap Across 25 Models

Juan Manuel Contreras

Published Jul 8, 2026 | Featured #3 | In the daily list Jul 9, 2026


Daily score: 72.4

Editorial review: 7.5

Relevance: 0.468

Freshness: 0.722

Why It Matters

What makes this one worth your time

Understanding why LLM self-reports diverge from LLM behavior matters for model evaluation and alignment, and for how LLMs are deployed in real-world applications.

A new psychometric tool exposes the disconnect between LLM self-reports and actual behavior.

Summary

The paper introduces a psychometric instrument for large language models (LLMs) whose dimensions are derived from LLM behavior rather than human psychology. Across 25 models, self-reports fail to predict rated or measured behavior, even on these LLM-native dimensions.

Key contributions

Development of a psychometric instrument based on LLM behavior rather than human psychology.
Exploratory factor analysis revealing five reliable factors that characterize LLM behavior.
Empirical evidence that the self-report-behavior gap persists even for constructs native to LLMs, where a human-mismatch explanation no longer applies.
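
The abstract reports the reliability and replicability of the five factors as Cronbach's α and Tucker's φ. As a rough illustration of what those two statistics measure, here is a minimal sketch using the textbook formulas. The function names and the toy data are hypothetical and are not taken from the paper, whose actual analysis code is not shown here.

```python
from math import sqrt
from statistics import variance


def cronbach_alpha(rows):
    """Cronbach's alpha for one scale.

    rows: list of respondents, each a list of item scores (same length).
    Textbook formula: k/(k-1) * (1 - sum(item variances) / variance(total)).
    """
    k = len(rows[0])
    item_vars = [variance([row[j] for row in rows]) for j in range(k)]
    total_var = variance([sum(row) for row in rows])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)


def tucker_phi(loadings_a, loadings_b):
    """Tucker's congruence coefficient between two factor-loading vectors."""
    num = sum(x * y for x, y in zip(loadings_a, loadings_b))
    den = sqrt(sum(x * x for x in loadings_a) * sum(y * y for y in loadings_b))
    return num / den


# Toy data (hypothetical): three respondents, three perfectly consistent items.
toy_scores = [[1, 1, 1], [2, 2, 2], [3, 3, 3]]
alpha = cronbach_alpha(toy_scores)

# Toy loadings (hypothetical): the second solution is a rescaled copy of the first.
phi = tucker_phi([0.7, 0.6, 0.8], [1.4, 1.2, 1.6])

print(alpha, phi)
```

Values of α near 1 mean the items of a scale move together. Values of φ near 1 mean two loading patterns, for example from two subsamples, have the same shape up to scale.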

Notable insights

The study identifies five distinct behavioral factors in LLMs that differ from traditional human psychological constructs.
The research highlights a source of variance that LLM self-reports share with LLM judges but not with human raters, suggesting that LLM-as-judge pipelines need validation beyond within-ensemble reliability checks.

Possible limitations

Not stated in the abstract.

Abstract

arXiv:2606.09843v3

Large language models (LLMs) give stable answers to personality questionnaires, yet these self-reports fail to predict how the models behave. Is this gap an artifact of forcing human trait categories onto LLMs, or something deeper about LLM self-report? To find out, we built the first psychometric instrument whose dimensions are derived from LLM behavior rather than human psychology. Administering 300 items (240 Likert + 60 scenario) to 25 LLMs across 17 model families, 30 times each, exploratory factor analysis revealed five reliable, replicable factors: Responsiveness, Deference, Boldness, Guardedness, and Verbosity (all Tucker $\phi \geq .957$, all $\alpha \geq .930$). We collected 2,500 open-ended samples and had them rated by 151 humans and a three-judge LLM ensemble. Humans and judges agreed ($\bar{r} = .51$), but self-report predicted neither the ratings nor objective text measures computed from them: the gap persists even for constructs native to LLMs, where a human-mismatch explanation no longer applies. The exception is Verbosity, whose self-report reaches 74% of the criterion-reliability ceiling against human ratings, but does not track raw output length. On Responsiveness, self-report tracked LLM judges ($r = .53$) but not humans ($r = .04$), even though humans and judges otherwise agreed ($r = .59$). This pattern formally rejects any single latent construct driving all three measurements ($p = .007$). Self-report items and LLM judges share a source of variance that human observers do not, and controlling for measurable surface features (length, formatting, enthusiasm markers) does not remove it. This confound is invisible to the within-ensemble reliability checks used to validate LLM judges, and it poses a concrete risk for the LLM-as-judge pipelines now central to model evaluation. We release the instrument as a diagnostic probe for alignment-shaped self-description.
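
To see why the three Responsiveness correlations are hard to reconcile with a single latent construct, one can run a back-of-the-envelope check. Under a one-factor model with standardized measures and uncorrelated residuals, each pairwise correlation is the product of the two loadings, so three correlations determine each squared loading, and an admissible solution needs every squared loading to lie between 0 and 1. The sketch below applies that textbook identity to the point estimates from the abstract. It is an assumption-laden illustration, not the paper's test: the abstract does not say which procedure produced $p = .007$, the check ignores sampling error, and it treats $r = .59$ as the human-judge correlation for Responsiveness. The function name is hypothetical.

```python
def implied_squared_loadings(r_sj, r_sh, r_jh):
    """Squared loadings implied by a one-factor model for three measures.

    r_sj: self-report vs. LLM judges
    r_sh: self-report vs. human raters
    r_jh: LLM judges vs. human raters
    Uses r_xy = lambda_x * lambda_y, so lambda_x^2 = r_xy * r_xz / r_yz.
    """
    return {
        "self_report": r_sj * r_sh / r_jh,
        "llm_judge": r_sj * r_jh / r_sh,
        "human": r_sh * r_jh / r_sj,
    }


# Point estimates quoted in the abstract (Responsiveness).
loadings = implied_squared_loadings(r_sj=0.53, r_sh=0.04, r_jh=0.59)

# A squared loading above 1 implies negative residual variance (a Heywood case),
# which no single-factor model can produce.
inadmissible = [name for name, value in loadings.items() if value > 1.0]

print(loadings)
print(inadmissible)
```

With these inputs the implied squared loading for the LLM judges is far above 1, so a single shared construct cannot reproduce the observed pattern at the level of point estimates. That matches the direction of the abstract's conclusion, though only a formal test such as the one the authors report can account for sampling uncertainty.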