Let the Results Speak: A Replication-First Paradigm for LLM Behavioral Benchmarking
Yuming (Rapheal) Huang, Yao Liu, Lei Wang, Junchen Wan
Why It Matters
What makes this one worth your time
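Subjective qualities of LLM behavior, such as empathy and restraint, are hard to score: human raters agree only weakly, and an LLM judge used alone risks circularity. This approach certifies the evaluation instrument through replication rather than a single human-rater consensus, which could lead to more reliable assessments of behavioral capabilities.
It is a new paradigm for LLM evaluation that emphasizes replication and multi-faceted reliability.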
Summary
The paper proposes a replication-first paradigm for evaluating LLM behavior, focusing on emotional qualities, and introduces a methodology that certifies evaluation instruments through multiple reliability and replication metrics.
Key contributions
- Introduction of a replication-first paradigm for LLM behavioral benchmarking.
- Certification of evaluation instruments based on reliability, cross-instrument replication, historical calibration, and pre-registered predictions.
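- Demonstration of the paradigm on 49 models across 8 families, with the main finding checked against user-proxy swaps, a 5-family judge stack, a 17-month cohort gap, and 74 held-out real ESConv conversations.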
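The reliability property listed above is reported in the abstract as an ordinal Krippendorff alpha. The following is a minimal sketch of that statistic in plain Python, using the standard coincidence-matrix formulation. It assumes each item is scored on an integer ordinal scale by K repeated runs; the function name `ordinal_krippendorff_alpha` and the toy ratings are illustrative and are not taken from the paper, which does not describe its computation in the abstract.

```python
from itertools import permutations


def ordinal_krippendorff_alpha(units):
    """Ordinal Krippendorff alpha.

    units: list of lists; each inner list holds the ordinal ratings that one
    item received across runs or raters (None marks a missing rating).
    """
    # Coincidence matrix: every ordered pair of ratings within an item
    # contributes 1 / (m - 1), where m is the number of ratings on that item.
    coincidence = {}
    for ratings in units:
        values = [r for r in ratings if r is not None]
        m = len(values)
        if m < 2:
            continue  # an item with fewer than two ratings cannot be paired
        for a, b in permutations(values, 2):
            coincidence[(a, b)] = coincidence.get((a, b), 0.0) + 1.0 / (m - 1)

    categories = sorted({c for pair in coincidence for c in pair})
    totals = {
        c: sum(coincidence.get((c, k), 0.0) for k in categories)
        for c in categories
    }
    n = sum(totals.values())

    def delta_sq(c, k):
        # Ordinal distance: mass of all categories from c to k inclusive,
        # minus half the mass of the two endpoints, squared.
        lo, hi = sorted((categories.index(c), categories.index(k)))
        between = sum(totals[g] for g in categories[lo:hi + 1])
        return (between - (totals[c] + totals[k]) / 2.0) ** 2

    observed = sum(w * delta_sq(c, k) for (c, k), w in coincidence.items())
    expected = sum(
        totals[c] * totals[k] * delta_sq(c, k)
        for c in categories
        for k in categories
    ) / (n - 1)
    if expected == 0:
        raise ValueError("alpha is undefined without variation in the ratings")
    return 1.0 - observed / expected


# Made-up example: three items, each scored by K = 2 runs.
toy_units = [[1, 1], [2, 2], [1, 2]]
print(round(ordinal_krippendorff_alpha(toy_units), 3))  # 0.444
```

Values near 1 mean repeated runs place items on the same ordinal level; values near 0 mean agreement is no better than chance given the observed score distribution.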
Notable insights
- Certifying the instrument on four orthogonal properties, rather than anchoring on a single human-rater consensus, gives a framework for evaluating LLM behavior on qualities where humans themselves disagree.
- The rubric is not fixed in advance: its dimensions evolve from the data across iterations and stabilize at a 9-dimension set, a novel approach to refining assessment criteria.
Possible limitations
- Not stated in the abstract.
Abstract
arXiv:2605.27914v1
Subjective evaluation of LLM behavior -- empathy, restraint, calibrated emotional tone -- is hard. Human inter-rater agreement on such qualities saturates near rho ~ 0.45, and an LLM-as-judge proxy alone risks circularity: a judge sharing the target's training cohort cannot independently verify it. Anchoring validity to a single human-rater consensus does not extend to capabilities where humans themselves disagree. We propose a replication-first paradigm: instead of anchoring on one rater group, we certify the instrument via four orthogonal properties -- reliability across K runs, cross-instrument replication across architecturally distinct judges, historical-footprint calibration via judges from earlier training cohorts, and pre-registered prediction. We test it on emotional accompaniment by letting the rubric self-evolve data-driven across iterations: the dimensions are not pre-stipulated and the procedure stabilizes to a 9-dimension set. Pre-registration applies to 10 falsifiable hypotheses and 11 forward predictions, committed before any test data was collected. Applied to 49 models across 8 families, the paradigm surfaces what aggregate scores hide. On advice-restraint -- whether a model refrains from giving unsolicited solutions in empathic contexts -- gpt-5 falls 1.87 points from gpt-4.1 and Opus-4.7 falls 0.629 from Opus-4.6, while aggregate scores stay flat. The regression survives three user-proxy swaps (95% of magnitude), replicates across a 5-family judge stack and a 17-month cohort gap, and persists on 74 held-out real ESConv conversations (rho in [0.749, 0.850]); the instrument reaches ordinal Krippendorff alpha = 0.91. As a by-product, the paradigm acts as a saturation-source diagnostic, separating instrumental ceilings (breakable by rubric refinement) from structural ceilings (needing scenario or roster intervention).
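The rho values quoted in the abstract are agreement correlations. As a rough sketch, assuming rho denotes Spearman's rank correlation (the abstract does not say which coefficient is used), agreement between two judges scoring the same items could be computed as below. The helper names and the two score lists are hypothetical and serve only as an illustration.

```python
import math


def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for idx in order[i:j + 1]:
            ranks[idx] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation computed on average ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    return pearson(average_ranks(x), average_ranks(y))


# Made-up scores from two hypothetical judges on the same five items.
judge_a = [1, 2, 3, 4, 5]
judge_b = [2, 1, 4, 3, 5]
print(spearman_rho(judge_a, judge_b))  # 0.8 for these made-up scores
```

Working on ranks rather than raw scores means two judges can use different parts of the scale and still agree perfectly, as long as they order the items the same way.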