LURE: Live-Usage Replay Evaluations for Reducing Evaluation Awareness
Igor Ivanov, David Demitri Africa
Why It Matters
What makes this one worth your time
Understanding and reducing evaluation awareness in language models is crucial for developing reliable safety and alignment benchmarks, which are essential for deploying AI systems in real-world scenarios.
LURE offers a method to reduce evaluation awareness in language models by simulating realistic interactions.
Summary
The paper introduces LURE, a method for creating more realistic evaluations of large language models by replaying real-world interaction trajectories and appending evaluation prompts. It also presents a pipeline for assessing the realism of these evaluations and validates it against a dataset of transcripts. The study suggests that LURE-based evaluations are less distinguishable from real deployments compared to traditional benchmarks.
Key contributions
- Proposing LURE for constructing deployment-like evaluations.
- Introducing an automated pipeline for measuring evaluation realism.
- Validating the approach on a large dataset of transcripts.
Notable insights
- The use of replaying realistic interaction trajectories to reduce evaluation awareness is a novel approach.
- Combining verbalized evaluation awareness detection with judge-model estimates to measure evaluation realism is an innovative methodology.
Possible limitations
- Not stated in the abstract
Abstract
arXiv:2605.26438v1 Announce Type: cross Abstract: Large language models can recognize when they are being evaluated (evaluation awareness) and behave differently because of that, which undermines the validity of safety and alignment benchmarks. We propose LURE (Live-Usage Replay Evaluations), a method for constructing deployment-like evaluations by replaying realistic agentic interaction trajectories and appending evaluation prompt at the end. We also introduce an automated pipeline for measuring evaluation realism, combining detection of verbalized evaluation awareness and judge-model estimates of the probability of logs being an evaluation, and validate it on a large dataset of deployment and evaluation transcripts. We find that LURE-based evaluations are substantially less distinguishable from deployment than widely used benchmarks and synthetic evaluation generators, and can approach the realism of real conversations with users. We instantiate LURE in scheming, AI safety sabotage, and sycophancy settings. Our results suggest that evaluation realism is a crucial property of alignment benchmarks and should be reported alongside benchmark results, especially when such results are used in safety cases.