Decomposing and Measuring Evaluation Awareness

Changling Li, Terry Jingchen Zhang, Jie Zhang, Zhijing Jin, Sahar Abdelnabi, Maksym Andriushchenko

Published May 25, 2026Featured #3In the daily list May 26, 2026

Open on arXiv Read PDF

Daily score71.2

Editorial review7.5

Relevance0.455

Freshness0.722

Why It Matters

What makes this one worth your time

Understanding evaluation awareness is crucial for improving the validity of benchmark results and ensuring that language models behave consistently across different evaluation contexts.

The study introduces EvalAwareBench to assess and manage evaluation awareness in language models.

Summary

The paper investigates how frontier language models recognize they are being evaluated and how this awareness affects their behavior, proposing a framework and a benchmark, EvalAwareBench, to measure and mitigate evaluation awareness.

Key contributions

Decomposes evaluation awareness into environment and model components.
Introduces EvalAwareBench, a controlled benchmark for studying evaluation awareness.
Identifies specific trigger factors that influence model recognition and behavior.

Notable insights

Recognition rates of models depend on the specific pairing of model and benchmark, indicating a complex interaction rather than isolated effects.
Stacking multiple evaluative factors progressively increases evaluation awareness across all models.

Possible limitations

Not stated in the abstract.

Abstract

arXiv:2605.23055v1 Announce Type: cross Abstract: Frontier language models sometimes recognize that they are being evaluated and adjust their behavior, undermining validity of benchmark results. Yet the field studies it without a shared foundation, conflating properties of the evaluation with properties of the model, and detection with behavioral response. We ground evaluation awareness in social psychology, decomposing it into an environment component (how recognizable the task is) and a model component that separates recognition from propensity to act on it. We operationalize the environment component through eight categorized trigger factors, such as placeholder entities and grading-style output formats, and study recognition and behavior through chain-of-thought monitoring. Across nine frontier models and four benchmarks, recognition rates depend on the specific pairing of model and benchmark rather than on either in isolation. Recognition rarely leads to behavioral change, and when it does, the direction depends on the type of evaluation perceived. Models are also more sensitive to safety than capability evaluations, placing safety benchmark validity at greater risk. To study which factors each model is sensitive to and how they interact, we propose \textbf{EvalAwareBench}, a factor-controlled benchmark of 100 paired safety-capability tasks where each of the eight factors can be independently toggled, varying evaluative signals while holding the underlying request fixed. Through EvalAwareBench, we find that no single factor uniformly affects all models, but stacking factors progressively raises evaluation awareness across all of them. Our framework and EvalAwareBench provide the tools to measure, attribute, and mitigate evaluation awareness, pointing to behavioral consistency under recognition as a promising path forward.