When LLMs Benchmark Themselves: Deconstructing Self-Bias in Automated Evaluation
Wenda Xu, Sweta Agrawal, Vilém Zouhar, Markus Freitag, Daniel Deutsch
Why It Matters
What makes this one worth your time
As automated benchmarking becomes more prevalent, a benchmark that favors the model that created it can distort model rankings, so understanding and mitigating self-bias is a precondition for fair and reliable LLM evaluation.
The paper exposes self-bias in LLM-generated benchmarks and shows that increasing source text diversity mitigates it only partially.
Summary
The paper investigates self-bias in automated benchmark creation (LLM-as-a-benchmark), where a large language model (LLM) both generates the test inputs and evaluates the outputs. It identifies two additive sources of self-bias, test input generation (LLM-as-a-testset) and output evaluation (LLM-as-an-evaluator), and shows that their combination amplifies the effect enough for each model to rank itself first, overriding the peer-consensus ordering. Machine translation is the primary testbed, and the phenomenon is confirmed for open-ended generation on the Chatbot Arena task. Increasing source text diversity, measured with the paper's proposed diversity metric, partially mitigates the bias.
Key contributions
- Identification of self-bias in LLM-generated benchmarks.
- Analysis of two additive sources of self-bias: test input generation and output evaluation.
- Proposal of a diversity metric; increasing source text diversity as measured by it partially mitigates self-bias.
Notable insights
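- LLM-generated benchmarks systematically favor the model that created them: even with explicit diversity controls, each model's implicit stylistic tendencies produce homogeneous, model-specific outputs that inflate its own scores.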
- Increasing source text diversity can partially mitigate self-bias in LLM evaluations.
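The abstract does not define the proposed diversity metric, so the sketch below uses a distinct n-gram ratio as a stand-in to illustrate what "source text diversity" could mean in practice. The function name `distinct_n`, the whitespace tokenization, and the sample sentences are all assumptions for illustration, not the paper's method.

```python
def distinct_n(texts, n=2):
    """Hypothetical diversity proxy: unique n-grams / total n-grams.

    This is NOT the paper's metric (the abstract does not define it).
    Values near 1.0 suggest varied text; lower values suggest repetition.
    """
    ngrams = []
    for text in texts:
        tokens = text.lower().split()  # naive whitespace tokenization (assumption)
        ngrams.extend(
            tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)
        )
    if not ngrams:
        return 0.0
    return len(set(ngrams)) / len(ngrams)


# Made-up example inputs: two near-duplicate sources vs. two unrelated ones.
homogeneous = ["the cat sat on the mat", "the cat sat on the rug"]
diverse = ["the cat sat on the mat", "quarterly revenue rose despite weak demand"]

print(distinct_n(homogeneous), distinct_n(diverse))
```

A test set whose sources score low on a proxy like this would be a candidate for the homogeneous, model-specific inputs the paper describes, though the paper's own metric may behave differently.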
Possible limitations
- Not stated in the abstract
Abstract
arXiv:2509.26600v2
As LLMs rapidly saturate existing benchmarks, automated benchmark creation using LLMs (LLM-as-a-benchmark) -- where a model generates test inputs (LLM-as-a-testset) and evaluates outputs (LLM-as-an-evaluator) -- has gained traction as a cheap alternative to human curation. We show that this paradigm has a fundamental problem: LLM-generated benchmarks systematically favor the model that created them. Using machine translation as our primary testbed, we find that self-bias arises from two additive sources, LLM-as-a-testset and LLM-as-an-evaluator, and their combination amplifies the effect. Crucially, even when test data is generated with explicit diversity controls, each model's implicit stylistic tendencies produce homogeneous, model-specific outputs that inflate its own scores. Increasing source text diversity, using our proposed diversity metric, partially mitigates this bias. Self-bias is strong enough to cause each model to rank itself first, overriding the peer-consensus ordering. We confirm that the phenomenon extends to open-ended generation on the Chatbot Arena task.
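One way to picture the ranking claim is with a score matrix, where each row is a benchmark built by one model and each column is a model being scored. The sketch below is a minimal illustration under that assumption: the definition of self-bias used here (own-benchmark score minus the mean score on peers' benchmarks), the helper names, and the toy numbers are all hypothetical and are not taken from the paper.

```python
from statistics import mean


def self_bias(scores):
    """scores[creator][candidate] = score `candidate` gets on `creator`'s benchmark.

    Assumed definition (not the paper's): score on one's own benchmark
    minus the mean score received on benchmarks created by peers.
    """
    bias = {}
    for model in scores:
        own = scores[model][model]
        peers = [scores[creator][model] for creator in scores if creator != model]
        bias[model] = own - mean(peers)
    return bias


def peer_consensus(scores):
    """Average each candidate's score over benchmarks it did not create."""
    return {
        cand: mean([scores[creator][cand] for creator in scores if creator != cand])
        for cand in scores
    }


def ranking(row):
    """Model names sorted from highest to lowest score."""
    return sorted(row, key=row.get, reverse=True)


# Made-up scores for three hypothetical models A, B, C.
scores = {
    "A": {"A": 0.90, "B": 0.80, "C": 0.70},
    "B": {"A": 0.82, "B": 0.88, "C": 0.72},
    "C": {"A": 0.84, "B": 0.78, "C": 0.86},
}

for creator, row in scores.items():
    print(creator, "ranks:", ranking(row))
print("peer consensus:", ranking(peer_consensus(scores)))
print("self-bias:", self_bias(scores))
```

In this toy matrix every model tops its own benchmark even though the peer-consensus ordering places only one of them first, which mirrors the pattern the abstract reports.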