When LLMs Benchmark Themselves: Deconstructing Self-Bias in Automated Evaluation

Wenda Xu, Sweta Agrawal, Vilém Zouhar, Markus Freitag, Daniel Deutsch

Published May 27, 2026 | Featured #9 | In the daily list May 28, 2026


Daily score: 59.1

Editorial review: 6.8

Relevance: 0.482

Freshness: 0.722

Why It Matters

What makes this one worth your time

Understanding and mitigating self-bias in LLM evaluations is crucial for developing fair and reliable AI systems, especially as automated benchmarking becomes more prevalent.

The paper reveals and addresses self-bias in LLM-generated benchmarks.

Summary

The paper investigates self-bias in automated benchmark creation, where a large language model (LLM) both generates the test inputs and evaluates the outputs. It identifies two additive sources of self-bias, test-input generation (LLM-as-a-testset) and output evaluation (LLM-as-an-evaluator), and shows that their combination is strong enough for each model to rank itself first, overriding the peer-consensus ordering. Machine translation is the primary testbed, and the effect is confirmed on open-ended generation with the Chatbot Arena task. Increasing source text diversity, measured with a proposed diversity metric, partially mitigates the bias.

Key contributions

Identification of self-bias in LLM-generated benchmarks.
Analysis of self-bias sources in both test input generation and output evaluation.
Proposal of a diversity metric for source text; increasing diversity under this metric partially mitigates self-bias.

Notable insights

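LLM-generated benchmarks systematically favor the model that created them: even under explicit diversity controls, each model's implicit stylistic tendencies produce homogeneous, model-specific test data that inflates its own scores.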
Increasing source text diversity can partially mitigate self-bias in LLM evaluations.

Possible limitations

Not stated in the abstract

Abstract

arXiv:2509.26600v2

As LLMs rapidly saturate existing benchmarks, automated benchmark creation using LLMs (LLM-as-a-benchmark) -- where a model generates test inputs (LLM-as-a-testset) and evaluates outputs (LLM-as-an-evaluator) -- has gained traction as a cheap alternative to human curation. We show that this paradigm has a fundamental problem: LLM-generated benchmarks systematically favor the model that created them. Using machine translation as our primary testbed, we find that self-bias arises from two additive sources, LLM-as-a-testset and LLM-as-an-evaluator, and their combination amplifies the effect. Crucially, even when test data is generated with explicit diversity controls, each model's implicit stylistic tendencies produce homogeneous, model-specific outputs that inflate its own scores. Increasing source text diversity, using our proposed diversity metric, partially mitigates this bias. Self-bias is strong enough to cause each model to rank itself first, overriding the peer-consensus ordering. We confirm that the phenomenon extends to open-ended generation on the Chatbot Arena task.
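To make the "each model ranks itself first" claim concrete, the following minimal Python sketch shows one way such a check could be set up. It is an illustration only, not the authors' method: the model names, the score values, and the helper functions (`ranking`, `peer_consensus`, `self_bias_gap`) are all hypothetical, and the abstract does not specify how the paper computes the peer-consensus ordering. Here the consensus is assumed to be each model's average score on benchmarks created by the other models.

```python
from statistics import mean

# Hypothetical data, invented for illustration: scores[creator][candidate]
# is the score `candidate` receives on a benchmark that `creator` both
# generated (LLM-as-a-testset) and judged (LLM-as-an-evaluator).
scores = {
    "model_a": {"model_a": 0.82, "model_b": 0.78, "model_c": 0.74},
    "model_b": {"model_a": 0.77, "model_b": 0.81, "model_c": 0.73},
    "model_c": {"model_a": 0.79, "model_b": 0.76, "model_c": 0.80},
}


def ranking(row):
    """Return candidate names sorted from highest to lowest score."""
    return sorted(row, key=row.get, reverse=True)


def peer_consensus(scores):
    """Average each candidate's score over benchmarks it did not create.

    Assumption: this is one plausible reading of "peer consensus",
    not necessarily the definition used in the paper.
    """
    models = list(scores)
    return {
        cand: mean(scores[creator][cand] for creator in models if creator != cand)
        for cand in models
    }


def self_bias_gap(scores):
    """Score on a model's own benchmark minus its peer-benchmark average."""
    peers = peer_consensus(scores)
    return {m: scores[m][m] - peers[m] for m in scores}


consensus_order = ranking(peer_consensus(scores))
self_ranked_first = {m: ranking(scores[m])[0] == m for m in scores}
gaps = self_bias_gap(scores)

print("peer-consensus order:", consensus_order)
print("ranks itself first:", self_ranked_first)
print("self-bias gap:", {m: round(g, 3) for m, g in gaps.items()})
```

In this toy data every model tops its own benchmark even though the peer-consensus order places only one of them first, which is the pattern the abstract describes. The gap computed here is a rough indicator: it does not separate the testset and evaluator contributions, and it ignores differences in benchmark difficulty, both of which a real analysis would need to control for.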