CoEval: Ranking Language Models for Custom Tasks Without Labeled Data or Trustworthy Benchmarks
Alexander Apartsin, Yehudit Aperstein
Why It Matters
What makes this one worth your time
CoEval lets researchers and engineers rank language models for a specific application without labeled data and without relying on benchmarks that may have leaked into pretraining, potentially saving time and resources.
Summary
The paper introduces CoEval, a framework for ranking language models on a specific task without relying on labeled data or potentially contaminated benchmarks. It generates new, attribute-controlled benchmarks and uses a cross-family judge ensemble to rank models. Where ground truth exists, the resulting rankings are validated against it.
Key contributions
- Development of a label-free, contamination-free framework for ranking language models.
- Introduction of a cross-family judge ensemble for model evaluation.
- Demonstration of the framework's cost-effectiveness and applicability across domains.
Notable insights
- The framework generates fresh benchmarks on each run to avoid contamination.
- A diverse judge panel is more reliable than a single judge, which can be anti-correlated with ground truth.
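The point about unreliable judges can be illustrated with a small sketch. It assumes a leave-one-out Pearson correlation against the mean of the other judges, clipped at zero, as the consensus measure. That choice is an assumption made for illustration, not the paper's stated formula, and `pearson`, `judge_weights`, and the `panel` scores are hypothetical names and data.

```python
from statistics import mean


def pearson(x, y):
    """Pearson correlation; returns 0.0 when either input is constant."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / (var_x * var_y) ** 0.5


def judge_weights(scores):
    """scores[j][i] is judge j's score for item i; returns normalized weights.

    Assumed rule (illustrative only): a judge's weight is its correlation
    with the mean score of the other judges, clipped at zero.
    """
    raw = []
    for j, own in enumerate(scores):
        # Leave the judge out of its own consensus to avoid self-agreement.
        others = [row for k, row in enumerate(scores) if k != j]
        consensus = [mean(col) for col in zip(*others)]
        raw.append(max(pearson(own, consensus), 0.0))
    total = sum(raw)
    if total == 0:
        # No judge agrees with the panel: fall back to uniform weights.
        return [1.0 / len(scores)] * len(scores)
    return [w / total for w in raw]


# Made-up scores: three judges roughly agree, the fourth is inverted.
panel = [
    [1, 2, 3, 4, 5, 6],
    [1, 3, 2, 4, 6, 5],
    [2, 1, 3, 5, 4, 6],
    [6, 5, 4, 3, 2, 1],  # inverted judge
]
weights = judge_weights(panel)
```

With these made-up scores the inverted judge receives weight 0 and the remaining weight is shared among the three judges that agree with one another. No ground-truth labels are involved at any step.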
Possible limitations
- Not stated in the abstract
Abstract
arXiv:2606.03650v2
Selecting a pretrained language model, or evaluating a fine-tuned one, for a specific application is a high-value decision, yet the public benchmarks used to make it are poorly suited: a generic benchmark need not reflect a particular sub-domain or sub-task, and its scores are suspect when its items have leaked into pretraining and are recalled rather than solved. We present CoEval, an open framework that supplies a trustworthy, task-specific signal through ensemble self-evaluation: from a task or domain description, a pool of models rotates through all three roles, teacher, student, and judge, to generate a fresh, contamination-free benchmark, answer it, and score one another, with no human labels or raters. Because every model also answers as a student, the responses are the data that weight each question by its discriminative power and each judge by its consensus with the panel. Where ground truth exists, CoEval recovers the true ranking and tracks objective correctness at ρ=0.86, and the weighting recovers the gold ranking of thirteen models at Spearman 0.95. Reliability comes from panel composition, not size: this label-free weighting zeroes out broken judges and down-weights saturated questions, so neither distorts the ranking. Generated items show zero verbatim overlap with five public benchmarks, the panel cancels verbosity bias and precludes same-family self-preference, and rankings are domain-specific: three different models top four de-novo domains, so a generic leaderboard misdirects most practitioners. The same pipeline reruns on each model release, giving any team a contamination-free leaderboard for its application.
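The abstract says questions are weighted by discriminative power and that the resulting ranking is compared with a gold ranking by Spearman correlation, but it does not give the formulas. The sketch below is one plausible reading under stated assumptions: a question's weight is the spread (population standard deviation) of model scores on it, so a saturated question that every model answers equally well gets weight 0. The functions `question_weights`, `weighted_ranking`, and `spearman`, along with the model names and scores, are hypothetical.

```python
from statistics import pstdev


def question_weights(scores):
    """scores[m][q] is model m's score on question q.

    Assumed rule (illustrative only): weight each question by the spread
    of scores across models, normalized to sum to 1.
    """
    n_questions = len(scores[0])
    spreads = [pstdev([row[q] for row in scores]) for q in range(n_questions)]
    total = sum(spreads)
    if total == 0:
        # Every question is saturated: fall back to uniform weights.
        return [1.0 / n_questions] * n_questions
    return [s / total for s in spreads]


def weighted_ranking(names, scores):
    """Rank models, best first, by their weighted mean score."""
    w = question_weights(scores)
    totals = {
        name: sum(wi * si for wi, si in zip(w, row))
        for name, row in zip(names, scores)
    }
    return sorted(names, key=lambda name: totals[name], reverse=True)


def spearman(rank_a, rank_b):
    """Spearman correlation between two tie-free rankings of the same items."""
    n = len(rank_a)
    pos_b = {item: i for i, item in enumerate(rank_b)}
    d2 = sum((i - pos_b[item]) ** 2 for i, item in enumerate(rank_a))
    return 1 - 6 * d2 / (n * (n * n - 1))


# Made-up scores: questions 0 and 1 are saturated, 2 and 3 discriminate.
names = ["model_a", "model_b", "model_c"]
scores = [
    [1.0, 1.0, 0.9, 0.2],
    [1.0, 1.0, 0.5, 0.8],
    [1.0, 1.0, 0.1, 0.1],
]
q_weights = question_weights(scores)
ranking = weighted_ranking(names, scores)
gold = ["model_b", "model_a", "model_c"]  # hypothetical gold ranking
agreement = spearman(ranking, gold)
```

In this toy case the two saturated questions contribute nothing to the ranking, which is the behaviour the abstract describes for down-weighted questions. The closed-form Spearman expression used here is valid only when neither ranking contains ties.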