Beyond Fixed Benchmarks and Worst-Case Attacks: Dynamic Boundary Evaluation for Language Models
Haoxiang Wang, Da Yu, Huishuai Zhang
Why It Matters
Dynamic Boundary Evaluation (DBE) assesses a language model at its performance boundary, the items where its per-prompt pass probability is near 0.5, instead of on a fixed set of items.
Because fixed benchmarks produce ceiling and floor effects that mask capability gaps, evaluating at the boundary could give more informative results and point to specific gaps that matter for model development.
Summary
The paper introduces Dynamic Boundary Evaluation (DBE), a novel method for evaluating large language models by identifying their performance boundaries and placing them on a globally comparable difficulty scale. It provides a calibrated item bank, a search algorithm for boundary items, and an adaptive evaluation protocol, covering safety, capability, and truthfulness categories.
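A short worked note may help explain why items near a pass probability of 0.5 are the informative ones. The first line is a standard fact about a pass/fail outcome. The second line assumes a one-parameter logistic (Rasch) response model with ability $\theta$ and item difficulty $b$; that model and those symbols are an illustrative assumption, not something the abstract states about DBE.

```latex
% Pass/fail outcome X with pass probability p:
\[
  \operatorname{Var}(X) = p(1-p), \qquad
  \max_{p \in [0,1]} p(1-p) = \tfrac{1}{4} \quad \text{at } p = \tfrac{1}{2}.
\]
% Assumed Rasch model (illustrative only): p = \sigma(\theta - b).
\[
  p(\theta) = \frac{1}{1 + e^{-(\theta - b)}}, \qquad
  I(\theta) = p(\theta)\bigl(1 - p(\theta)\bigr),
\]
% so the Fisher information about \theta peaks where b = \theta, i.e. p = 1/2.
```

Under this assumption, an item that a model always passes or always fails ($p$ near 1 or 0) contributes almost no information about where the model sits on the scale, which matches the ceiling and floor effects the paper attributes to fixed benchmarks.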
Key contributions
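- Dynamic Boundary Evaluation (DBE), which locates each model's performance boundary and places it on a globally comparable difficulty scale.
- A calibrated item bank covering safety, capability, and truthfulness, with per-item difficulty labels validated across 9 reference LLMs.
- Skill-Guided Boundary Search (SGBS), a search algorithm that finds boundary items for a target LLM using only API-level query access.
- An evaluation protocol that places a new LLM on a unified ability scale and grows the evaluation set adaptively when the model falls outside the bank's coverage.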
Notable insights
- Evaluating a model at its performance boundary, where the per-prompt pass probability is near 0.5, avoids the ceiling and floor effects that fixed benchmarks produce.
- Skill-Guided Boundary Search needs only API-level query access, so it can target models whose weights and internals are not available.
Possible limitations
- Not stated in the abstract
Abstract
arXiv:2605.06213v2
Evaluating large language models (LLMs) today rests on fixed benchmarks that apply the same set of items to any model, producing ceiling and floor effects that mask capability gaps. We argue that the most informative evaluation signal lies at the boundary, where the per-prompt pass probability is near $0.5$ under random-sampling decoding, and propose Dynamic Boundary Evaluation (DBE), which actively locates each model's boundary and places it on a globally comparable difficulty scale. DBE delivers three artifacts: (i) a calibrated item bank covering safety, capability, and truthfulness, with per-item difficulty labels validated across $9$ reference LLMs; (ii) Skill-Guided Boundary Search (SGBS), a search algorithm that finds boundary items for a given target LLM using only API-level query access; and (iii) an evaluation protocol that places a new LLM on a unified ability scale and grows the evaluation set adaptively when the target falls outside the bank's coverage. We instantiate DBE on four categories spanning safety (harmful request refusal and over-refusal), capability (constrained instruction following), and truthfulness (multi-turn sycophancy resistance). The resulting evaluation covers a broader model spectrum without saturation while remaining compatible with existing datasets.
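The boundary definition in the abstract can be illustrated with a minimal sketch: estimate each item's pass probability from repeated sampled responses, then keep the items whose estimate is close to 0.5. This is not the paper's SGBS algorithm, which the abstract does not describe in detail. The function names `pass_rate` and `boundary_items`, the `tolerance` parameter, and the toy outcomes are all hypothetical and chosen only for illustration.

```python
from typing import Dict, List, Sequence


def pass_rate(outcomes: Sequence[bool]) -> float:
    """Empirical pass probability: fraction of sampled responses that passed."""
    if not outcomes:
        raise ValueError("need at least one sampled outcome")
    return sum(outcomes) / len(outcomes)


def boundary_items(rates: Dict[str, float], tolerance: float = 0.15) -> List[str]:
    """Return items whose pass rate is within `tolerance` of 0.5, closest first.

    `tolerance` is an illustrative choice, not a value taken from the paper.
    """
    near = [item for item, p in rates.items() if abs(p - 0.5) <= tolerance]
    return sorted(near, key=lambda item: abs(rates[item] - 0.5))


# Toy data: pass/fail results of 10 sampled responses per item (made up).
outcomes = {
    "item_easy": [True] * 10,           # always passes: ceiling effect
    "item_mid": [True, False] * 5,      # passes half the time: boundary item
    "item_hard": [False] * 9 + [True],  # almost never passes: floor effect
}

rates = {item: pass_rate(results) for item, results in outcomes.items()}
selected = boundary_items(rates)
print(selected)
```

In this toy data only `item_mid` falls inside the tolerance band. In practice the estimate depends on how many responses are sampled per item, so a wider band or more samples may be needed before an item is treated as a boundary item.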