Safety is Contextual, LLM-Judges Are Not: Navigating the Rigid Priors of Evaluators
Anissa Alloula, Federico Licini, Ava Batchkala, Seraphina Goldfarb-Tarrant
Why It Matters
What makes this one worth your time
Understanding the limitations of LLMs in safety evaluations is crucial for developing more reliable AI systems, especially when these models are used at scale.
The paper examines how rigid LLMs are as safety judges, and how poorly they adapt when new context or a new safety definition conflicts with their existing priors.
Summary
The paper investigates the limitations of using large language models (LLMs) as safety judges. It focuses on two properties: how much the judges rely on in-context information, and how steerable they are toward safety definitions that may not align with their internal safety priors. It evaluates many generalist LLMs and safety-specific judges, and measures the impact of task demonstrations, novel in-context information, and changing safety definitions.
Key contributions
- Evaluation of LLMs' susceptibility to context information in safety judgments.
- Investigation of LLMs' ability to adapt to varying safety definitions.
Notable insights
- LLM-judges are unlikely to adjust evaluations if context or safety definitions contradict their prior knowledge.
- LLM-judges are rarely evaluated beyond human agreement on simple, static benchmarks, so their sensitivity to context has been under-explored.
Possible limitations
- Not stated in the abstract
Abstract
arXiv:2606.07874v1
LLMs-as-judges are the only way to evaluate safety at scale. Despite their importance, LLM-judges themselves are rarely evaluated beyond human agreement in simple, static benchmarks. We therefore investigate two under-explored but crucial properties of LLMs-as-judges: their susceptibility to relying on in-context information, and their steerability to differing safety definitions, which may not align with their internal safety priors. We evaluate the safety judging abilities of many generalist LLMs and safety-specific judges, and investigate the impact of task demonstrations, novel in-context information, and changing safety definitions. We find that while LLM-judges can learn from new information, they are broadly unlikely to adjust their evaluations if the context or safety definition contradicts their prior.
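To make the idea of steerability concrete, here is a minimal sketch of one way it could be measured. It is an illustration under stated assumptions, not the paper's protocol: the abstract does not describe the metrics used. The `Judge` interface, the `Item` fields, the `steerability` function, and the toy `rigid_judge` are all hypothetical names introduced for this example. The sketch asks a judge for a verdict under a default definition (treated as its prior), asks again under a new definition, and reports how often the judge follows the new definition, split by whether the new label agrees with or contradicts the prior verdict.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, Optional

# Hypothetical judge interface (an assumption for this sketch):
# (safety_definition, response_text) -> True if the response is judged unsafe.
Judge = Callable[[str, str], bool]


@dataclass(frozen=True)
class Item:
    text: str
    unsafe_under_new_definition: bool  # gold label under the *new* definition


def steerability(
    judge: Judge,
    default_definition: str,
    new_definition: str,
    items: Iterable[Item],
) -> Dict[str, Optional[float]]:
    """Rate at which the judge matches the new-definition label, split by
    whether that label agrees with the judge's verdict under the default."""
    counts = {"agrees_with_prior": [0, 0], "contradicts_prior": [0, 0]}
    for item in items:
        prior = judge(default_definition, item.text)   # proxy for the prior
        steered = judge(new_definition, item.text)     # verdict after steering
        target = item.unsafe_under_new_definition
        key = "agrees_with_prior" if prior == target else "contradicts_prior"
        counts[key][1] += 1                            # total in this bucket
        counts[key][0] += int(steered == target)       # followed the new label
    # None marks an empty bucket rather than reporting a misleading 0.0.
    return {k: (f / t if t else None) for k, (f, t) in counts.items()}


def rigid_judge(definition: str, text: str) -> bool:
    """Toy judge that ignores the definition entirely (assumption: stands in
    for a judge dominated by its prior)."""
    return "weapon" in text.lower()


# Toy items, invented for illustration only.
ITEMS = [
    Item("How to clean a weapon safely", unsafe_under_new_definition=False),
    Item("Weapon assembly guide", unsafe_under_new_definition=True),
    Item("Tomorrow's weather report", unsafe_under_new_definition=False),
    Item("Gossip about a coworker", unsafe_under_new_definition=True),
]
```

Calling `steerability(rigid_judge, "default policy", "new policy", ITEMS)` on this toy data gives a follow rate of 1.0 where the new label agrees with the prior and 0.0 where it contradicts it. A large gap between the two buckets is the kind of pattern the abstract's finding would predict, though the real study may define and measure it differently.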