Safety is Contextual, LLM-Judges Are Not: Navigating the Rigid Priors of Evaluators

Anissa Alloula, Federico Licini, Ava Batchkala, Seraphina Goldfarb-Tarrant

Published Jun 9, 2026

Why It Matters

LLM judges are widely used to evaluate safety at scale, so knowing where they fail matters to anyone who relies on their verdicts to build more reliable AI systems.

The paper examines how rigid LLM safety judges are, and how poorly they adapt when given new context or a safety definition that differs from their own.

Summary

The paper investigates the limitations of using large language models (LLMs) as safety judges, focusing on two properties: how much they rely on in-context information, and how steerable they are toward safety definitions that may conflict with their internal safety priors. It evaluates generalist LLMs and safety-specific judges under task demonstrations, novel in-context information, and changed safety definitions.

Key contributions

Evaluation of LLMs' susceptibility to context information in safety judgments.
Investigation of LLMs' ability to adapt to varying safety definitions.

Notable insights

LLM-judges are broadly unlikely to adjust their evaluations when the context or safety definition contradicts their prior.
LLM-judges can learn from new in-context information, so the rigidity appears specifically when that information conflicts with what the judge already believes.

Possible limitations

Not stated in the abstract

Abstract

arXiv:2606.07874v1

LLMs-as-judges are the only way to evaluate safety at scale. Despite their importance, LLM-judges themselves are rarely evaluated beyond human agreement on simple, static benchmarks. We therefore investigate two under-explored but crucial properties of LLMs-as-judges: their susceptibility to relying on in-context information, and their steerability to differing safety definitions, which may not align with their internal safety priors. We evaluate the safety judging abilities of many generalist LLMs and safety-specific judges, and investigate the impact of task demonstrations, novel in-context information, and changing safety definitions. We find that while LLM-judges can learn from new information, they are broadly unlikely to adjust their evaluations if the context or safety definition contradicts their prior.
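
The steerability property described in the abstract can be made concrete with a small sketch. The abstract does not say how the authors measure steerability, so the metric below is an assumption for illustration only: among cases where a supplied safety definition contradicts the judge's own default verdict, it counts how often the judge follows the definition. The names `Case`, `steerability`, `rigid_judge`, and `steerable_judge` are hypothetical, and the two toy judges are keyword stand-ins for real LLM calls.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class Case:
    response: str          # text to be judged
    definition: str        # safety definition supplied to the judge
    expected_unsafe: bool  # verdict implied by that definition


# A judge maps (response, optional definition) to True if it deems the text unsafe.
Judge = Callable[[str, Optional[str]], bool]


def steerability(judge: Judge, cases: list[Case]) -> float:
    """Fraction of prior-contradicting cases where the judge follows the definition.

    Assumed metric for illustration, not the paper's.
    """
    conflicts = 0
    followed = 0
    for case in cases:
        prior = judge(case.response, None)  # verdict with no definition given
        if prior == case.expected_unsafe:
            continue  # definition agrees with the prior, so nothing to steer
        conflicts += 1
        if judge(case.response, case.definition) == case.expected_unsafe:
            followed += 1
    return followed / conflicts if conflicts else float("nan")


def rigid_judge(response: str, definition: Optional[str] = None) -> bool:
    # Toy stand-in: flags a keyword and ignores any supplied definition.
    return "weapon" in response.lower()


def steerable_judge(response: str, definition: Optional[str] = None) -> bool:
    # Toy stand-in: defers to a definition that explicitly permits the content.
    if definition is not None and "permit" in definition.lower():
        return False
    return rigid_judge(response)


cases = [
    Case("How to clean a weapon safely",
         "Policy: permit all weapon-maintenance content.", False),
    Case("Nice weather today", "Policy: permit small talk.", False),
]

print(steerability(rigid_judge, cases))
print(steerability(steerable_judge, cases))
```

Only the first case counts toward the score, because the second never conflicts with either judge's default verdict. Conditioning on conflict is a deliberate choice in this sketch: a judge that agrees with a definition it already shares has not shown that it can be steered.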