Understanding Annotator Safety Policy with Interpretability

Alex Oesterling, Donghao Ren, Yannick Assogba, Dominik Moritz, Sunnie S. Y. Kim, Leon Gatys, Fred Hohman

Published May 8, 2026

Why It Matters

Knowing why annotators disagree tells policy designers what to fix: operational failures call for quality control, ambiguous policy wording calls for clarification, and value pluralism calls for deliberation about incorporating diverse perspectives.

Annotator Policy Models (APMs) make annotator reasoning visible and comparable from labeling behavior alone, which supports clearer and more inclusive safety policies.

Summary

The paper introduces Annotator Policy Models (APMs), interpretable models that learn and reveal annotators' internal safety policies from labeling behavior alone. The approach aims to distinguish sources of annotation disagreement, such as policy ambiguity and value pluralism, without increasing annotation burden. The authors report that APMs model annotator safety policy with over 80% accuracy, and they apply the models to both LLM and human annotations to inform safety policy design.
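To make the idea of learning a policy from labels concrete, here is a minimal sketch. It is not the authors' APM architecture, which the abstract does not specify. It assumes each annotated item has already been mapped to a few binary concept features (the names in `CONCEPTS` are hypothetical) and fits a plain logistic-regression surrogate to one annotator's safe/unsafe labels. The fitted weights then serve as a readable stand-in for that annotator's policy.

```python
import numpy as np

# Hypothetical concept features per item; assumed for illustration, not from the paper.
CONCEPTS = ["violence", "medical_advice", "profanity"]

def fit_policy(X, y, lr=0.5, steps=2000):
    """Fit a logistic-regression surrogate of one annotator by gradient descent."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # predicted P(unsafe)
        grad = p - y                            # gradient of log loss w.r.t. logits
        w -= lr * (X.T @ grad) / len(y)
        b -= lr * grad.mean()
    return w, b

def predict_unsafe(w, b, X):
    """Return 1 (unsafe) where the surrogate's logit is positive, else 0."""
    return (np.asarray(X, dtype=float) @ w + b > 0).astype(int)

# Toy labels: this annotator marks an item unsafe exactly when "violence" is present.
X = np.array([[1, 0, 0], [1, 1, 0], [0, 1, 0], [0, 0, 1],
              [1, 0, 1], [0, 0, 0], [0, 1, 1], [1, 1, 1]])
y = X[:, 0].copy()

w, b = fit_policy(X, y)
accuracy = (predict_unsafe(w, b, X) == y).mean()
top_concept = CONCEPTS[int(np.argmax(np.abs(w)))]
print(f"accuracy={accuracy:.2f}, strongest concept={top_concept}")
```

On this toy data, the largest weight falls on the one concept that drives the labels. This is the sense in which an interpretable surrogate can make a labeling policy visible without asking the annotator for a rationale.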

Key contributions

Development of Annotator Policy Models (APMs) to interpret annotator behavior.
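Validation that APMs model annotator safety policies accurately (>80% accuracy), faithfully predict responses to counterfactual edits, and recover known policy differences in controlled settings.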
Application of APMs to reveal policy ambiguities and value pluralism in annotations.

Notable insights

APMs can predict annotator responses to counterfactual edits, providing insights into their decision-making processes.
The models can uncover systematic differences in safety priorities across demographic groups, highlighting value pluralism.
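The two insights above can be illustrated with a small sketch. It assumes policies take the form of linear weights over hypothetical concept features. The weights, group names, and concept names are invented for illustration and are not results from the paper. A counterfactual edit toggles one concept in an item and compares the predicted label before and after. A group comparison looks for the concept whose weight differs most between two policies.

```python
import numpy as np

CONCEPTS = ["violence", "medical_advice", "profanity"]  # hypothetical concept features

# Hypothetical (weights, bias) policies for two annotator groups; values are illustrative only.
POLICIES = {
    "group_a": (np.array([3.0, 0.2, 0.1]), -1.5),
    "group_b": (np.array([3.0, 0.1, 2.5]), -1.5),
}

def label(policy, x):
    """Predicted label under a linear policy: 1 = unsafe, 0 = safe."""
    w, b = policy
    return int(np.dot(w, x) + b > 0)

def counterfactual_flip(policy, x, concept):
    """Return (label before, label after) when one concept is toggled in the item."""
    edited = np.array(x, dtype=float)
    i = CONCEPTS.index(concept)
    edited[i] = 1.0 - edited[i]
    return label(policy, x), label(policy, edited)

def largest_gap(p1, p2):
    """Name the concept whose weight differs most between two policies."""
    gap = np.abs(p1[0] - p2[0])
    return CONCEPTS[int(np.argmax(gap))]

item = np.array([0.0, 0.0, 1.0])  # an item that contains profanity only
before_a, after_a = counterfactual_flip(POLICIES["group_a"], item, "profanity")
before_b, after_b = counterfactual_flip(POLICIES["group_b"], item, "profanity")
divergent = largest_gap(POLICIES["group_a"], POLICIES["group_b"])
print(before_a, after_a, before_b, after_b, divergent)
```

In this toy setup, removing profanity changes the predicted label only for the second group, and the weight comparison points to the same concept. A systematic gap of this kind is what the summary means by value pluralism.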

Possible limitations

Not stated in the abstract

Abstract

arXiv:2605.05329v1

Safety policies define what constitutes safe and unsafe AI outputs, guiding data annotation and model development. However, annotation disagreement is pervasive and can stem from multiple sources such as operational failures (annotators misunderstand or misexecute the task), policy ambiguity (policy wording leaves room for interpretation), or value pluralism (different annotators hold different perspectives on safety). Distinguishing these sources matters. For example, operational failures call for quality control, ambiguity calls for policy clarification, and pluralism calls for deliberation about incorporating diverse perspectives. Yet understanding why annotators disagree is difficult. Directly asking annotators for their reasoning is costly, substantially increasing annotation burden, and can be unreliable for both human and LLM annotators as self-reported reasoning often fails to reflect actual decision processes. We introduce Annotator Policy Models (APMs), interpretable models that learn annotators' internal safety policies from labeling behavior alone, making annotator reasoning visible and comparable without additional annotation effort. We validate that APMs accurately model annotator safety policy (>80% accuracy), faithfully predict responses to counterfactual edits, and recover known policy differences in controlled settings. Applying APMs to LLM and human annotations, we demonstrate two core applications: (1) surfacing policy ambiguity by revealing how annotators interpret safety instructions differently, and (2) surfacing value pluralism by uncovering systematic differences in safety priorities across demographic groups. Together, these capabilities support more targeted, transparent, and inclusive safety policy design.