When Context Flips, Safety Breaks: Diagnosing Brittle Safety in Aligned Language Models

Dasol Choi, Alex Kwon

Published May 28, 2026Featured #3In the daily list May 29, 2026

Open on arXiv Read PDF

Daily score73.1

Editorial review7.5

Relevance0.476

Freshness0.722

Why It Matters

What makes this one worth your time

Understanding and addressing brittle safety is crucial for the deployment of reliable AI systems, particularly in high-stakes applications where safety is paramount.

This research uncovers critical vulnerabilities in the safety mechanisms of aligned language models under context changes.

Summary

The paper introduces the concept of 'brittle safety' in aligned language models, demonstrating through context-flip evaluation that these models often fail to adapt their safety responses to changing contexts, and proposes a new evaluation protocol and state-aware validation method.

Key contributions

Introduction of context-flip evaluation for assessing safety in language models.
Identification of the safety-commonsense gap across multiple models.
Development of a state-aware validator that outperforms traditional action-level guardrails in detecting safety failures.

Notable insights

Brittle safety is safety-specific and varies significantly across different models, indicating a need for tailored safety evaluations.
Standard action-level guardrails are ineffective in detecting consequence-flips, highlighting a gap in current safety mechanisms.

Possible limitations

Not stated in the abstract.

Abstract

arXiv:2605.27851v1 Announce Type: new Abstract: Safety benchmark scores provide incomplete evidence of deployment readiness: aligned language models often adhere to rigid rules even when a situational update flips which action is safe. We term this failure brittle safety. To diagnose it, we introduce context-flip evaluation, testing 12 models across a safety benchmark (PacifAIst) and two commonsense controls using paired variants where the nominally safe action produces harm. Three findings emerge. First, brittle safety is safety-specific: all 12 models exhibit a safety-commonsense gap (mean +17.4 pp). Baseline accuracy fails to predict brittleness: among models above 90% baseline accuracy, brittleness rates range from 13.7% to 90.0%. Second, failures stem from policy override rather than miscomprehension: despite acknowledging the context change in every case, models persist via three distinct mechanisms that vary by update type and model family. Third, on a hand-audited probe of catastrophic consequence-flip scenarios, standard action-level guardrails catch none, while a state-aware validator catches all without false alarms on correct interventions. This indicates that action-level content moderation is systematically blind to consequence-flips, motivating state-aware architectural alternatives. We release our protocol, perturbed benchmarks, and deployment probe.