Towards Context-Invariant Safety Alignment for Large Language Models

Yixu Wang, Yang Yao, Xin Wang, Yifeng Gao, Yan Teng, Xingjun Ma, Yingchun Wang

Published May 22, 2026
Editorial review: 6.8
Relevance: 0.541
Freshness: 0.000

Why It Matters

What makes this one worth your time

Anchor Invariance Regularization (AIR) could make language models more reliable by keeping their safety behavior consistent across varied and potentially adversarial contexts, which is crucial for deploying LLMs in real-world applications.

AIR enhances context-invariant safety alignment in LLMs by anchoring open-ended prompts to verifiable ones.

Summary

The paper introduces Anchor Invariance Regularization (AIR) to improve the context-invariant alignment of large language models by using verifiable prompts as anchors to regularize open-ended variants, enhancing safety and robustness against adversarial prompt framings.

Key contributions

  • Introduction of Anchor Invariance Regularization (AIR) for context-invariant alignment.
  • Demonstration of improved in-distribution and out-of-distribution performance using AIR.
  • Integration of AIR with group-based preference optimization techniques.

Notable insights

  • Using verifiable prompts as anchors provides a novel way to improve robustness without degrading performance on reliable variants.
  • The combination of AIR with group-based preference optimization suggests a structured approach to handle heterogeneous prompts.

Possible limitations

  • Not stated in the abstract

Abstract

arXiv:2605.20994v1

Preference-based post-training aligns LLMs with human intent, yet safety behavior often remains brittle. A model may refuse a harmful request in a standard prompt but comply when the same intent is wrapped in adversarial wording. We suggest that robust safety requires context-invariant alignment, where behavior depends on the underlying intent rather than surface form. Enforcing invariance is difficult in alignment because not all training signals are equally trustworthy; for some prompt variants we can obtain verifiable feedback (e.g., multiple-choice), while for open-ended variants we typically rely on noisy, gameable reward proxies (e.g., learned judges). As a result, standard symmetric invariance regularizers can reduce cross-context discrepancies by lowering performance on reliable variants instead of improving open-ended robustness. To address this, we introduce Anchor Invariance Regularization (AIR), which treats verifiable prompts as anchors and uses a stop-gradient target to regularize only the open-ended variants toward the anchor performance. AIR is implemented as a plug-in auxiliary loss and combined with group-based preference optimization (e.g., GRPO) via heterogeneous prompt grouping. Across Safety, Moral Reasoning, and Math, AIR improves context invariance, boosting in-distribution group accuracy by 12.71% and out-of-distribution consistency by 33.49%, making safety constraints robust to adversarial framings.
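
The contrast the abstract draws between a symmetric invariance regularizer and a stop-gradient anchor can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the abstract does not give the form of the AIR loss, so the squared gap, the function names, the starting scores, and the idea of treating "performance" as a directly optimizable scalar are all hypothetical. The sketch only shows why holding the anchor constant changes which side moves.

```python
# Toy illustration (not the paper's implementation): two scalar "performance"
# scores, one for a verifiable anchor prompt and one for an open-ended variant.
# The penalty (anchor - open)^2 is an assumed stand-in for an invariance loss.


def symmetric_grads(anchor_score, open_score):
    """Gradients of (anchor - open)^2 with respect to both scores."""
    gap = anchor_score - open_score
    return 2.0 * gap, -2.0 * gap  # (d/d anchor, d/d open)


def anchored_grads(anchor_score, open_score):
    """Same penalty, but the anchor is a stop-gradient target (a constant)."""
    gap = anchor_score - open_score
    return 0.0, -2.0 * gap  # no gradient reaches the anchor


def run(grad_fn, anchor_score=0.9, open_score=0.5, lr=0.1, steps=50):
    """Plain gradient descent on the invariance penalty alone."""
    for _ in range(steps):
        g_anchor, g_open = grad_fn(anchor_score, open_score)
        anchor_score -= lr * g_anchor
        open_score -= lr * g_open
    return anchor_score, open_score


sym_anchor, sym_open = run(symmetric_grads)
air_anchor, air_open = run(anchored_grads)
print(f"symmetric: anchor={sym_anchor:.3f}, open={sym_open:.3f}")
print(f"anchored:  anchor={air_anchor:.3f}, open={air_open:.3f}")
```

In this toy setting the symmetric penalty closes the gap by meeting in the middle (both scores end near 0.7, so the reliable anchor gets worse), while the anchored version leaves the anchor at 0.9 and pulls only the open-ended score toward it. In a real training run the penalty would be one auxiliary term alongside the preference-optimization objective, which this sketch omits.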