Inference-Time Vulnerability Beyond Shallow Safety: Alignment Along Generation Trajectories
Kyungmin Park, Taesup Kim
Why It Matters
Understanding and mitigating inference-time vulnerabilities is crucial for deploying safer and more reliable language models in real-world applications.
The paper shows that safety-aligned LLMs can be redirected by short token injections at any generation step, and counters this by aligning models on perturbed generation trajectories.
Summary
The paper investigates inference-time vulnerabilities in safety-aligned large language models, focusing on how short token injections during generation can lead to harmful outputs. It proposes aligning models on generation trajectories to improve robustness against such perturbations.
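To make the injection setting concrete, here is a minimal sketch of a decoding loop in which a few tokens are forced into the output at a chosen generation step. Everything in it is an assumption for illustration: `generate_with_injection`, `toy_next_token`, and the token strings are hypothetical names, and the toy "model" is a hand-written rule, not the paper's setup or any real LLM.

```python
from typing import Callable, List, Sequence

REFUSAL = ["I", "cannot", "help", "with", "that", "."]


def toy_next_token(context: Sequence[str]) -> str:
    """Hypothetical stand-in for a model: refuses unless 'Sure' is in the context."""
    if "Sure" in context:
        return "<eos>" if context[-1] == "[complies]" else "[complies]"
    n = sum(1 for t in context if t in REFUSAL)  # refusal tokens emitted so far
    return REFUSAL[n] if n < len(REFUSAL) else "<eos>"


def generate_with_injection(
    next_token: Callable[[Sequence[str]], str],
    prompt: Sequence[str],
    injection: Sequence[str],
    inject_at: int,
    max_new_tokens: int = 16,
    eos: str = "<eos>",
) -> List[str]:
    """Greedy decoding; `injection` is forced in once `inject_at` tokens exist."""
    output: List[str] = []
    injected = False
    while len(output) < max_new_tokens:
        if not injected and len(output) == inject_at:
            output.extend(injection)  # forced tokens, not chosen by the model
            injected = True
            continue
        tok = next_token(list(prompt) + output)
        if tok == eos:
            break
        output.append(tok)
    return output


prompt = ["[request]"]
# inject_at=-1 never matches, so the refusal runs to completion.
print(generate_with_injection(toy_next_token, prompt, [], -1))
# Injecting two tokens after step 2 flips the toy model mid-refusal.
print(generate_with_injection(toy_next_token, prompt, ["Sure", ","], 2))
```

Setting `inject_at=0` in this sketch corresponds to an attack on the first output tokens, which is the shallow-safety case; larger values correspond to mid-sequence injection.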
Key contributions
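- Showed that inference-time vulnerability in LLMs extends beyond shallow safety: short token injections at any generation step, not only in the first few output tokens, can alter safety behavior.
- Proposed aligning models on generation trajectories constructed by simulating mid-sequence perturbation, which improves robustness to mid-sequence injection and generalizes to attacks that exploit early-token generation.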
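The abstract does not spell out how the training trajectories are built, so the following is only a guess at the general shape: cut a response at a random step, append an injected span, and pair the resulting context with a safe continuation as the training target. The names `TrajectoryExample` and `build_perturbed_trajectories`, the uniform choice of cut position, and the fixed safe continuation are all assumptions, not the authors' procedure.

```python
import random
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class TrajectoryExample:
    context: List[str]  # prompt + partial response + injected tokens
    target: List[str]   # continuation the model should be trained toward


def build_perturbed_trajectories(
    prompt: Sequence[str],
    response: Sequence[str],
    injections: Sequence[Sequence[str]],
    safe_continuation: Sequence[str],
    rng: random.Random,
) -> List[TrajectoryExample]:
    """Simulate one mid-sequence perturbation per injection (hypothetical scheme)."""
    examples = []
    for injection in injections:
        # randint is inclusive at both ends; pos == 0 perturbs the very first step.
        pos = rng.randint(0, len(response))
        context = list(prompt) + list(response[:pos]) + list(injection)
        examples.append(TrajectoryExample(context, list(safe_continuation)))
    return examples


examples = build_perturbed_trajectories(
    prompt=["[request]"],
    response=["I", "cannot", "help", "with", "that", "."],
    injections=[["Sure", ","], ["Actually", ",", "here"]],
    safe_continuation=["I", "still", "cannot", "help", "."],
    rng=random.Random(0),
)
for ex in examples:
    print(ex.context, "->", ex.target)
```

Sampling the cut position over the whole response, including position 0, is what lets one scheme cover both early-token and mid-sequence perturbations in this sketch.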
Notable insights
- Shallow safety is a specific case of a broader inference-time vulnerability where token injections can alter model behavior.
- A model's hidden-state alignment with refusal directions does not predict its robustness to token injections, so internal state alone does not determine generation behavior under perturbation.
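For readers unfamiliar with the term, one common way to define a refusal direction is the difference between the mean hidden state on harmful prompts and the mean hidden state on harmless prompts; alignment with it can then be scored by cosine similarity. The sketch below assumes that construction, which the abstract does not confirm, and uses tiny hand-made vectors in place of real hidden states.

```python
import math
from typing import List, Sequence


def mean_vector(vectors: Sequence[Sequence[float]]) -> List[float]:
    """Element-wise mean of equally sized vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def refusal_direction(
    harmful_states: Sequence[Sequence[float]],
    harmless_states: Sequence[Sequence[float]],
) -> List[float]:
    """Difference of means: mean(harmful) - mean(harmless). Assumed construction."""
    mu_harmful = mean_vector(harmful_states)
    mu_harmless = mean_vector(harmless_states)
    return [a - b for a, b in zip(mu_harmful, mu_harmless)]


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; raises if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


# Toy 2-d "hidden states"; real ones would come from a model layer.
direction = refusal_direction([[1.0, 0.0], [3.0, 0.0]], [[0.0, 0.0], [0.0, 0.0]])
print(cosine([5.0, 0.0], direction))  # state pointing along the direction
print(cosine([0.0, 1.0], direction))  # state orthogonal to the direction
```

The insight above is that a high score on a measure like this did not predict whether the model stayed safe after an injection, so such a score should not be read as a robustness guarantee.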
Possible limitations
- Not stated in the abstract
Abstract
arXiv:2606.04778v1
Safety-aligned Large Language Models (LLMs) remain vulnerable to interventions during inference that redirect generation toward harmful outputs. Recent work attributes this to shallow safety, where alignment concentrates in the first few output tokens. We show that shallow safety is a special case of a broader inference-time vulnerability, in which short token injections at any generation step can substantially alter subsequent safety behavior. We also find that a model's alignment with refusal directions in its hidden states does not predict its robustness to such injection, revealing that internal state alone does not determine generation behavior under perturbation. To address this, we align models directly on generation trajectories constructed by simulating mid-sequence perturbation, and show that this improves robustness to mid-sequence injection and generalizes to attacks that exploit early-token generation. Our work argues that robust safety alignment requires training on the generation process itself, not only its outputs.