When Safety Geometry Collapses: Fine-Tuning Vulnerabilities in Agentic Guard Models

Ismail Hossain, Sai Puppala, Jannatul Ferdaus, Md Jahangir Alam, Yoonpyo Lee, Syed Bahauddin Alam, Sajedul Talukder

Published May 7, 2026

Editorial review: 6.8

Relevance: 0.492

Freshness: 0.000

Why It Matters

What makes this one worth your time

Guard models are deployed as protection layers in agentic AI pipelines, so a guard that loses its safety alignment during routine fine-tuning leaves the whole pipeline exposed.

The paper shows that benign fine-tuning alone can collapse safety alignment in guard models and proposes a training-time regularization technique to restore it.

Summary

The paper investigates how safety alignment in guard models breaks down when they are fine-tuned on entirely benign data, tracing the failure to a collapse of latent safety geometry: the structured harmful-benign representational boundary that guides classification. It introduces Fisher-Weighted Safety Subspace Regularization (FW-SSR), a training-time penalty that mitigates the collapse by sharpening the safety subspace rather than merely anchoring it.

Key contributions

Identification of safety geometry collapse in guard models during benign fine-tuning.
Proposal of Fisher-Weighted Safety Subspace Regularization to enhance safety subspaces.
Demonstration that FW-SSR recovers 75% refusal on Granite Guardian (CKA = 0.983) and reduces WildGuard's Attack Success Rate to 3.6%, below the unmodified baseline.
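
The abstract describes FW-SSR only at a high level: direction weights derived from diagonal Fisher information, plus an adaptive coefficient that scales with task-safety gradient conflict. The sketch below is one plausible reading of those two ingredients, not the paper's formulation. All function names, the weighting rule, the cosine-based conflict measure, and the choice to place the drift vector, Fisher diagonal, and subspace basis in one shared space are assumptions made for illustration. It also only anchors the subspace; it does not reproduce the sharpening behavior the authors report.

```python
import numpy as np


def fisher_direction_weights(basis: np.ndarray, fisher_diag: np.ndarray) -> np.ndarray:
    """Weight each safety direction v_i by v_i^T diag(F) v_i, normalized to sum to 1.

    Assumption: this is one simple way to turn a diagonal Fisher estimate
    into per-direction weights; the paper may use a different rule.
    """
    raw = (basis ** 2) @ fisher_diag
    total = raw.sum()
    if total > 0:
        return raw / total
    # Fall back to uniform weights when the Fisher estimate is all zeros.
    return np.full(len(raw), 1.0 / len(raw))


def adaptive_lambda(g_task: np.ndarray, g_safety: np.ndarray, lambda_base: float = 1.0) -> float:
    """Scale the penalty strength with gradient conflict (hypothetical rule).

    Conflict is measured as the negative part of the cosine similarity, so the
    result ranges from lambda_base (no conflict) to 2 * lambda_base (opposed).
    """
    denom = np.linalg.norm(g_task) * np.linalg.norm(g_safety)
    cos = float(g_task @ g_safety / denom) if denom > 0 else 0.0
    conflict = max(0.0, -cos)
    return lambda_base * (1.0 + conflict)


def fw_ssr_penalty(drift: np.ndarray, basis: np.ndarray, fisher_diag: np.ndarray, lam: float) -> float:
    """Penalize drift along the safety directions, weighted by Fisher importance.

    drift: current vector minus its pre-fine-tuning reference, shape (d,)
    basis: orthonormal safety directions, shape (k, d)
    """
    weights = fisher_direction_weights(basis, fisher_diag)
    proj = basis @ drift  # coordinates of the drift along each safety direction
    return float(lam * np.sum(weights * proj ** 2))
```

In a training loop, a term like this would be added to the task loss at each step, with `lam` recomputed from the current task and safety gradients. Drift orthogonal to the safety directions costs nothing, which is what lets domain specialization proceed while the safety subspace is held in place.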

Notable insights

Safety alignment can degrade not only through adversarial attacks but also through standard domain specialization.
Structural representational geometry (CKA, Fisher score) is a more reliable predictor of safety behavior than absolute displacement metrics.
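
For readers unfamiliar with CKA (centered kernel alignment), the sketch below computes the standard linear variant between two activation matrices, such as one layer's activations on the same prompts before and after fine-tuning. Whether the paper uses linear or kernel CKA, and on which layers, is not stated in the abstract, so treat the choice of the linear form as an assumption.

```python
import numpy as np


def linear_cka(x: np.ndarray, y: np.ndarray) -> float:
    """Linear CKA between activation matrices of shape (n_samples, n_features).

    Returns a value in [0, 1]; 1 means the two representations match up to
    rotation and isotropic scaling.
    """
    # Center each feature so constant offsets do not affect the score.
    x = x - x.mean(axis=0, keepdims=True)
    y = y - y.mean(axis=0, keepdims=True)
    # ||Y^T X||_F^2 / (||X^T X||_F * ||Y^T Y||_F)
    cross = np.linalg.norm(y.T @ x, ord="fro") ** 2
    denom = np.linalg.norm(x.T @ x, ord="fro") * np.linalg.norm(y.T @ y, ord="fro")
    # A fully collapsed (constant) representation has zero variance; report 0.
    return float(cross / denom) if denom > 0 else 0.0
```

Because the score ignores rotation and uniform rescaling, a model can move far in absolute terms while keeping CKA near 1, or move little while its structure degrades. That is the sense in which a structural metric and a displacement metric can disagree.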

Possible limitations

Not stated in the abstract

Abstract

arXiv:2605.02914v1

A guard model fine-tuned on entirely benign data can lose all safety alignment, not through adversarial manipulation but through standard domain specialization. We demonstrate this failure across three purpose-built safety classifiers (LlamaGuard, WildGuard, and Granite Guardian) deployed as protection layers in agentic AI pipelines, and show that it originates in the destruction of latent safety geometry: the structured harmful-benign representational boundary that guides classification. We extract per-layer safety subspaces via SVD on class-conditional activation differences and track how this boundary evolves under benign fine-tuning. Granite Guardian undergoes complete collapse (refusal rate drops from 85% to 0%, CKA falls to zero, and 100% of outputs become ambiguous), a severity exceeding prior findings on general-purpose LLMs, explained by the specialization hypothesis: concentrated safety representations are efficient but catastrophically brittle. To mitigate this, we propose Fisher-Weighted Safety Subspace Regularization (FW-SSR), a training-time penalty combining (i) curvature-aware direction weights derived from diagonal Fisher information and (ii) an adaptive $\lambda_t$ that scales with task-safety gradient conflict. FW-SSR recovers 75% refusal on Granite Guardian (CKA = 0.983) and reduces WildGuard's Attack Success Rate to 3.6%, below the unmodified baseline, by actively sharpening the safety subspace rather than merely anchoring it. Across all three models, structural representational geometry (CKA, Fisher score) predicts safety behavior more reliably than absolute displacement metrics, establishing geometry-based monitoring as a necessary component of guard model evaluation in agentic deployments.
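
The subspace extraction step mentioned in the abstract can be illustrated on synthetic data. The sketch below assumes that a "class-conditional activation difference" means each harmful-prompt activation minus the benign class mean at one layer, and that the safety subspace is spanned by the top right singular vectors of that difference matrix. Both are interpretations; the function name `safety_subspace`, the rank `k`, and the toy data are hypothetical.

```python
import numpy as np


def safety_subspace(harmful: np.ndarray, benign: np.ndarray, k: int = 2) -> np.ndarray:
    """Return a (k, d) orthonormal basis for one layer's harmful-vs-benign directions.

    harmful, benign: activations of shape (n_samples, d) from a single layer.
    Requires k <= min(n_samples, d).
    """
    # Assumption: difference = harmful activation minus the benign class mean.
    diffs = harmful - benign.mean(axis=0, keepdims=True)
    # Rows of vt are right singular vectors, ordered by decreasing singular value.
    _, _, vt = np.linalg.svd(diffs, full_matrices=False)
    return vt[:k]


# Toy layer: harmful activations are shifted along one planted direction.
rng = np.random.default_rng(0)
d = 16
planted = np.zeros(d)
planted[3] = 1.0
benign_acts = rng.normal(scale=0.1, size=(200, d))
harmful_acts = rng.normal(scale=0.1, size=(200, d)) + 5.0 * planted

basis = safety_subspace(harmful_acts, benign_acts, k=2)
alignment = abs(float(basis[0] @ planted))  # close to 1 if the direction is recovered
```

Repeating this per layer, before and after fine-tuning, gives a pair of bases whose overlap can be tracked over training, which is the kind of monitoring the abstract argues for.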