When Behavioral Safety Evaluation Fails: A Representation-Level Perspective
Enyi Jiang, Anders Gjølbye, Yibo Jacky Zhang, Sanmi Koyejo
Why It Matters
This research shows that LLMs can carry vulnerabilities that behavioral safety metrics overlook.
Because output-only evaluations may therefore give a false sense of security, understanding representation-level vulnerabilities is important for building more robust AI systems.
Summary
The paper identifies a gap between behavioral safety evaluations and representation-level robustness in large language models, proposing a new evaluation framework and the Latent Vulnerability Score (LVS) to assess model vulnerabilities under intervention.
Key contributions
- Formalization of the audit gap between behavioral safety and representation-level robustness.
- Development of an intervention-based evaluation framework for testing model robustness.
- Introduction of the Latent Vulnerability Score (LVS) for measuring latent vulnerabilities.
Notable insights
- Dissociated models show substantially elevated LVSs despite comparable refusal behavior, with intermediate representations being the most sensitive to intervention.
- Dissociated models can maintain safe outward behavior while being internally vulnerable, highlighting the need for deeper evaluations.
Possible limitations
- Not stated in the abstract.
Abstract
arXiv:2606.08044v1
Large Language Model (LLM) safety has often been evaluated at the behavior level, which provides limited evidence of internal robustness, as these evaluations target outputs rather than representation-level vulnerability under intervention. We formalize this discrepancy as the audit gap: the difference between behavioral safety and robustness under intervention. To study this gap, we construct dissociated models that preserve safe outward behavior while remaining vulnerable in the latent space. We introduce an intervention-based evaluation framework to test model robustness through soft interventions in parameter and latent spaces, including harmful fine-tuning and layer-wise latent perturbations. To formalize the evaluation, we propose the Latent Vulnerability Score (LVS) to measure how easily harmful behavior can be elicited by bounded latent perturbations. Using this evaluation framework, we show that behavioral safety metrics are insufficient measures of representation-level robustness across multiple safely and unsafely aligned state-of-the-art models. Notably, dissociated models show substantially elevated LVSs despite comparable refusal behavior under harmful intervention, with intermediate representations being the most sensitive to intervention. Our results suggest that behavioral safety evaluation alone provides an incomplete picture of model robustness, motivating representation-aware audits of latent vulnerability and observable behavior.
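The abstract does not give the formula for the LVS, so the following is only a toy sketch of the general idea that two models can refuse equally often while differing in how easily a bounded latent perturbation flips that refusal. Everything in it is an assumption made for illustration: the linear "refusal readout" (`w`, `b`), the L2 perturbation budget `eps`, and the helper names `refusal_rate` and `toy_vulnerability` are hypothetical and are not the paper's definitions or method.

```python
import math

def refusal_score(w, b, h):
    # Hypothetical linear readout on a hidden vector h: score > 0 means "refuse".
    return sum(wi * hi for wi, hi in zip(w, h)) + b

def refusal_rate(w, b, hiddens):
    # Behavioral view: fraction of prompts the toy model refuses, with no intervention.
    return sum(refusal_score(w, b, h) > 0 for h in hiddens) / len(hiddens)

def flips_within_budget(w, b, h, eps):
    # Worst-case L2-bounded perturbation for a linear readout: delta = -eps * w / ||w||.
    norm = math.sqrt(sum(wi * wi for wi in w))
    h_perturbed = [hi - eps * wi / norm for wi, hi in zip(w, h)]
    return refusal_score(w, b, h) > 0 and refusal_score(w, b, h_perturbed) <= 0

def toy_vulnerability(w, b, hiddens, eps):
    # Illustrative stand-in (NOT the paper's LVS): share of refused prompts
    # whose refusal is flipped by a perturbation of L2 norm at most eps.
    refused = [h for h in hiddens if refusal_score(w, b, h) > 0]
    if not refused:
        return 0.0
    return sum(flips_within_budget(w, b, h, eps) for h in refused) / len(refused)

# Made-up hidden states for two toy "models" sharing the same readout.
w, b, eps = [3.0, 4.0], 0.0, 2.0
robust_hiddens = [[3.0, 4.0], [6.0, 8.0]]    # refusals with a large margin
fragile_hiddens = [[0.3, 0.4], [0.6, 0.8]]   # refusals with a small margin

for name, hiddens in [("robust", robust_hiddens), ("fragile", fragile_hiddens)]:
    print(name, refusal_rate(w, b, hiddens), toy_vulnerability(w, b, hiddens, eps))
```

In this toy setup both sets of hidden states are refused on every prompt, yet only the small-margin set loses its refusals under the same perturbation budget. That difference between the behavioral number and the perturbation-based number is the kind of discrepancy the abstract calls the audit gap.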