Measuring Safety Alignment Effects in Autonomous Security Agents
Isaac David, Arthur Gervais
Why It Matters
Single-turn refusal benchmarks cannot show how safety alignment affects a security agent that must inspect repositories, call tools, and produce vulnerability evidence inside an authorized sandbox.
The study measures that effect with a new trace-based benchmark of 30 local vulnerability-analysis tasks.
Summary
The paper compares stock safety-aligned language models with their uncensored or abliterated derivatives when both run as autonomous security agents on vulnerability-analysis tasks. It introduces a trace-based benchmark, reports results for four model families, and argues that safety alignment effects must be measured at the system level rather than by refusal rate alone.
Key contributions
- Introduced a trace-based benchmark for evaluating autonomous security agents.
- Compared performance of stock and less-restricted language models on security tasks.
- Showed that refusal rate alone is an inadequate safety signal for agents: task success, tool reliability, and evidence grounding also differ between stock and less-restricted models.
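To make "deterministic success predicates" and "grounding checks" concrete, here is a minimal Python sketch of how a single trace might be scored. Everything in it is an assumption for illustration: the `Trace` and `Task` schemas, the tool names, the file path, and the boolean grounding rule are hypothetical and are not taken from the paper, which reports grounding as a score out of five.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Trace:
    """One recorded agent run (hypothetical schema, not the paper's)."""
    tool_calls: tuple        # tool names, in call order
    files_read: frozenset    # paths the agent actually opened
    cited_files: frozenset   # paths the final report cites as evidence
    report: str              # final vulnerability report text


@dataclass(frozen=True)
class Task:
    """A benchmark task with a fixed tool set and a deterministic predicate."""
    task_id: str
    allowed_tools: frozenset
    success: Callable[[Trace], bool]  # pure function of the trace, no model judge


def used_only_allowed_tools(task: Task, trace: Trace) -> bool:
    # Every tool the agent called must belong to the task's fixed tool set.
    return set(trace.tool_calls) <= task.allowed_tools


def is_grounded(trace: Trace) -> bool:
    # Simplified rule: evidence is grounded only if the report cites at least
    # one file and every cited file was actually read during the run.
    return bool(trace.cited_files) and trace.cited_files <= trace.files_read


def evaluate(task: Task, trace: Trace) -> dict:
    # Keep the three signals separate instead of folding them into one score.
    return {
        "task_id": task.task_id,
        "tools_ok": used_only_allowed_tools(task, trace),
        "grounded": is_grounded(trace),
        "success": task.success(trace),
    }


# Hypothetical task: the report must name the vulnerable function.
demo_task = Task(
    task_id="demo-001",
    allowed_tools=frozenset({"read_file", "grep"}),
    success=lambda t: "parse_header" in t.report,
)

demo_trace = Trace(
    tool_calls=("grep", "read_file"),
    files_read=frozenset({"src/parser.c"}),
    cited_files=frozenset({"src/parser.c"}),
    report="Unchecked length in parse_header (src/parser.c).",
)

result = evaluate(demo_task, demo_trace)
```

Keeping `success` a pure function of the recorded trace means the same trace always receives the same verdict, and reporting `grounded` separately catches a run that names the right function without having read the file it cites.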
Notable insights
- Safety alignment should be measured at the system level, considering refusal, unsafe actions, tool reliability, and evidence grounding.
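- Less-restricted derivatives do not help universally: the Gemma derivatives succeed more often (14.0% versus 0.7% for 31B, 10.7% versus 0.0% for 26B), the Qwen2.5-Coder derivative succeeds less often (2.0% versus 5.3%), and the abliterated Llama derivative fails the tool protocol.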
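The system-level view described in these insights can be sketched as a per-model summary that reports each signal on its own axis. This is an illustrative Python sketch only: the `TraceOutcome` fields and the toy data are assumptions, not the paper's schema or results.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TraceOutcome:
    """Per-trace labels (hypothetical schema, not the paper's)."""
    success: bool              # deterministic success predicate passed
    refused: bool              # agent declined the task
    unsafe_action: bool        # agent took an out-of-policy action
    tool_protocol_error: bool  # agent failed to call tools correctly
    grounding: float           # evidence-grounding score, assumed 0 to 5


def summarize(outcomes: list) -> dict:
    """Report each safety and capability signal separately."""
    n = len(outcomes)
    if n == 0:
        raise ValueError("no traces to summarize")
    return {
        "success_rate": sum(o.success for o in outcomes) / n,
        "refusal_rate": sum(o.refused for o in outcomes) / n,
        "unsafe_action_rate": sum(o.unsafe_action for o in outcomes) / n,
        "tool_error_rate": sum(o.tool_protocol_error for o in outcomes) / n,
        "mean_grounding": sum(o.grounding for o in outcomes) / n,
    }


# Toy data: no refusals at all, yet only one of four runs succeeds.
toy = [
    TraceOutcome(True, False, False, False, 4.0),
    TraceOutcome(False, False, False, True, 0.0),
    TraceOutcome(False, False, False, False, 2.0),
    TraceOutcome(False, False, False, False, 2.0),
]

summary = summarize(toy)
```

In the toy data the refusal rate is zero while the success rate is 0.25 and a quarter of the runs break the tool protocol, which is the kind of difference a refusal-only metric would hide.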
Possible limitations
- Not stated in the abstract
Abstract
arXiv:2605.19722v1
Do stock safety-aligned language models and their uncensored or abliterated derivatives behave differently when run as autonomous security agents? Single-turn refusal benchmarks cannot answer this question: security agents must inspect repositories, call tools, and produce vulnerability evidence inside authorized sandboxes. We present a trace-based benchmark of 30 local vulnerability-analysis tasks with fixed tools, deterministic success predicates, redaction rules, and grounding checks, and compare four stock models against uncensored or abliterated derivatives: Gemma 4 31B, Gemma 4 26B A4B, Qwen2.5-Coder 7B, and Llama 3.1 8B. The artifact contains 1,500 security-agent traces and 800 non-security control traces. The Gemma pairs show large less-restricted gains on security tasks: 14.0% versus 0.7% success for 31B and 10.7% versus 0.0% for 26B, with higher mean grounding (3.91 versus 3.27 and 4.12 versus 1.64 out of five) and 0.0% refusal, suppressed-action, and unsafe-action rates in the 31B traces. However, controls and non-Gemma pairs rule out a clean security-specific or universal less-restricted effect: Gemma gaps also appear on ordinary coding tasks, Qwen2.5-Coder success is lower for the less-restricted derivative (2.0% versus 5.3%), and the abliterated Llama derivative fails the tool protocol. Across all families, hard proof-of-trigger and patch-verification tasks remain unsolved. These results show that safety alignment effects in autonomous security agents should be measured at the system level, separating refusal, unsafe action, tool reliability, and evidence grounding rather than treating refusal rate as the safety signal.