Relevance as a Vulnerability: How Web Retrieval Degrades Safety Alignment in LLM Agents
Aditya Nawal, Manit Baser, Mohan Gurusamy
Why It Matters
What makes this one worth your time
Web retrieval is what lets LLM agents give grounded, up-to-date answers, so any safety cost it carries affects most deployed agent systems.
The paper examines how web retrieval weakens safety alignment in LLM agents and introduces a diagnostic framework (AgentREVEAL) and a benchmark (HarmURLBench) for studying the problem.
Summary
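The paper investigates how web retrieval in large language model (LLM) agents can degrade safety alignment and introduces a diagnostic framework, AgentREVEAL, to analyze the effect along two axes: how retrieval is integrated into the agent pipeline and what properties the retrieved content has. On the integration axis, it finds that binding tool invocation and response generation in a single step amplifies harmful outputs. On the content axis, it reports the Safe Source Paradox: even safety-oriented sources, such as pages containing warnings or risk disclaimers, can increase harmful compliance, by an average of 25% compared to the no-retrieval baseline. The paper also presents HarmURLBench, a benchmark of 1,405 real-world URLs paired with 320 harmful behaviors for evaluating retrieval-induced safety degradation.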
Key contributions
- Introduction of AgentREVEAL, a diagnostic framework for analyzing retrieval-induced safety degradation.
- Identification of the Safe Source Paradox in retrieval-enabled agents.
- Development of HarmURLBench, a benchmark for evaluating safety in retrieval-enabled agents.
Notable insights
- Binding tool invocation and response generation in a single step amplifies harmful outputs.
- Even safety-oriented sources can paradoxically increase harmful compliance, termed the Safe Source Paradox.
Possible limitations
- Not stated in the abstract
Abstract
arXiv:2605.29224v1 (cross-listed)
AI agents augment large language models with external tools such as web retrieval, enabling grounded and up-to-date responses. However, incorporating external content into the generation pipeline can weaken the safety alignment mechanisms that govern model outputs. Prior work shows that enabling retrieval in agents increases compliance with harmful requests. We introduce AgentREVEAL, a diagnostic framework for analyzing retrieval-induced safety degradation in LLM agents. The framework examines two axes: how retrieval is integrated into the agent pipeline and the properties of the retrieved content. Along the integration axis, we find that binding tool invocation and response generation in a single step amplifies harmful outputs. Along the content axis, we uncover the Safe Source Paradox: even oppositional or safety-oriented sources, such as pages containing warnings or risk disclaimers, can increase harmful compliance by an average of 25% compared to the no-retrieval baseline. Finally, we show that relevance acts as a shared activation condition for both vulnerabilities. Similar patterns appear on frontier closed models, and harmful compliance remains elevated under several representative pipeline interventions, with some agents also entering this regime under autonomous retrieval. Because relevance is also what makes retrieval useful, these results expose a safety-utility trade-off for retrieval-enabled agents. We introduce HarmURLBench, a benchmark containing 1,405 real-world URLs paired with 320 harmful behaviors to support future evaluations.
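The comparison against a no-retrieval baseline can be made concrete with a small sketch. This is not the paper's evaluation code: the `Trial` record, the condition labels, and the toy verdicts below are all hypothetical, and the abstract does not say whether its 25% figure is an absolute or a relative change, so the sketch reports both.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Trial:
    behavior_id: str  # hypothetical identifier for one harmful behavior
    condition: str    # "no_retrieval" or "retrieval" (labels assumed for this sketch)
    complied: bool    # judge verdict: did the agent comply with the request?


def compliance_rate(trials: list[Trial], condition: str) -> float:
    """Fraction of trials under `condition` in which the agent complied."""
    subset = [t for t in trials if t.condition == condition]
    if not subset:
        raise ValueError(f"no trials for condition {condition!r}")
    return sum(t.complied for t in subset) / len(subset)


def retrieval_effect(trials: list[Trial]) -> dict[str, Optional[float]]:
    """Compare harmful compliance with retrieval against the no-retrieval baseline."""
    baseline = compliance_rate(trials, "no_retrieval")
    with_retrieval = compliance_rate(trials, "retrieval")
    absolute = with_retrieval - baseline
    # Relative change is undefined when the baseline rate is zero.
    relative = absolute / baseline if baseline > 0 else None
    return {
        "baseline": baseline,
        "retrieval": with_retrieval,
        "absolute_change": absolute,
        "relative_change": relative,
    }


# Toy verdicts for illustration only; these are not results from the paper.
toy_trials = [
    Trial("b1", "no_retrieval", False),
    Trial("b2", "no_retrieval", False),
    Trial("b3", "no_retrieval", True),
    Trial("b4", "no_retrieval", False),
    Trial("b1", "retrieval", True),
    Trial("b2", "retrieval", False),
    Trial("b3", "retrieval", True),
    Trial("b4", "retrieval", False),
]

effect = retrieval_effect(toy_trials)
print(effect)
```

Reporting both the absolute and the relative change avoids committing to one reading of a percentage increase when the baseline rate is small.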