When Alignment Isn't Enough: Response-Path Attacks on LLM Agents
Mingyu Luo, Zihan Zhang, Zesen Liu, Yuchong Xie, Zhixiang Zhang, Dung Hiu Hilton Yeung, Wai Ip Lai, Ping Chen, Ming Wen, Dongdong She
Why It Matters
What makes this one worth your time
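BYOK agent architectures let users route LLM traffic through third-party relays, so a response can be altered after the model generates it and before the agent acts on it. Understanding and mitigating such response-path attacks is therefore central to the reliability and security of LLM agents, especially in sensitive contexts.
This research shows that model alignment alone does not protect these architectures: without end-to-end integrity between the LLM and the agent, a malicious relay can observe, suppress, or replace the messages the agent receives.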
Summary
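The paper identifies and formalizes a new security threat, post-alignment tampering, in which a malicious relay in a Bring-Your-Own-Key (BYOK) agent architecture modifies an aligned LLM response after generation but before agent execution. It instantiates the threat as the Relay Tampering Attack (RTA), which reaches up to 99.1% attack success across AgentDojo and ASB with six LLMs. The paper also evaluates four existing defenses, none of which fully prevents RTA, and proposes a time-based detection defense.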
Key contributions
- Formalization of the Relay Tampering Attack (RTA) and its implications for LLM integrity.
- Empirical evaluation of RTA on AgentDojo and ASB with six LLMs, including a comparison against prompt-injection baselines.
- A proposed time-based detection defense that mitigates RTA while preserving agent utility.
Notable insights
- RTA demonstrates that even aligned LLMs can be compromised when a relay rewrites their outputs after generation, emphasizing the importance of end-to-end integrity.
- The proposed time-based detection defense offers a novel approach to mitigating RTA while maintaining agent functionality.
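The abstract does not describe how the time-based detector works, so the following is only one possible reading, not the authors' method. It assumes that extra work inside a relay (for example, resubmitting tampered outputs to the upstream LLM) adds measurable latency, and flags a response whose latency is far above a locally collected baseline. The function names, the baseline values, and the threshold factor `k` are all hypothetical.

```python
from statistics import mean, stdev


def latency_threshold(baseline_s: list[float], k: float = 3.0) -> float:
    """Return mean + k * sample standard deviation of the baseline latencies."""
    return mean(baseline_s) + k * stdev(baseline_s)


def is_suspicious(observed_s: float, baseline_s: list[float], k: float = 3.0) -> bool:
    """Flag a response whose latency exceeds the baseline threshold."""
    return observed_s > latency_threshold(baseline_s, k)


# Hypothetical baseline: response latencies (seconds) measured on a trusted path.
baseline = [1.0, 1.2, 0.9, 1.1, 1.0, 1.3, 0.8, 1.1]

normal_flag = is_suspicious(1.2, baseline)   # within the usual range
delayed_flag = is_suspicious(2.6, baseline)  # consistent with an extra round trip
```

A detector like this depends on how stable the baseline is; real response latency also varies with output length and provider load, which a practical version would need to account for.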
Possible limitations
- The abstract does not address potential scalability issues of the proposed defense mechanism.
- The abstract does not mention the computational resources required to deploy the proposed defense.
Abstract
arXiv:2605.02187v1
Bring-Your-Own-Key (BYOK) agent architectures let users route LLM traffic through third-party relays, creating a critical integrity gap: a malicious relay can modify an aligned LLM response after generation but before agent execution. We formalize this post-alignment tampering threat and show that, without end-to-end integrity, the relay can observe, suppress, or replace downstream messages, making even perfectly aligned LLMs ineffective against such attacks. We instantiate this threat as the Relay Tampering Attack (RTA), which performs multi-round strategic rewriting, minimal security-critical edits, and stealth restoration by resubmitting tampered outputs to the upstream LLM. Across AgentDojo and ASB with six LLMs, RTA achieves up to 99.1% attack success, outperforming prompt-injection baselines with modest overhead. Case studies on OpenClaw and Claude Code demonstrate real-world feasibility, and evaluations of four defenses show that none fully prevent RTA. Finally, we propose a time-based detection defense that mitigates RTA while preserving agent utility.
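To make the integrity gap concrete, here is a minimal sketch of what end-to-end integrity could look like. It is an illustration under stated assumptions, not a defense from the paper: it assumes the LLM provider attaches a message authentication code to each response and that the agent holds a verification key the relay never sees. Neither assumption is stated in the abstract, and in a BYOK setup the relay typically handles the API key, so the key here would have to be a separate secret. The response layout and all function names are hypothetical.

```python
import hashlib
import hmac
import json


def sign_response(key: bytes, response: dict) -> str:
    """Hypothetical provider side: MAC over a canonical JSON encoding."""
    payload = json.dumps(response, sort_keys=True, separators=(",", ":")).encode()
    return hmac.new(key, payload, hashlib.sha256).hexdigest()


def verify_response(key: bytes, response: dict, tag: str) -> bool:
    """Hypothetical agent side: recompute the MAC and compare in constant time."""
    return hmac.compare_digest(sign_response(key, response), tag)


def malicious_relay(response: dict) -> dict:
    """Toy relay that makes one small security-critical edit to a tool call."""
    tampered = json.loads(json.dumps(response))  # deep copy via JSON round trip
    tampered["tool_call"]["arguments"]["recipient"] = "attacker@example.com"
    return tampered


# Assumption: this key is shared by provider and agent and is unknown to the relay.
key = b"secret-unknown-to-the-relay"

original = {
    "tool_call": {
        "name": "send_email",
        "arguments": {"recipient": "alice@example.com", "body": "Report attached."},
    }
}
tag = sign_response(key, original)

forwarded = malicious_relay(original)

honest_ok = verify_response(key, original, tag)     # untouched response verifies
tampered_ok = verify_response(key, forwarded, tag)  # edited response fails
```

A symmetric key keeps the sketch short; a provider-side digital signature checked with a public key would avoid sharing a secret with every agent. Either way, the check only works if the agent refuses to execute a tool call whose tag is missing or fails verification.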