AgentDoG 1.5: A Lightweight and Scalable Alignment Framework for AI Agent Safety and Security
Dongrui Liu, Yu Li, Zhonghao Yang, Peng Wang, Guanxu Chen, Yuejin Xie, Qinghua Mao, Wanying Qu, Yanxu Zhu, Tianyi Zhou, Leitao Yuan, Zhijie Zheng, Qihao Lin, Yimin Wang, Haoyu Luo, Shuai Shao, Chen Qian, Qingyu Liu, Ling Tang, Ruiyang Qin, Qihan Ren, Junxiao Yang, Kun Wang, Zhiheng Xi, Linfeng Zhang, Ranjie Duan, Bo Zhang, Wenjie Wang, Wen Shen, Qiaosheng Zhang, Yan Teng, Chaochao Lu, Rui Mei, Man Li, Jialing Tao, Xi Lin, Tianhang Zheng, Yong Liu, Quanshi Zhang, Lei Zhu, Xingjun Ma, Junhua Liu, Hui Xue, Xiaoxiang Zuo, Xiangnan He, Chao Shen, Xianglong Liu, Minlie Huang, Jing Shao, Xia Hu
Why It Matters
What makes this one worth your time
AI engineers and researchers should care because AgentDoG 1.5 addresses emerging safety risks in AI agents, offering a scalable framework that reduces deployment overhead and supports real-time safety moderation.
AgentDoG 1.5 offers a scalable framework for AI agent safety with state-of-the-art performance.
Summary
The paper presents AgentDoG 1.5, a lightweight and scalable alignment framework for AI agent safety and security. It targets risks introduced by modern open-world agents and by frontier AI models that lower attack barriers. It introduces an updated agent safety taxonomy, a taxonomy-guided data engine, and a training-free online guardrail for real-time safety moderation, and reports state-of-the-art performance in diverse and complex interactive agentic scenarios.
Key contributions
- Development of a lightweight and scalable agent safety alignment framework.
- Introduction of an updated agent safety taxonomy for modern AI risks.
- Creation of a highly efficient agentic safety SFT and RL training environment that reduces deployment overhead in Docker-level environments by two orders of magnitude.
Notable insights
- The use of a taxonomy-guided data engine with influence-function purification to train lightweight variants (0.8B, 2B, 4B, and 8B parameters) on only around 1k samples.
- Deployment of a training-free online guardrail for real-time safety moderation.
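To make the guardrail idea concrete, here is a minimal sketch of how an online safety check might sit between an agent and its tools. It is not the paper's implementation: the abstract gives no interface details, so `Verdict`, `keyword_guard`, and `moderated_step` are hypothetical names, and the keyword check is only a stand-in for a call to a guard model such as AgentDoG 1.5. "Training-free" is read here as meaning the agent itself is not retrained; the guard only inspects each proposed action before it runs.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Verdict:
    safe: bool
    category: str  # hypothetical taxonomy label, not taken from the paper


def keyword_guard(trajectory: List[str], action: str) -> Verdict:
    """Stand-in for a guard model call (assumption: a real deployment
    would send the trajectory and action to the guard model instead)."""
    risky_markers = ("rm -rf", "DROP TABLE")
    for marker in risky_markers:
        if marker in action:
            return Verdict(safe=False, category="destructive_command")
    return Verdict(safe=True, category="none")


def moderated_step(
    trajectory: List[str],
    action: str,
    guard: Callable[[List[str], str], Verdict],
    execute: Callable[[str], str],
) -> Optional[str]:
    """Check the proposed action before execution; block it if flagged."""
    verdict = guard(trajectory, action)
    if not verdict.safe:
        # Record the refusal so later checks can see the blocked attempt.
        trajectory.append(f"BLOCKED[{verdict.category}]: {action}")
        return None
    result = execute(action)
    trajectory.append(f"{action} -> {result}")
    return result
```

Passing the guard as a plain callable keeps the agent loop unchanged when the keyword stand-in is swapped for a real model call.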
Possible limitations
- Not stated in the abstract
Abstract
arXiv:2605.29801v1
Modern open-world agents such as OpenClaw exhibit powerful cross-environment execution capabilities yet introduce broad new safety risk sources. Meanwhile, advanced frontier AI models drastically lower attack barriers, rendering current agent alignment frameworks inadequate for real-world deployment. To tackle these emerging threats, we propose a lightweight and scalable agent safety alignment framework. Specifically, we update the agent safety taxonomy to accommodate emergent risks from Codex and OpenClaw execution scenarios. We further build a taxonomy-guided data engine with influence-function purification to train lightweight AgentDoG 1.5 variants (0.8B, 2B, 4B, and 8B parameters) using only around 1k samples, achieving comparable performance with leading closed-source models (e.g., GPT-5.4). Based on AgentDoG 1.5, we construct a highly efficient agentic safety SFT and RL training environment, which reduces deployment overhead in Docker-level environments by two orders of magnitude. Finally, we deploy AgentDoG 1.5 as a training-free online guardrail for real-time safety moderation. Extensive experimental results indicate that AgentDoG 1.5 achieves state-of-the-art performance in diverse and complex interactive agentic scenarios. All models and datasets are openly released.