AgentDoG: A Diagnostic Guardrail Framework for AI Agent Safety and Security

Dongrui Liu, Qihan Ren, Chen Qian, Shuai Shao, Yuejin Xie, Yu Li, Zhonghao Yang, Haoyu Luo, Peng Wang, Qingyu Liu, Binxin Hu, Ling Tang, Jilin Mei, Dadi Guo, Leitao Yuan, Junyao Yang, Guanxu Chen, Qihao Lin, Yi Yu, Bo Zhang, Jiaxuan Guo, Jie Zhang, Wenqi Shao, Huiqi Deng, Zhiheng Xi, Wenjie Wang, Wenxuan Wang, Wen Shen, Zhikai Chen, Haoyu Xie, Jialing Tao, Juntao Dai, Jiaming Ji, Zhongjie Ba, Linfeng Zhang, Yong Liu, Quanshi Zhang, Lei Zhu, Zhihua Wei, Hui Xue, Chaochao Lu, Jing Shao, Xia Hu

Published Apr 24, 2026

Editorial review: 7.0

Relevance: 0.497

Freshness: 0.000

Why It Matters

What makes this one worth your time

AI engineers and researchers should care because AgentDoG offers a structured approach to diagnosing and mitigating complex safety and security risks in AI agents, which is crucial for deploying reliable autonomous systems.

AgentDoG enhances AI agent safety with a novel diagnostic framework and benchmark.

Summary

The paper introduces AgentDoG, a framework for enhancing AI agent safety and security through a three-dimensional taxonomy of agentic risks and a new safety benchmark called ATBench. It provides fine-grained monitoring and diagnosis of agent behaviors, offering transparency and provenance beyond binary safety labels.

Key contributions

Introduction of a three-dimensional taxonomy for categorizing agentic risks.
Development of a new agentic safety benchmark, ATBench.
Creation of the AgentDoG framework for fine-grained monitoring and diagnosis of agent behaviors.

Notable insights

The use of a three-dimensional taxonomy to categorize agentic risks provides a structured approach to understanding and mitigating risks.
AgentDoG's ability to diagnose root causes of unsafe actions offers a level of transparency and provenance not typically available in current safety frameworks.

Possible limitations

Not stated in the abstract

Abstract

arXiv:2601.18491v2. The rise of AI agents introduces complex safety and security challenges arising from autonomous tool use and environmental interactions. Current guardrail models lack agentic risk awareness and transparency in risk diagnosis. To introduce an agentic guardrail that covers complex and numerous risky behaviors, we first propose a unified three-dimensional taxonomy that orthogonally categorizes agentic risks by their source (where), failure mode (how), and consequence (what). Guided by this structured and hierarchical taxonomy, we introduce a new fine-grained agentic safety benchmark (ATBench) and a Diagnostic Guardrail framework for agent safety and security (AgentDoG). AgentDoG provides fine-grained and contextual monitoring across agent trajectories. More crucially, AgentDoG can diagnose the root causes of unsafe actions and seemingly safe but unreasonable actions, offering provenance and transparency beyond binary labels to facilitate effective agent alignment. AgentDoG variants are available in three sizes (4B, 7B, and 8B parameters) across Qwen and Llama model families. Extensive experimental results demonstrate that AgentDoG achieves state-of-the-art performance in agentic safety moderation in diverse and complex interactive scenarios. All models and datasets are openly released.
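
To make the "beyond binary labels" idea concrete, here is a minimal Python sketch of what a diagnostic verdict organized along the three taxonomy axes (source, failure mode, consequence) might look like. It is an illustration only, not AgentDoG's actual output format or API: the class names `RiskDiagnosis` and `Verdict`, the `step_index` field, and the category strings in the usage example are all hypothetical and do not come from the paper.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RiskDiagnosis:
    """One point in a three-axis risk taxonomy (category values are hypothetical)."""
    source: str        # where the risk originates
    failure_mode: str  # how the agent goes wrong
    consequence: str   # what harm results


@dataclass(frozen=True)
class Verdict:
    """A guardrail result that carries a diagnosis, not only a binary label."""
    safe: bool
    diagnosis: Optional[RiskDiagnosis] = None
    step_index: Optional[int] = None  # trajectory step the diagnosis points to

    def describe(self) -> str:
        """Render the verdict as a short human-readable string."""
        if self.safe:
            return "safe"
        if self.diagnosis is None:
            return "unsafe (no diagnosis available)"
        d = self.diagnosis
        return (
            f"unsafe at step {self.step_index}: "
            f"source={d.source}, "
            f"failure_mode={d.failure_mode}, "
            f"consequence={d.consequence}"
        )


# Usage with made-up category names, purely for illustration.
example = Verdict(
    safe=False,
    diagnosis=RiskDiagnosis(
        source="tool_output",
        failure_mode="unverified_instruction_following",
        consequence="data_exfiltration",
    ),
    step_index=3,
)
print(example.describe())
```

A binary guardrail would report only the `safe` flag; keeping the three axes as separate fields is one way a downstream consumer could filter or aggregate flagged trajectories by where the risk came from, how the agent failed, or what the outcome was.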