COMPASS: Cognitive MCTS-Guided Process Alignment for Safe Search Agents
Wenkai Shen, Pengyang Zhou, Jiahe Xu, Jiaming Qian, Haozhe He, Zhihao Huang, Chaochao Chen, Xiaolin Zheng
Why It Matters
LLM-powered search agents reason over multiple steps and call tools, so a harmful intent can be decomposed into sub-queries that each look innocuous yet lead to an unsafe outcome. Safety therefore has to hold across the whole agent workflow, not only at the final answer.
COMPASS targets this problem by combining MCTS-guided synthesis of attack trajectories with step-level supervision of risky intermediate actions.
Summary
The paper introduces COMPASS, a Cognitive MCTS-Guided Process Alignment framework for the safety alignment of LLM-powered search agents. It addresses safety issues that arise from multi-step reasoning and tool use by combining two components: cognitive tree exploration (CTE), which synthesizes stealthy attack trajectories, and introspective step-wise alignment (ISA), which isolates risky intermediate actions so they can be supervised individually.
Key contributions
- Introduction of COMPASS, a framework for safety alignment in search agents.
- Integration of cognitive tree exploration for efficient synthesis of attack trajectories.
- Development of introspective step-wise alignment for isolating risky actions.
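The abstract does not describe how cognitive tree exploration is implemented, so the following is only a minimal sketch of the generic Monte Carlo tree search (MCTS) selection and backpropagation steps that any MCTS-guided method builds on. The `Node` class, the `uct_score` and `select_child` helpers, and the exploration constant `c=1.4` are illustrative assumptions, not the authors' code; the scoring rule is the standard UCB1 formula.

```python
import math
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """One node of a search tree (hypothetical structure, for illustration)."""
    action: str
    visits: int = 0
    total_reward: float = 0.0
    children: List["Node"] = field(default_factory=list)


def uct_score(parent_visits: int, child: Node, c: float = 1.4) -> float:
    """Standard UCB1 score: mean reward plus an exploration bonus."""
    if child.visits == 0:
        return math.inf  # unvisited children are always tried first
    mean = child.total_reward / child.visits
    bonus = c * math.sqrt(math.log(parent_visits) / child.visits)
    return mean + bonus


def select_child(node: Node, c: float = 1.4) -> Node:
    """Pick the child with the highest UCB1 score."""
    return max(node.children, key=lambda ch: uct_score(node.visits, ch, c))


def backpropagate(path: List[Node], reward: float) -> None:
    """Add one rollout's reward to every node on the visited path."""
    for n in path:
        n.visits += 1
        n.total_reward += reward
```

In a full search loop, `select_child` would be applied repeatedly from the root down to a leaf, the leaf would be expanded and evaluated, and `backpropagate` would push the resulting reward back up the path. How COMPASS defines rewards or expands nodes is not stated in the abstract.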
Notable insights
- Cognitive tree exploration is used to synthesize stealthy attack trajectories, which gives the framework a way to surface safety risks that single-turn checks could miss.
- Introspective step-wise alignment allows for fine-grained supervision of intermediate actions, potentially improving safety without compromising utility.
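To make the idea of step-level supervision concrete, here is a small sketch of one way a risky intermediate action could be isolated within a trajectory. It assumes each step already carries a risk score in [0, 1] from some hypothetical judge; the scores, the `threshold` value, and the label names are all assumptions for illustration, since the abstract does not specify how ISA scores or labels steps.

```python
from typing import List, Optional


def first_risky_step(risk_scores: List[float], threshold: float = 0.5) -> Optional[int]:
    """Return the index of the first step whose score reaches the threshold."""
    for i, score in enumerate(risk_scores):
        if score >= threshold:
            return i
    return None  # no step crossed the threshold


def step_labels(risk_scores: List[float], threshold: float = 0.5) -> List[str]:
    """Label steps before the first risky one 'safe', that step 'risky',
    and everything after it 'downstream' (hypothetical label names)."""
    idx = first_risky_step(risk_scores, threshold)
    labels = []
    for i in range(len(risk_scores)):
        if idx is None or i < idx:
            labels.append("safe")
        elif i == idx:
            labels.append("risky")
        else:
            labels.append("downstream")
    return labels


# Hypothetical per-step scores for a four-step trajectory.
print(step_labels([0.1, 0.2, 0.7, 0.9]))
```

Labeling at the step level, as opposed to assigning one label to the whole trajectory, is what allows supervision to target the specific action where a trajectory turns unsafe.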
Possible limitations
- Not stated in the abstract
Abstract
arXiv:2605.30838v1
LLM-powered search agents enable multi-step reasoning and tool use. However, these capabilities introduce retrieval-induced safety degradation, as harmful intents may decompose into seemingly innocuous sub-queries that lead to unsafe outcomes. Existing alignment methods struggle to capture sparse safety signals and fail to supervise diverse violations across multi-step interactions. We propose COMPASS, a Cognitive MCTS-Guided Process Alignment framework designed to achieve robust safety alignment throughout the agent workflow while preserving general utility. COMPASS integrates cognitive tree exploration (CTE) to efficiently synthesize stealthy attack trajectories, and introspective step-wise alignment (ISA) to isolate risky intermediate actions for fine-grained process supervision. Empirical results show that COMPASS achieves a favorable safety-utility trade-off while requiring substantially less training data.