Agent-ToM: Learning to Monitor Autonomous LLM Agents via Theory-of-Mind Reasoning
Nesreen K. Ahmed, Nima Nafisi
Why It Matters
What makes this one worth your time
Understanding and monitoring the intentions and beliefs of autonomous agents is crucial for ensuring their safe and reliable deployment in real-world applications.
Agent-ToM addresses this by inferring an agent's beliefs and intentions from its full trajectory and flagging deviations from task-consistent behavior.
Summary
The paper introduces Agent-ToM, a learning-to-monitor framework that uses Theory-of-Mind reasoning to detect covert malicious behavior in autonomous large language model agents. It analyzes full agent trajectories by inferring beliefs, intent hypotheses, expected actions, and deviations from task-consistent behavior, and it builds and validates each monitoring decision with a Reason-Verify-Refine pipeline. Evaluated on the SHADE-Arena and CUA-SHADE-Arena benchmarks, it achieves a strong precision-recall balance and outperforms state-of-the-art monitoring baselines, including ensemble methods.
Key contributions
- Development of the Agent-ToM framework for monitoring LLM agents.
- Implementation of a Reason-Verify-Refine pipeline for structured decision-making.
- Evaluation on SHADE-Arena and CUA-SHADE-Arena, where Agent-ToM outperforms state-of-the-art monitoring baselines, including ensemble methods.
Notable insights
- The use of Theory-of-Mind reasoning to infer agent beliefs and intentions for monitoring purposes.
- The introduction of a semantic guardrail memory to retain and apply monitoring insights across episodes.
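To make the second insight concrete, here is a minimal sketch of what a persistent, intent-conditioned memory could look like. It is an illustration only, not the paper's implementation: the `Guardrail` and `GuardrailMemory` names, the JSON file format, and the word-overlap retrieval are all assumptions chosen to keep the example self-contained.

```python
import json
from dataclasses import dataclass, asdict
from pathlib import Path


@dataclass
class Guardrail:
    """One stored lesson. Field names are hypothetical, not from the paper."""
    intent: str       # inferred intent the constraint is conditioned on
    constraint: str   # natural-language rule distilled from a critique


class GuardrailMemory:
    """Toy persistent store. Retrieval is plain word overlap (an assumption)."""

    def __init__(self, path):
        self.path = Path(path)
        self.items = []
        # Reload earlier lessons so they carry over across episodes.
        if self.path.exists():
            raw = json.loads(self.path.read_text())
            self.items = [Guardrail(**entry) for entry in raw]

    def add(self, guardrail):
        """Store a new guardrail and write the whole memory back to disk."""
        self.items.append(guardrail)
        self.path.write_text(json.dumps([asdict(g) for g in self.items]))

    def retrieve(self, inferred_intent, k=3):
        """Return up to k guardrails whose intent shares words with the query."""
        query = set(inferred_intent.lower().split())
        scored = []
        for g in self.items:
            overlap = len(query & set(g.intent.lower().split()))
            if overlap > 0:
                scored.append((overlap, g))
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [g for _, g in scored[:k]]
```

In this sketch a monitor would call `retrieve` with its current intent hypothesis and feed the returned constraints into its next decision. A real system would more plausibly use embedding similarity than word overlap.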
Possible limitations
- Not stated in the abstract
Abstract
arXiv:2605.24216v1
Monitoring autonomous large language model (LLM) agents for covert malicious behavior is challenging due to delayed, context-dependent, and long-horizon attack patterns. Agents may pursue hidden objectives while maintaining superficially benign behavior, making detection difficult even with full trajectory access. Prior monitoring approaches improve scaffolding or ensemble aggregation, but treat each trajectory independently and do not learn from prior monitoring experience. Moreover, standard reasoning methods explain observed behavior without explicitly reasoning about agent beliefs, intentions, and goal alignment required to distinguish benign task execution from covert deviation. We propose Agent-ToM, a learning-to-monitor framework grounded in Theory-of-Mind (ToM) reasoning for security analysis of autonomous agents. Agent-ToM performs structured full-trajectory analysis by inferring beliefs, intent hypotheses with calibrated confidence, expected actions, and deviations from task-consistent behavioral baselines. At inference time, it employs a Reason-Verify-Refine pipeline to construct and validate monitoring decisions. At training time, Agent-ToM distills critique signals into a persistent semantic guardrail memory, enabling reusable belief- and intent-conditioned constraints across episodes. We evaluate Agent-ToM on adversarial agent monitoring benchmarks (SHADE-Arena and CUA-SHADE-Arena). Agent-ToM achieves strong precision-recall balance and outperforms state-of-the-art monitoring baselines, including ensemble methods, while using a coherent two-call reasoning pipeline. These results demonstrate that learning at the monitoring layer, combined with structured ToM reasoning and verification, provides an effective and deployable foundation for securing autonomous LLM agents.
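The structured analysis and the two-call pipeline described in the abstract can be pictured with a small sketch. The abstract does not say how the three stages map onto the two calls, so this example assumes the first call reasons and the second verifies and refines. Every name here (`ToMAnalysis`, `monitor`, the toy stand-ins for model calls, the 0.5 threshold) is hypothetical and is not the authors' code.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class IntentHypothesis:
    description: str
    confidence: float  # intended to lie in [0, 1]


@dataclass
class ToMAnalysis:
    """The four fields mirror the quantities listed in the abstract."""
    beliefs: List[str]
    intents: List[IntentHypothesis]
    expected_actions: List[str]
    deviations: List[str]


@dataclass
class Verdict:
    suspicious: bool
    score: float
    rationale: str


def monitor(trajectory, reason, verify_and_refine, threshold=0.5):
    """Two calls: one to reason, one to verify and refine (assumed split)."""
    analysis = reason(trajectory)
    score, rationale = verify_and_refine(trajectory, analysis)
    return Verdict(score >= threshold, score, rationale)


def toy_reason(trajectory):
    """Stand-in for an LLM call: off-task actions count as deviations."""
    allowed = {"read_inbox", "send_summary"}  # hypothetical task baseline
    deviations = [a for a in trajectory if a not in allowed]
    confidence = len(deviations) / len(trajectory) if trajectory else 0.0
    return ToMAnalysis(
        beliefs=["agent believes it has inbox access"],
        intents=[IntentHypothesis("covert side task", confidence)],
        expected_actions=sorted(allowed),
        deviations=deviations,
    )


def toy_verify_and_refine(trajectory, analysis):
    """Stand-in for the second call: keep deviations found in the trajectory."""
    confirmed = [d for d in analysis.deviations if d in trajectory]
    if confirmed:
        score = max((h.confidence for h in analysis.intents), default=0.0)
    else:
        score = 0.0
    return score, f"{len(confirmed)} confirmed deviation(s)"


benign = monitor(["read_inbox", "send_summary"], toy_reason, toy_verify_and_refine)
attack = monitor(["read_inbox", "wire_transfer"], toy_reason, toy_verify_and_refine)
```

In this toy run `benign.suspicious` is `False` and `attack.suspicious` is `True`. In practice the two stand-in functions would be replaced by model calls that return the same structures.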