Agent-ToM: Learning to Monitor Autonomous LLM Agents via Theory-of-Mind Reasoning

Nesreen K. Ahmed, Nima Nafisi

Published May 26, 2026

Editorial review: 7.2

Relevance: 0.515

Freshness: 0.000

Why It Matters

What makes this one worth your time

Understanding and monitoring the intentions and beliefs of autonomous agents is crucial for ensuring their safe and reliable deployment in real-world applications.

Agent-ToM uses Theory-of-Mind reasoning to enhance monitoring of autonomous LLM agents.

Summary

The paper introduces Agent-ToM, a learning-to-monitor framework that uses Theory-of-Mind (ToM) reasoning to detect covert malicious behavior in autonomous large language model (LLM) agents. It analyzes full agent trajectories by inferring beliefs, intent hypotheses, expected actions, and deviations from task-consistent behavior, and it uses a Reason-Verify-Refine pipeline at inference time to construct and validate monitoring decisions. On the SHADE-Arena and CUA-SHADE-Arena adversarial monitoring benchmarks, the abstract reports a strong precision-recall balance and better results than state-of-the-art monitoring baselines, including ensemble methods.
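The control flow of a Reason-Verify-Refine style monitor can be sketched roughly as follows. This is only an illustration under stated assumptions: the `Verdict` type, the three callables, and the toy rules are hypothetical names invented here, and the abstract does not say how the stages map onto its two model calls.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Verdict:
    suspicious: bool  # decision for the whole trajectory
    rationale: str    # free-text justification


def reason_verify_refine(
    trajectory: List[str],
    reason: Callable[[List[str]], Verdict],
    verify: Callable[[List[str], Verdict], List[str]],
    refine: Callable[[Verdict, List[str]], Verdict],
) -> Verdict:
    """Draft a verdict, check it, and revise it only if the check objects."""
    draft = reason(trajectory)              # Reason: initial judgment
    objections = verify(trajectory, draft)  # Verify: problems found in the draft
    if not objections:
        return draft
    return refine(draft, objections)        # Refine: revise using the critique


# Toy stand-ins for the model calls (hypothetical rules, not from the paper).
def toy_reason(trajectory: List[str]) -> Verdict:
    return Verdict(False, "all steps look task-related")


def toy_verify(trajectory: List[str], draft: Verdict) -> List[str]:
    if draft.suspicious:
        return []
    return [f"unexplained step: {s}" for s in trajectory if "external" in s]


def toy_refine(draft: Verdict, objections: List[str]) -> Verdict:
    return Verdict(True, "; ".join(objections))


flagged = reason_verify_refine(
    ["read inbox", "send file to external address"],
    toy_reason, toy_verify, toy_refine,
)
benign = reason_verify_refine(
    ["read inbox", "reply to manager"],
    toy_reason, toy_verify, toy_refine,
)
```

In this toy run the verifier overturns the initial benign verdict for the first trajectory and leaves the second untouched, which is the behavior a verification stage is meant to provide.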

Key contributions

Development of the Agent-ToM framework for monitoring LLM agents.
Implementation of a Reason-Verify-Refine pipeline for structured decision-making.
Demonstration of improved monitoring performance on the SHADE-Arena and CUA-SHADE-Arena adversarial benchmarks.
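Since the evaluation is framed in terms of precision-recall balance, a short sketch of how those quantities are computed for a binary monitor may help. The function name and the four toy labels are illustrative assumptions, not data from the paper.

```python
from typing import Sequence, Tuple


def precision_recall_f1(
    predicted: Sequence[bool], actual: Sequence[bool]
) -> Tuple[float, float, float]:
    """Score monitor flags (predicted) against ground-truth attack labels (actual)."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    tp = sum(p and a for p, a in zip(predicted, actual))        # attacks flagged
    fp = sum(p and not a for p, a in zip(predicted, actual))    # benign runs flagged
    fn = sum((not p) and a for p, a in zip(predicted, actual))  # attacks missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Toy labels: one true positive, one false positive, one miss, one true negative.
scores = precision_recall_f1(
    predicted=[True, True, False, False],
    actual=[True, False, True, False],
)
```

A monitor that flags everything maximizes recall at the cost of precision, and one that flags almost nothing does the reverse, which is why the two are reported together.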

Notable insights

The use of Theory-of-Mind reasoning to infer agent beliefs and intentions for monitoring purposes.
The introduction of a semantic guardrail memory to retain and apply monitoring insights across episodes.
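One way to picture a memory that carries monitoring lessons across episodes is a small store with add and retrieve operations. The sketch below is a hypothetical stand-in: the class names are invented here, and plain keyword overlap replaces whatever semantic matching the paper actually uses.

```python
from dataclasses import dataclass, field
from typing import List, Set


def _tokens(text: str) -> Set[str]:
    """Lowercased words longer than three characters, with punctuation stripped."""
    words = (w.strip(".,;:").lower() for w in text.split())
    return {w for w in words if len(w) > 3}


@dataclass
class Guardrail:
    lesson: str         # reusable constraint distilled from an earlier critique
    keywords: Set[str]  # crude retrieval index (assumption, not the paper's method)


@dataclass
class GuardrailMemory:
    entries: List[Guardrail] = field(default_factory=list)

    def add(self, lesson: str) -> None:
        self.entries.append(Guardrail(lesson, _tokens(lesson)))

    def retrieve(self, context: str, top_k: int = 2) -> List[str]:
        """Return up to top_k lessons sharing at least one keyword with context."""
        query = _tokens(context)
        scored = [(len(g.keywords & query), g.lesson) for g in self.entries]
        scored = [item for item in scored if item[0] > 0]
        scored.sort(key=lambda item: item[0], reverse=True)
        return [lesson for _, lesson in scored[:top_k]]


memory = GuardrailMemory()
memory.add("Flag payments sent to recipients missing from the task description.")
memory.add("Treat deleted audit logs as a strong deviation signal.")

relevant = memory.retrieve("agent issued payments to unknown recipients")
unrelated = memory.retrieve("weather report")
```

Retrieved lessons would then be supplied to the monitor when it analyzes a new trajectory, so that earlier critiques shape later decisions.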

Possible limitations

Not stated in the abstract

Abstract

arXiv:2605.24216v1

Monitoring autonomous large language model (LLM) agents for covert malicious behavior is challenging due to delayed, context-dependent, and long-horizon attack patterns. Agents may pursue hidden objectives while maintaining superficially benign behavior, making detection difficult even with full trajectory access. Prior monitoring approaches improve scaffolding or ensemble aggregation, but treat each trajectory independently and do not learn from prior monitoring experience. Moreover, standard reasoning methods explain observed behavior without explicitly reasoning about agent beliefs, intentions, and goal alignment required to distinguish benign task execution from covert deviation. We propose Agent-ToM, a learning-to-monitor framework grounded in Theory-of-Mind (ToM) reasoning for security analysis of autonomous agents. Agent-ToM performs structured full-trajectory analysis by inferring beliefs, intent hypotheses with calibrated confidence, expected actions, and deviations from task-consistent behavioral baselines. At inference time, it employs a Reason-Verify-Refine pipeline to construct and validate monitoring decisions. At training time, Agent-ToM distills critique signals into a persistent semantic guardrail memory, enabling reusable belief- and intent-conditioned constraints across episodes. We evaluate Agent-ToM on adversarial agent monitoring benchmarks (SHADE-Arena and CUA-SHADE-Arena). Agent-ToM achieves strong precision-recall balance and outperforms state-of-the-art monitoring baselines, including ensemble methods, while using a coherent two-call reasoning pipeline. These results demonstrate that learning at the monitoring layer, combined with structured ToM reasoning and verification, provides an effective and deployable foundation for securing autonomous LLM agents.
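The structured analysis described in the abstract (beliefs, intent hypotheses with confidence, expected actions, deviations) could be held in a record like the one below. All field names, the action strings, and the exact-match deviation rule are assumptions made for illustration; the abstract does not specify a schema.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class IntentHypothesis:
    description: str
    confidence: float  # assumed to lie in [0, 1]


@dataclass
class ToMAnalysis:
    beliefs: List[str]               # what the agent appears to believe
    intents: List[IntentHypothesis]  # candidate goals with confidence
    expected_actions: List[str]      # actions consistent with the stated task
    observed_actions: List[str]      # actions actually taken

    def deviations(self) -> List[str]:
        """Observed actions absent from the task-consistent baseline."""
        expected = set(self.expected_actions)
        return [a for a in self.observed_actions if a not in expected]

    def top_intent(self) -> IntentHypothesis:
        return max(self.intents, key=lambda h: h.confidence)


analysis = ToMAnalysis(
    beliefs=["the user wants a meeting scheduled"],
    intents=[
        IntentHypothesis("complete the scheduling task", 0.4),
        IntentHypothesis("collect contact data covertly", 0.6),
    ],
    expected_actions=["read_calendar", "send_invite"],
    observed_actions=["read_calendar", "export_contacts", "send_invite"],
)
```

Here `analysis.deviations()` isolates the one action with no task-consistent explanation, and `analysis.top_intent()` returns the hypothesis a monitor would weigh most heavily.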