Towards a Science of AI Agent Reliability
Stephan Rabanser, Sayash Kapoor, Peter Kirgis, Kangheng Liu, Saiteja Utpala, Arvind Narayanan
Why It Matters
What makes this one worth your time
Understanding and improving the reliability of AI agents is crucial for deploying them in safety-critical applications, where failures can have serious consequences.
This work introduces a framework of twelve metrics for assessing agent reliability beyond a single success score.
Summary
The paper proposes twelve metrics that evaluate AI agent reliability along four dimensions: consistency, robustness, predictability, and safety. An evaluation of 15 models across two benchmarks finds that recent capability gains have yielded only small improvements in reliability.
Key contributions
- Introduction of twelve concrete metrics for evaluating AI agent reliability.
- Decomposition of reliability into four key dimensions: consistency, robustness, predictability, and safety.
- Empirical evaluation of 15 models across two benchmarks to assess the proposed metrics.
Notable insights
- The proposed metrics provide a nuanced view of agent performance that traditional success metrics overlook.
- The findings suggest that despite advancements in AI capabilities, reliability remains a significant challenge.
Possible limitations
- Not stated in the abstract.
Abstract
arXiv:2602.16666v3
AI agents are increasingly deployed to execute important tasks. While rising accuracy scores on standard benchmarks suggest rapid progress, many agents still continue to fail in practice. This discrepancy highlights a fundamental limitation of current evaluations: compressing agent behavior into a single success metric obscures critical operational flaws. Notably, it ignores whether agents behave consistently across runs, withstand perturbations, fail predictably, or have bounded error severity. Grounded in safety-critical engineering, we provide a holistic performance profile by proposing twelve concrete metrics that decompose agent reliability along four key dimensions: consistency, robustness, predictability, and safety. Evaluating 15 models across two complementary benchmarks, we find that recent capability gains have only yielded small improvements in reliability. By exposing these persistent limitations, our metrics complement traditional evaluations while offering tools for reasoning about how agents perform, degrade, and fail.
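To make concrete the abstract's point that a single success metric can hide whether an agent behaves consistently across runs, here is a minimal Python sketch. It is an illustration only: the abstract does not define the twelve metrics, so `outcome_agreement` is a hypothetical stand-in rather than one of the paper's metrics, and the two agents and their run outcomes are invented toy data.

```python
from statistics import mean

def success_rate(runs_by_task):
    """Average of per-task success rates: the usual single score."""
    return mean(
        mean(1.0 if ok else 0.0 for ok in runs)
        for runs in runs_by_task.values()
    )

def outcome_agreement(runs_by_task):
    """Hypothetical consistency proxy (not from the paper): the fraction
    of tasks whose repeated runs all produced the same outcome."""
    agree = [len(set(runs)) == 1 for runs in runs_by_task.values()]
    return sum(agree) / len(agree)

# Invented toy data: each task is attempted four times by each agent.
steady = {"t1": [True, True, True, True], "t2": [False, False, False, False]}
flaky = {"t1": [True, False, True, False], "t2": [False, True, False, True]}

for name, agent in [("steady", steady), ("flaky", flaky)]:
    print(name, success_rate(agent), outcome_agreement(agent))
```

Both toy agents get the same success rate, yet one always repeats its outcome on a given task and the other never does. A single aggregate score cannot express that difference.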