Survey on Evaluation of LLM-based Agents

Asaf Yehudai, Lilach Eden, Alan Li, Guy Uziel, Yilun Zhao, Roy Bar-Haim, Arman Cohan, Michal Shmueli-Scheuer

Published Apr 24, 2026

Editorial review: 6.8

Relevance: 0.518

Freshness: 0.000

Why It Matters

What makes this one worth your time

Understanding evaluation methods for LLM-based agents is crucial for developing robust, efficient, and safe AI systems that can effectively interact with dynamic environments.

A survey of evaluation methods for LLM-based agents across multiple perspectives.

Summary

The paper surveys evaluation methods for LLM-based agents from five perspectives: core LLM capabilities, application-specific benchmarks, generalist agent evaluation, the core dimensions of agent benchmarks, and evaluation frameworks and tools for developers.

Key contributions

Comprehensive survey of evaluation methods for LLM-based agents.
Analysis of agent evaluation across five distinct perspectives.

Notable insights

Identifies a trend towards more realistic and challenging evaluations with continuously updated benchmarks.
Highlights critical gaps in current evaluation methods, particularly in cost-efficiency, safety, and robustness.

Possible limitations

Not stated in the abstract

Abstract

arXiv:2503.16416v2

LLM-based agents represent a paradigm shift in AI, enabling autonomous systems to plan, reason, and use tools while interacting with dynamic environments. This paper provides the first comprehensive survey of evaluation methods for these increasingly capable agents. We analyze the field of agent evaluation across five perspectives: (1) core LLM capabilities needed for agentic workflows, such as planning and tool use; (2) application-specific benchmarks, such as those for web and SWE agents; (3) evaluation of generalist agents; (4) analysis of agent benchmarks' core dimensions; and (5) evaluation frameworks and tools for agent developers. Our analysis reveals current trends, including a shift toward more realistic, challenging evaluations with continuously updated benchmarks. We also identify critical gaps that future research must address, particularly in assessing cost-efficiency, safety, and robustness, and in developing fine-grained, scalable evaluation methods.