Measuring Agents in Production

Melissa Z. Pan, Negar Arabzadeh, Riccardo Cogo, Yuxuan Zhu, Alexander Xiong, Lakshya A Agrawal, Huanzhi Mao, Emma Shen, Sid Pallerla, Liana Patel, Shu Liu, Tianneng Shi, Xiaoyuan Liu, Jared Quincy Davis, Emmanuele Lacavalla, Alessandro Basile, Shuyi Yang, Paul Castro, Daniel Kang, Koushik Sen, Dawn Song, Joseph E. Gonzalez, Ion Stoica, Matei Zaharia, Marquita Ellis

Published Jun 8, 2026. Featured #9. In the daily list Jun 9, 2026.

Daily score: 63.4

Editorial review: 7.2

Relevance: 0.454

Freshness: 0.722

Why It Matters

What makes this one worth your time

This study documents how LLM-based agents are built and evaluated in production, using first-hand data from agent developers. Its findings on common practices and open challenges give researchers visibility into deployment realities and point to underexplored research directions.

A comprehensive analysis of the current state and challenges of LLM-based agents in production.

Summary

The paper presents a systematic study of LLM-based agents in production, analyzing 20 case studies and surveying 86 practitioners to understand their development, evaluation, and challenges.

Key contributions

Documentation of the current state of LLM-based agents in various industries.
Identification of key challenges faced by practitioners in agent development.
Insights into the methodologies employed in the deployment of production agents.

Notable insights

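Most production agents use simple, controllable methods: 68% execute at most 10 steps before human intervention, and 70% rely on prompting off-the-shelf models instead of weight tuning. This indicates a preference for reliability over complexity.
74% of production agents depend primarily on human evaluation, which suggests a gap in automated evaluation methods for agent performance.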

Possible limitations

Not stated in the abstract.

Abstract

arXiv:2512.04123v4

LLM-based agents already operate in production across many industries, yet we lack an understanding of what technical methods make deployments successful. We present the first systematic study of Measuring Agents in Production, MAP, using first-hand data from agent developers. We conducted 20 case studies via in-depth interviews and surveyed 86 deployed systems practitioners across 26 domains. We investigate why organizations build agents, how they build them, how they evaluate them, and their top development challenges. Our study finds that production agents are built using simple, controllable approaches: 68% execute at most 10 steps before human intervention, 70% rely on prompting off-the-shelf models instead of weight tuning, and 74% depend primarily on human evaluation. Reliability (consistent correct behavior over time) remains the top development challenge, which practitioners currently address through systems-level design. MAP documents the current state of production agents, providing the research community with visibility into deployment realities and underexplored research avenues.
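To make the "at most 10 steps before human intervention" pattern concrete, here is a minimal sketch of a step-bounded agent loop that hands control to a human once its budget runs out. It is an illustration only, not the method of any system in the study: `run_with_step_budget`, `RunResult`, the `FINAL:` convention, and both toy policies are hypothetical names introduced here, and the budget of 10 is borrowed from the abstract's figure.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional

# Assumption: a budget of 10 mirrors the abstract's "at most 10 steps" figure.
MAX_STEPS = 10


@dataclass
class RunResult:
    answer: Optional[str]  # final answer, or None if the run was escalated
    escalated: bool        # True when control was handed to a human
    trace: List[str] = field(default_factory=list)  # actions taken, in order


def run_with_step_budget(
    next_action: Callable[[List[str]], str],
    max_steps: int = MAX_STEPS,
) -> RunResult:
    """Run a hypothetical agent policy until it finishes or exhausts its budget."""
    trace: List[str] = []
    for _ in range(max_steps):
        action = next_action(trace)
        trace.append(action)
        # Hypothetical convention: an action prefixed with "FINAL:" ends the run.
        if action.startswith("FINAL:"):
            answer = action[len("FINAL:"):].strip()
            return RunResult(answer=answer, escalated=False, trace=trace)
    # Budget exhausted without a final answer: escalate to a human reviewer.
    return RunResult(answer=None, escalated=True, trace=trace)


def toy_policy(trace: List[str]) -> str:
    """Stand-in for a model call: finishes on its third action."""
    return "FINAL: done" if len(trace) == 2 else f"step {len(trace) + 1}"


def looping_policy(trace: List[str]) -> str:
    """Stand-in for a model call that never finishes."""
    return "search again"


if __name__ == "__main__":
    print(run_with_step_budget(toy_policy))
    print(run_with_step_budget(looping_policy).escalated)
```

A hard step cap is one simple way to keep an agent's behavior bounded and reviewable. In a real deployment the escalation branch would route the trace to a person rather than just set a flag.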