Do More Agents Help? Controlled and Protocol-Aligned Evaluation of LLM Agent Workflows
Yuhang Fu, Ruishan Fang, Jiaqi Shao, Huiyu Zheng, Zhengtao Zhu, Bing Luo, Tao Lin
Why It Matters
What makes this one worth your time
Knowing whether extra agents improve accuracy once evaluation conditions are matched helps practitioners decide when a multi-agent design justifies its cost.
BenchAgent compares single-agent and multi-agent LLM workflows under one unified protocol.
Summary
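The paper introduces BenchAgent, an evaluation framework that places single-agent, fixed multi-agent (MAS), and evolving MAS workflows for large language models (LLMs) under one normalized execution and logging protocol. It evaluates these workflows across ten reasoning, coding, and tool-use benchmarks using GPT-4.1, and separately reports a Protocol-Aligned External (PAE) GAIA study of a runtime-generated workflow. Of the six MAS tested, only EvoAgent lies within the Wilson one-run guidance of the matched single-agent anchor; the other five trail it by 2.56 to 11.29 points of benchmark-balanced average accuracy and occupy more expensive accuracy-cost trade-offs. In the PAE GAIA study, a Claude-Code-style runtime workflow reaches 66.72% overall and 69.23% on Level 3, more than 20 points above Jarvis, a fixed MAS and the strongest non-Claude baseline.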
Key contributions
- Development of BenchAgent, a framework for evaluating LLM workflows.
- Comparison of single-agent and multi-agent systems under a controlled protocol.
- Introduction of a Protocol-Aligned External (PAE) GAIA study for runtime-generated workflows.
Notable insights
- The introduction of a unified protocol for evaluating LLM workflows could standardize comparisons across different agent configurations.
- More agents do not necessarily improve performance: five of the six tested MAS trail the matched single-agent anchor by 2.56 to 11.29 points while occupying more expensive accuracy-cost trade-offs.
Possible limitations
- Not stated in the abstract
Abstract
arXiv:2606.05670v1
Does adding more agents help an LLM workflow once compared systems share the same benchmark loader, tool access, answer contract, usage accounting, and trajectory logging? We introduce BenchAgent, an evaluation framework that places single-agent, fixed multi-agent (MAS), and evolving MAS workflows under one normalized execution and logging protocol. BenchAgent evaluates these substrate-internal workflows across ten reasoning, coding, and tool-use benchmarks with GPT-4.1, and separately reports a Protocol-Aligned External (PAE) GAIA study of a runtime-generated workflow. Under SI conditions, at most one of six tested MAS exceeds the matched single-agent anchor on benchmark-balanced average accuracy: EvoAgent lies within the Wilson one-run guidance, while the remaining five trail by 2.56-11.29 points and occupy more expensive accuracy-cost trade-offs. On the PAE GAIA snapshot, a Claude-Code-style runtime workflow reaches 66.72% overall and 69.23% on Level 3, more than 20 points above the strongest non-Claude baseline, Jarvis, a fixed MAS.
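
The abstract relies on two statistics it does not define: "benchmark-balanced average accuracy" and the "Wilson one-run guidance". The sketch below shows one plausible reading of each, not the paper's implementation. It assumes the balanced average is an unweighted mean of per-benchmark accuracies, and that the guidance is a standard Wilson score interval computed from a single run. The function names and the example counts are hypothetical.

```python
import math


def wilson_interval(correct: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 gives ~95%).

    Assumption: the paper's "Wilson one-run guidance" is an interval of this
    kind computed from one evaluation run; the abstract does not spell it out.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    p = correct / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    half_width = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return center - half_width, center + half_width


def benchmark_balanced_accuracy(per_benchmark: dict[str, tuple[int, int]]) -> float:
    """Unweighted mean of per-benchmark accuracies.

    Each benchmark counts equally regardless of how many tasks it contains.
    Assumption: this is what "benchmark-balanced" means in the abstract.
    """
    accuracies = [correct / total for correct, total in per_benchmark.values()]
    return sum(accuracies) / len(accuracies)


# Hypothetical (correct, total) counts, not figures from the paper.
results = {"bench_a": (50, 100), "bench_b": (9, 10)}
balanced = benchmark_balanced_accuracy(results)
low, high = wilson_interval(50, 100)
print(f"balanced accuracy: {balanced:.3f}")
print(f"Wilson interval for 50/100: ({low:.4f}, {high:.4f})")
```

Under this reading, a MAS whose accuracy falls inside the single-agent anchor's interval cannot be distinguished from the anchor on one run. Averaging per benchmark rather than pooling all tasks keeps a large benchmark from dominating the headline number: pooling the example counts above would give 59/110, well below the balanced value.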