Multi-Agent Teams Hold Experts Back
Aneesh Pappu, Batu El, Hancheng Cao, Carmelo di Nolfo, Yanchao Sun, Meng Cao, James Zou
Why It Matters
What makes this one worth your time
Multi-agent LLM systems are increasingly deployed as autonomous collaborators, so it matters whether a self-organizing team can actually use the expertise of its strongest member.
In this study they could not: self-organizing LLM teams consistently fell short of their own expert agent, even when explicitly told who the expert was, with performance losses of up to 41.1% on ML benchmarks.
Summary
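The paper tests whether self-organizing multi-agent LLM teams achieve strong synergy, meaning team performance that matches or exceeds the best individual member. Across human-inspired and frontier ML benchmarks, the teams consistently fail to match their expert agent, losing up to 41.1% on ML benchmarks. The main cause is a failure to leverage expertise, not a failure to identify the expert.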
Key contributions
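- Empirical evaluation of self-organizing LLM teams against their best individual member (the expert agent) on human-inspired and frontier ML benchmarks.
- Decomposition of the failure showing that leveraging expertise, rather than identifying the expert, is the primary bottleneck.
- Conversational analysis revealing integrative compromise: teams average expert and non-expert views instead of weighting expertise appropriately.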
Notable insights
- Integrative compromise becomes more common as team size grows and correlates negatively with performance.
- The same consensus-seeking behavior improves robustness to adversarial agents, suggesting a trade-off between alignment and effective use of expertise.
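To make the averaging intuition concrete, here is a toy numeric analogy in Python. It is not the paper's conversational analysis: the function names, the constants, and the idea of reducing each agent's view to a single number are all assumptions made for illustration. It shows how an unweighted average drifts away from the expert as more non-experts join, while an expertise-weighted view does not.

```python
from typing import Sequence

# Hypothetical values chosen for illustration only (not from the paper).
TRUTH = 100.0       # the correct answer
EXPERT = 98.0       # the expert's estimate, close to the truth
NON_EXPERT = 70.0   # every non-expert shares the same biased estimate


def average_view(estimates: Sequence[float]) -> float:
    """Integrative compromise: every view counts equally."""
    return sum(estimates) / len(estimates)


def weighted_view(estimates: Sequence[float], weights: Sequence[float]) -> float:
    """Expertise weighting: views count in proportion to their weights."""
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive number")
    return sum(e * w for e, w in zip(estimates, weights)) / total


def team_error_when_averaging(n_non_experts: int) -> float:
    """Absolute error of a team of one expert plus n non-experts that averages."""
    estimates = [EXPERT] + [NON_EXPERT] * n_non_experts
    return abs(average_view(estimates) - TRUTH)


if __name__ == "__main__":
    for n in (1, 3, 7):
        print(f"{n} non-experts: averaging error = {team_error_when_averaging(n):.2f}")
    # Putting all weight on the expert recovers the expert's own estimate.
    print("expert-weighted view:", weighted_view([EXPERT, NON_EXPERT], [1.0, 0.0]))
```

In this toy setup the expert alone is off by 2, while the averaging team's error grows as non-experts are added, because the expert's share of the average shrinks. Real LLM teams negotiate in natural language, so this is only an analogy for the dilution effect.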
Possible limitations
- Not stated in the abstract.
Abstract
arXiv:2602.01011v4
Multi-agent LLM systems are increasingly deployed as autonomous collaborators, where agents interact freely rather than execute fixed, pre-specified workflows. In such settings, effective coordination cannot be fully designed in advance and must instead emerge through interaction. However, most prior work enforces coordination through fixed roles, workflows, or aggregation rules, leaving open the question of how well self-organizing teams perform when coordination is unconstrained. Drawing on organizational psychology, we study whether self-organizing LLM teams achieve strong synergy, where team performance matches or exceeds the best individual member. Across human-inspired and frontier ML benchmarks, we find that -- unlike human teams -- LLM teams consistently fail to match their expert agent's performance, even when explicitly told who the expert is, incurring performance losses of up to 41.1% on ML benchmarks. Decomposing this failure, we show that expert leveraging, rather than identification, is the primary bottleneck. Conversational analysis reveals a tendency toward integrative compromise -- averaging expert and non-expert views rather than appropriately weighting expertise -- which increases with team size and correlates negatively with performance. Interestingly, this consensus-seeking behavior improves robustness to adversarial agents, suggesting a trade-off between alignment and effective expertise utilization. Our findings reveal a significant gap in the ability of self-organizing multi-agent teams to harness the collective expertise of their members.
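The strong-synergy criterion in the abstract can be written as a short check. The Python sketch below assumes higher scores are better and takes the expert to be the best-scoring individual member. The function names and example scores are hypothetical, and the relative-loss formula is one plausible reading of "performance loss"; the abstract does not give the paper's exact definition.

```python
from typing import Sequence


def has_strong_synergy(team_score: float, member_scores: Sequence[float]) -> bool:
    """Strong synergy: the team matches or exceeds its best individual member."""
    return team_score >= max(member_scores)


def relative_loss(team_score: float, expert_score: float) -> float:
    """Fraction of the expert's score that the team gives up.

    Assumed definition: (expert - team) / expert. Positive means the team
    is worse than its expert; zero or negative means strong synergy.
    """
    if expert_score == 0:
        raise ValueError("expert_score must be non-zero")
    return (expert_score - team_score) / expert_score


if __name__ == "__main__":
    # Made-up scores for illustration; these are not results from the paper.
    members = [0.50, 0.62, 0.80]
    team = 0.60
    print("strong synergy:", has_strong_synergy(team, members))
    print(f"relative loss vs. expert: {relative_loss(team, max(members)):.1%}")
```

Under this reading, the reported loss of up to 41.1% would mean the team recovered as little as roughly 59% of what its expert achieved alone.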