
The Amazing Agent Race: Strong Tool Users, Weak Navigators

Zae Myung Kim, Dongseok Lee, Jaehyung Kim, Vipul Raheja, Dongyeop Kang

Score: 8.500
LLM: n/a
Embedding: 0.454
Recency: n/a

Why It Matters

This research highlights the limitations of current tool-use benchmarks by demonstrating that agents struggle more with navigation than tool execution, prompting a reevaluation of how we assess AI capabilities.

Contributions

  • Introduction of a new benchmark (AAR) with DAG puzzles for better evaluation of LLM agents' tool-use and navigation skills.

Insights

  • Agents excel at tool execution but falter in navigating complex information structures.

Limitations

  • The benchmark may require extensive computational resources for evaluation due to its complexity.

Tags

  • agent
  • benchmark
  • evaluation
  • llm

Abstract

arXiv:2604.10261v1

Existing tool-use benchmarks for LLM agents are overwhelmingly linear: our analysis of six benchmarks shows 55 to 100% of instances are simple chains of 2 to 5 steps. We introduce The Amazing Agent Race (AAR), a benchmark featuring directed acyclic graph (DAG) puzzles (or "legs") with fork-merge tool chains. We release 1,400 instances across two variants: sequential (800 legs) and compositional (600 DAG legs). Agents must navigate Wikipedia, execute multi-step tool chains, and aggregate results into a verifiable answer. Legs are procedurally generated from Wikipedia seeds across four difficulty levels with live-API validation. Three complementary metrics (finish-line accuracy, pit-stop visit rate, and roadblock completion rate) separately diagnose navigation, tool-use, and arithmetic failures. Evaluating three agent frameworks on 1,400 legs, the best achieves only 37.2% accuracy. Navigation errors dominate (27 to 52% of trials) while tool-use errors remain below 17%, and agent architecture matters as much as model scale (Claude Code matches Codex CLI at 37% with 6x fewer tokens). The compositional structure of AAR reveals that agents fail not at calling tools but at navigating to the right pages, a blind spot invisible to linear benchmarks. The project page can be accessed at: https://minnesotanlp.github.io/the-amazing-agent-race
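The abstract's core ideas (a DAG "leg" with fork-merge structure, scored by three separate metrics) can be sketched in code. The sketch below is illustrative only: the node names, the `Leg` data structure, and the exact metric formulas are assumptions for clarity, not the paper's released implementation.

```python
from dataclasses import dataclass

# Hypothetical sketch of an AAR-style "leg": a DAG of required steps
# whose fork branches merge into an aggregation node. All names and
# metric definitions here are illustrative assumptions.
@dataclass
class Leg:
    nodes: set          # required page-visit / tool-call steps
    edges: dict         # node -> list of successor nodes (a DAG)
    answer: str         # the verifiable final answer

def evaluate(leg: Leg, trace: list, final_answer: str) -> dict:
    """Score one trial with three complementary metrics."""
    visited = set(trace)
    total_edges = sum(len(vs) for vs in leg.edges.values())
    return {
        # finish-line accuracy: did the agent produce the verifiable answer?
        "finish_line": float(final_answer == leg.answer),
        # pit-stop visit rate: fraction of required nodes the agent reached
        "pit_stop": len(visited & leg.nodes) / len(leg.nodes),
        # roadblock completion rate: fraction of DAG edges traversed in order
        "roadblock": sum(
            1 for u, vs in leg.edges.items() for v in vs
            if u in trace and v in trace and trace.index(u) < trace.index(v)
        ) / total_edges,
    }

# A toy fork-merge leg: the seed forks into two lookups that merge
# at an aggregation step producing the final answer.
leg = Leg(
    nodes={"seed", "lookup_a", "lookup_b", "aggregate"},
    edges={"seed": ["lookup_a", "lookup_b"],
           "lookup_a": ["aggregate"], "lookup_b": ["aggregate"]},
    answer="42",
)
# An agent that skipped one fork branch: correct answer, incomplete navigation.
scores = evaluate(leg, ["seed", "lookup_a", "aggregate"], "42")
```

Separating the metrics this way is what lets the benchmark attribute a failure to navigation (low pit-stop rate) rather than tool execution, even when the final answer happens to be correct.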