
The Amazing Agent Race: Strong Tool Users, Weak Navigators

Zae Myung Kim, Dongseok Lee, Jaehyung Kim, Vipul Raheja, Dongyeop Kang

Score: 8.500
LLM: n/a
Embedding: 0.454
Recency: n/a

Why It Matters

This research highlights the limitations of current tool-use benchmarks by demonstrating that agents struggle more with navigation than tool execution, prompting a reevaluation of how we assess AI capabilities.

Contributions

  • Introduction of a new benchmark (AAR) with DAG puzzles for better evaluation of LLM agents' tool-use and navigation skills.

Insights

  • Agents excel at tool execution but falter in navigating complex information structures.

Limitations

  • The benchmark may require extensive computational resources for evaluation due to its complexity.

Tags

  • agent
  • benchmark
  • evaluation
  • llm

Abstract

arXiv:2604.10261v1

Existing tool-use benchmarks for LLM agents are overwhelmingly linear: our analysis of six benchmarks shows 55 to 100% of instances are simple chains of 2 to 5 steps. We introduce The Amazing Agent Race (AAR), a benchmark featuring directed acyclic graph (DAG) puzzles (or "legs") with fork-merge tool chains. We release 1,400 instances across two variants: sequential (800 legs) and compositional (600 DAG legs). Agents must navigate Wikipedia, execute multi-step tool chains, and aggregate results into a verifiable answer. Legs are procedurally generated from Wikipedia seeds across four difficulty levels with live-API validation. Three complementary metrics (finish-line accuracy, pit-stop visit rate, and roadblock completion rate) separately diagnose navigation, tool-use, and arithmetic failures. Evaluating three agent frameworks on 1,400 legs, the best achieves only 37.2% accuracy. Navigation errors dominate (27 to 52% of trials) while tool-use errors remain below 17%, and agent architecture matters as much as model scale (Claude Code matches Codex CLI at 37% with 6x fewer tokens). The compositional structure of AAR reveals that agents fail not at calling tools but at navigating to the right pages, a blind spot invisible to linear benchmarks. The project page can be accessed at: https://minnesotanlp.github.io/the-amazing-agent-race
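The abstract's core ideas (a DAG "leg" with fork-merge structure, scored by three separate metrics) can be sketched in code. The sketch below is illustrative only: the node names, the `Leg` data structure, and the exact metric formulas are assumptions for clarity, not the paper's released implementation.

```python
from dataclasses import dataclass

# Hypothetical sketch of an AAR-style "leg": a DAG of required steps
# whose fork branches merge into an aggregation node. All names and
# metric definitions here are illustrative assumptions.
@dataclass
class Leg:
    nodes: set          # required page-visit / tool-call steps
    edges: dict         # node -> list of successor nodes (a DAG)
    answer: str         # the verifiable final answer

def evaluate(leg: Leg, trace: list, final_answer: str) -> dict:
    """Score one trial with three complementary metrics."""
    visited = set(trace)
    total_edges = sum(len(vs) for vs in leg.edges.values())
    return {
        # finish-line accuracy: did the agent produce the verifiable answer?
        "finish_line": float(final_answer == leg.answer),
        # pit-stop visit rate: fraction of required nodes the agent reached
        "pit_stop": len(visited & leg.nodes) / len(leg.nodes),
        # roadblock completion rate: fraction of DAG edges traversed in order
        "roadblock": sum(
            1 for u, vs in leg.edges.items() for v in vs
            if u in trace and v in trace and trace.index(u) < trace.index(v)
        ) / total_edges,
    }

# A toy fork-merge leg: the seed forks into two lookups that merge
# at an aggregation step producing the final answer.
leg = Leg(
    nodes={"seed", "lookup_a", "lookup_b", "aggregate"},
    edges={"seed": ["lookup_a", "lookup_b"],
           "lookup_a": ["aggregate"], "lookup_b": ["aggregate"]},
    answer="42",
)
# An agent that skipped one fork branch: correct answer, incomplete navigation.
scores = evaluate(leg, ["seed", "lookup_a", "aggregate"], "42")
```

Separating the metrics this way is what lets the benchmark attribute a failure to navigation (low pit-stop rate) rather than tool execution, even when the final answer happens to be correct.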