Don't Make the LLM Read the Graph: Make the Graph Think

Yuqi Sun, Tianqin Meng, George Liu, Yashraj Panwar, Lakshya Chaudhry, Munasib Ilham, Aman Chadha

Published Jun 8, 2026 · Featured #7 · In the daily list Jun 9, 2026


Daily score: 66.4

Editorial review: 7.2

Relevance: 0.496

Freshness: 0.722

Why It Matters

What makes this one worth your time

Explicit belief graphs can improve LLM reasoning in cooperative multi-agent tasks, but the abstract reports that the gain depends on how the graph is integrated: supplied as prompt context it helps only weaker models, while used to gate action selection it helps strong models as well.

Summary

The paper tests whether explicit belief graphs improve the performance of large language models (LLMs) in cooperative multi-agent reasoning, using the card game Hanabi. It compares integration architectures and identifies the conditions under which belief graphs help, describes a model-family-specific failure termed 'Planner Defiance,' and reports that inter-agent conventions outperform single-agent interventions. A preliminary analysis also examines the cost-benefit ratio of graph depth.

Key contributions

Evaluation of belief graph integration architectures in LLMs for cooperative reasoning.
Identification of 'Planner Defiance,' a model-family-specific failure mode in which LLMs override correct planner recommendations.
Analysis of graph depth's impact on performance and cost-benefit ratio.

Notable insights

Belief graphs are structurally essential for strong models when used to gate action selection.
Inter-agent conventions significantly outperform single-agent interventions in cooperative settings.
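The difference between the two integration styles named above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the paper's implementation: `RankedAction`, `shortlist`, `choose_with_context`, `choose_with_gate`, and the `llm_choose` callable are all hypothetical names, and the fallback to the top-ranked action is an assumed design choice that the abstract does not describe.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical stand-in for an LLM call: takes a prompt and the offered
# options, returns the chosen action as a string.
LlmChoose = Callable[[str, list[str]], str]


@dataclass(frozen=True)
class RankedAction:
    action: str
    score: float  # assumed planner score derived from the belief graph


def shortlist(ranked: list[RankedAction], k: int) -> list[str]:
    """Return the k highest-scoring actions, best first."""
    top = sorted(ranked, key=lambda r: r.score, reverse=True)[:k]
    return [r.action for r in top]


def choose_with_context(legal: list[str], graph_summary: str,
                        llm_choose: LlmChoose) -> str:
    """Graph as prompt context: the model sees the beliefs but may pick
    any legal action, so nothing stops it from ignoring the graph."""
    prompt = f"Beliefs:\n{graph_summary}\nLegal actions: {', '.join(legal)}"
    return llm_choose(prompt, legal)


def choose_with_gate(ranked: list[RankedAction], k: int,
                     llm_choose: LlmChoose) -> str:
    """Graph as gate: the model may only pick from the ranked shortlist.
    Assumed fallback: an out-of-shortlist answer becomes the top-ranked action."""
    allowed = shortlist(ranked, k)
    prompt = f"Choose exactly one of: {', '.join(allowed)}"
    choice = llm_choose(prompt, allowed)
    return choice if choice in allowed else allowed[0]


# Usage with a stub model that always insists on the same action.
ranked = [
    RankedAction("play card 2", 0.9),
    RankedAction("hint red", 0.7),
    RankedAction("discard card 0", 0.1),
]
stubborn_model: LlmChoose = lambda prompt, options: "discard card 0"

print(choose_with_context([r.action for r in ranked], "(summary)", stubborn_model))
print(choose_with_gate(ranked, 2, stubborn_model))
```

With the stub model, the context variant returns the model's own pick unchanged, while the gated variant can only return an action from the shortlist. That structural constraint is one plausible reading of why gating is described as essential rather than decorative.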

Possible limitations

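The abstract does not list limitations explicitly, but it labels the scaling analysis as preliminary and exploratory (N=10 per cell).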

Abstract

arXiv:2604.23057v2

We investigate whether explicit belief graphs improve LLM performance in cooperative multi-agent reasoning. Through 3,000+ controlled trials across four LLM families in the cooperative card game Hanabi, we establish four findings. First, integration architecture determines whether belief graphs provide value: as prompt context, graphs are decorative for strong models and beneficial only for weak models on 2nd-order Theory of Mind (80% vs 10%, p<0.0001, OR=36.0); when graphs gate action selection through ranked shortlists, they become structurally essential even for strong models (100% vs 20% on 2nd-order ToM, p<0.001). Second, we identify "Planner Defiance," a model-family-specific failure where LLMs override correct planner recommendations at partial competence (90% override, replicated N=20); Gemini models show near-zero defiance while Llama 70B shows 90%, and models distinguish factual context (deferred to) from advisory recommendations (overridden). Third, full-game evidence confirms inter-agent conventions (+128% over baseline, p=0.003) outperform all single-agent interventions, and individual belief-graph components must be combined to produce gains. Fourth, preliminary scaling analysis (N=10/cell, exploratory) suggests graph depth has diminishing returns: shallow graphs provide the best cost-benefit ratio, while deeper ToM graphs appear harmful at larger player counts (-1.5 pts at 5-player, p=0.029).
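As a quick arithmetic check, the reported odds ratio can be reproduced from the two success rates alone. The sketch below assumes OR=36.0 is the plain sample odds ratio for 80% versus 10%; the abstract does not state the estimator or the per-group sample sizes, and the helper names `odds` and `odds_ratio` are illustrative.

```python
from fractions import Fraction


def odds(p: Fraction) -> Fraction:
    """Convert a success probability into odds, p / (1 - p)."""
    return p / (1 - p)


def odds_ratio(p_treatment: Fraction, p_control: Fraction) -> Fraction:
    """Ratio of the odds of success between two conditions."""
    return odds(p_treatment) / odds(p_control)


# Success rates quoted in the abstract for weak models on 2nd-order ToM.
with_graph = Fraction(80, 100)
without_graph = Fraction(10, 100)

print(float(odds_ratio(with_graph, without_graph)))  # 36.0
```

The odds are 4 with the graph (0.8 / 0.2) and 1/9 without it (0.1 / 0.9), so their ratio is 36, which matches the figure in the abstract. Exact fractions are used here only to avoid floating-point rounding in the check.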