LLM Agent Reasoning Evaluation Benchmark

MINDGAMES: A Live Arena for Evaluating Social and Strategic Reasoning in Multi-Agent LLMs

Kevin Wang, Anna Thöni, Benjamin Kempinski, Bobby Cheng, Jianzhu Yao, Benjamin Finch, Leon Guertler, Viraj Nadkarni, Yihan Jiang, Aliaksei Korshuk, Alexander Buyantuev, Ilya Makarov, Siyuan Wu, Yu-Chi Cheng, Yan-Ru Ju, Ti-Rong Wu, I-Hsuan Chu, Yu-Yu Yang, I-Chen Wu, Yitian Huang, Qinlu Cao, Yiheng Sun, Yuhong Dai, Hongkun Yao, Jingxuan Fu, Jiwei Zhang, Hao Liao, Mossimo Ebeling, Govind Arun, Sadhvik Bathini, Mihir S Arya, Avinash Anish, Aditya Ranjan, Kirtana Sunil Phatnani, Paval KS, Vrushali Mehta, Aravind S, Nikhil Arora, Tanya Upadhyay, Amol Bandagale, Yuan Lu, ChunEn Hsiao, YuTing Lin, Arvin Chung, Jerry John Thomas, Mathieu Laurière, Leshem Choshen, Yoram Bachrach, Pramod Viswanath, Maria Polukarov, Cheston Tan, Tal Kachman, Atlas Wang

Published May 29, 2026


Editorial review: 7.0

Relevance: 0.512

Freshness: 0.041

Why It Matters

What makes this one worth your time

Understanding and improving the social and strategic reasoning of LLMs is crucial for their deployment in real-world multi-agent scenarios, making this evaluation platform a valuable tool for researchers and developers.

Mindgames is a novel platform for evaluating multi-agent reasoning in LLMs through diverse game environments.

Summary

The paper introduces Mindgames, a multi-game arena and evaluation platform designed to assess the social and strategic reasoning capabilities of large language models (LLMs) in multi-agent settings. It evaluates LLM agents across four games (Colonel Blotto, Iterated Prisoner's Dilemma, Codenames, and Secret Mafia), identifies limitations in both the agents and the evaluation itself, and provides a dataset of multi-agent games for further research.

Key contributions

Introduction of Mindgames, a multi-game evaluation platform for LLMs.
Analysis of agent-level and evaluation-level limitations in multi-agent reasoning.
Release of a dataset of 29,571 multi-agent games with turn-level observations, actions, and rewards.

Notable insights

The use of a multi-game arena allows for a more comprehensive evaluation of LLMs' reasoning capabilities compared to static benchmarks.
The analysis reveals that failure-heavy environments can reward robustness to opponent errors as much as strategic ability, which weakens leaderboard validity in those environments.

Possible limitations

Not stated in the abstract

Abstract

arXiv:2605.29512v1

Large language models (LLMs) are increasingly deployed as interactive agents, yet their capacity for social and strategic reasoning over extended interaction remains poorly understood. Existing evaluations rely on static vignettes or single-game benchmarks that cannot capture the sustained, multi-faceted reasoning that real-world multi-agent settings demand. We introduce Mindgames, a multi-game arena and evaluation platform for LLM agents that operationalizes complementary reasoning demands relevant to "theory of mind": belief attribution under hidden information, opponent modeling through repeated strategic interaction, cooperative inference under knowledge asymmetries, and sustained deception in social deduction. Built on TextArena, Mindgames provides a unified interaction interface, TrueSkill-based rating, and full trajectory logging across four game environments. We instantiate Mindgames through a 2025 competition cycle hosted at a major AI conference, which assessed 944 submitted agents from 76 teams across four games: Colonel Blotto, Iterated Prisoner's Dilemma, Codenames, and Secret Mafia. Our analysis surfaces both agent-level and evaluation-level limitations: brittle rule adherence remains a major bottleneck, top-performing systems repeatedly rely on explicit structural scaffolding, and leaderboard validity differs sharply across environments. In particular, failure-heavy environments can reward robustness to opponent errors as much as strategic ability, with Secret Mafia exhibiting a pronounced error-survival confound in this cycle. We release a dataset of 29,571 multi-agent games with turn-level observations, actions, and rewards, together with MG-Ref, a deterministic offline tournament protocol that scores new agents against a frozen reference pool of top-ranked, low-error Stage II submissions under the same error-attribution lens used in this analysis.
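
The error-survival confound mentioned in the abstract can be made concrete with a small sketch. The abstract does not describe the dataset schema or the paper's error-attribution procedure, so everything below is an assumption for illustration: the `GameRecord` fields, the `win_breakdown` helper, and the toy records are hypothetical and are not drawn from the released data. The idea is simply to compare an agent's raw win rate with its win rate over games in which no opponent error occurred.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GameRecord:
    # Hypothetical schema: the released dataset's real field names
    # are not given in the abstract.
    agent: str
    won: bool
    opponent_errored: bool  # game was decided by an opponent's invalid action


def win_breakdown(records, agent):
    """Compare raw win rate with win rate over error-free games."""
    mine = [r for r in records if r.agent == agent]
    clean = [r for r in mine if not r.opponent_errored]
    wins = sum(r.won for r in mine)
    error_wins = sum(r.won and r.opponent_errored for r in mine)
    return {
        # Share of all games won, regardless of how they ended.
        "raw_win_rate": wins / len(mine) if mine else None,
        # Share of games won when no opponent error occurred.
        "clean_win_rate": sum(r.won for r in clean) / len(clean) if clean else None,
        # Fraction of wins that came from opponent errors.
        "error_win_share": error_wins / wins if wins else None,
    }


# Toy records, invented purely for illustration.
toy_games = [
    GameRecord("A", True, True),
    GameRecord("A", True, True),
    GameRecord("A", True, False),
    GameRecord("A", False, False),
    GameRecord("B", True, False),
    GameRecord("B", True, False),
    GameRecord("B", False, False),
    GameRecord("B", True, True),
]

stats_a = win_breakdown(toy_games, "A")
stats_b = win_breakdown(toy_games, "B")
print(stats_a)
print(stats_b)
```

In this toy data both agents win three of four games, so a leaderboard based on raw outcomes would treat them as equal. Restricting to error-free games separates them: agent A wins half of its clean games while agent B wins two of three. This is only one simple way to look at the issue; the paper's own analysis may attribute errors differently.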