The Consensus Trap: Rescuing Multi-Agent LLMs from Adversarial Majorities via Token-Level Collaboration

Jiayuan Liu, Shiyi Du, Weihua Du, Mingyu Guo, Vincent Conitzer

Published Apr 21, 2026

Editorial review: 7.5

Relevance: 0.453

Freshness: 0.000

Why It Matters

What makes this one worth your time

As multi-agent systems become more prevalent, ensuring their resilience against adversarial attacks is crucial for reliable AI applications.

This work introduces a novel collaboration method to enhance the robustness of multi-agent LLMs against adversarial influences.

Summary

The paper identifies a vulnerability in multi-agent LLMs that rely on response-level aggregation, proposing a Token-Level Round-Robin Collaboration method to mitigate the effects of adversarial corruptions and demonstrating its effectiveness through empirical evaluations.

Key contributions

Identification of a critical vulnerability in response-level aggregation in multi-agent LLMs.
Introduction of the Token-Level Round-Robin Collaboration method as a solution to adversarial majority issues.
Empirical evaluation demonstrating the robustness of the proposed method compared to Majority Voting.

Notable insights

The proposed Token-Level Round-Robin method shifts aggregation from a linear sum of final votes to a non-linear operator product, so that agents contribute to a single interwoven chain of logic rather than to a count of finished answers.
The theoretical framework formalizing the interleaving process as a discrete-time dynamical system provides a new perspective on agent collaboration.

Possible limitations

Not stated in the abstract.

Abstract

arXiv:2604.17139v1

Multi-agent large language model (LLM) architectures increasingly rely on response-level aggregation, such as Majority Voting (MAJ), to raise reasoning ceilings. However, in open environments, agents are highly susceptible to stealthy contextual corruption, such as targeted prompt injections. We reveal a critical structural vulnerability in current multi-agent systems: response-level aggregation collapses when corrupted agents form a local majority. Because voting aggregates fully-formed conclusions, it is blind to flawed intermediate logic. To overcome this systematic limitation, we propose the Token-Level Round-Robin (RR) Collaboration, where agents sequentially interleave generation within a shared auto-regressive context. We formalize this process as a discrete-time dynamical system, proving that token-level interleaving transitions aggregation from a brittle counting of final votes (a linear sum) to a dynamic, interwoven chain of logic (a non-linear operator product). Through this theoretical lens, we prove that the honest model's restorative pull can overpower adversarial corruptions, even when corrupted agents form a majority. We conduct an exhaustive empirical evaluation across diverse reasoning benchmarks and demonstrate that while MAJ collapses when corrupted agents reach a majority, RR maintains robust accuracy well beyond this critical threshold.
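
To make the contrast between the two aggregation styles concrete, here is a minimal Python sketch of the mechanics only. It is not the authors' implementation: the `Agent` interface, the function names `majority_vote` and `round_robin_generate`, the `<eos>` stop token, and the `constant_agent` helper are all assumptions made for illustration, and real agents would be LLM calls rather than fixed functions. The sketch shows that voting sees only finished answers, while round-robin generation has every agent extend one shared context in turn.

```python
from collections import Counter
from typing import Callable, List, Sequence

# Hypothetical agent interface (an assumption, not from the paper):
# an agent maps the shared context so far to its next token.
Agent = Callable[[Sequence[str]], str]


def majority_vote(answers: Sequence[str]) -> str:
    """Response-level aggregation: return the most frequent final answer.

    Only completed answers are counted, so the reasoning behind them
    is never inspected.
    """
    return Counter(answers).most_common(1)[0][0]


def round_robin_generate(
    agents: Sequence[Agent],
    prompt: Sequence[str],
    max_tokens: int,
    stop: str = "<eos>",  # assumed stop token
) -> List[str]:
    """Token-level interleaving: agents take turns appending one token
    to a single shared context, each conditioning on everything
    written so far, including the other agents' tokens.
    """
    context = list(prompt)
    for step in range(max_tokens):
        agent = agents[step % len(agents)]  # fixed cyclic turn order
        token = agent(context)
        if token == stop:
            break
        context.append(token)
    return context[len(prompt):]


def constant_agent(token: str) -> Agent:
    """Toy stand-in for a model: always emits the same token."""
    return lambda context: token


# Voting: two corrupted answers outvote one honest answer.
voted = majority_vote(["4", "5", "5"])

# Round-robin: both agents write into the same sequence in turn.
interleaved = round_robin_generate(
    [constant_agent("a"), constant_agent("b")], ["Q:"], max_tokens=4
)
```

With constant toy agents the interleaved output simply alternates, so this sketch demonstrates the control flow and nothing about robustness. The robustness results described in the abstract depend on how an honest model conditions on the shared context, which a fixed-output stub cannot reproduce.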