The Consensus Trap: Rescuing Multi-Agent LLMs from Adversarial Majorities via Token-Level Collaboration
Jiayuan Liu, Shiyi Du, Weihua Du, Mingyu Guo, Vincent Conitzer
Why It Matters
What makes this one worth your time
Multi-agent LLM systems deployed in open environments are susceptible to stealthy contextual corruption, such as targeted prompt injection, so their resilience to such attacks matters for reliable AI applications.
This work proposes a token-level collaboration method intended to keep multi-agent LLMs accurate even when corrupted agents form a majority.
Summary
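The paper identifies a structural vulnerability in multi-agent LLM systems that rely on response-level aggregation such as Majority Voting (MAJ): aggregation collapses once corrupted agents form a local majority. It proposes Token-Level Round-Robin (RR) Collaboration, in which agents take turns generating within a shared auto-regressive context, and reports that RR maintains robust accuracy beyond the threshold at which MAJ collapses.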
Key contributions
- Identification of a critical structural vulnerability: response-level aggregation in multi-agent LLMs collapses when corrupted agents form a local majority.
- Introduction of Token-Level Round-Robin (RR) Collaboration, in which agents sequentially interleave generation within a shared auto-regressive context.
- Empirical evaluation across diverse reasoning benchmarks showing that RR maintains robust accuracy past the point where Majority Voting (MAJ) collapses.
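The difference between the two aggregation styles in the list above can be sketched in a few lines of Python. This is a minimal illustration of the mechanics only, not the authors' implementation: the `Agent` interface, the function names, the `eos` marker, and the toy agents are all assumptions made for this example. The toy agents ignore the context, whereas a real LLM agent would condition its next token on it.

```python
from collections import Counter
from typing import Callable, List, Sequence

# Hypothetical agent interface (an assumption for this sketch):
# an agent maps the shared token context to a single next token.
Agent = Callable[[Sequence[str]], str]


def majority_vote(answers: Sequence[str]) -> str:
    """Response-level aggregation: count finished answers and return the most common."""
    return Counter(answers).most_common(1)[0][0]


def round_robin_generate(
    agents: Sequence[Agent],
    prompt: Sequence[str],
    max_tokens: int,
    eos: str = "<eos>",
) -> List[str]:
    """Token-level aggregation: agents take turns appending one token to a shared context."""
    context = list(prompt)
    for step in range(max_tokens):
        agent = agents[step % len(agents)]  # round-robin turn order
        token = agent(context)              # each agent sees everything written so far
        context.append(token)
        if token == eos:
            break
    return context[len(prompt):]


# Toy agents that ignore the context (illustration only).
def honest(context: Sequence[str]) -> str:
    return "h"


def corrupted(context: Sequence[str]) -> str:
    return "c"


# Response level: two corrupted answers outvote one honest answer.
voted = majority_vote(["4", "5", "5"])

# Token level: the honest agent still writes every third token of the shared output.
interleaved = round_robin_generate([honest, corrupted, corrupted], ["Q"], max_tokens=6)
```

With one honest and two corrupted agents, `majority_vote` simply returns the corrupted answer, while `round_robin_generate` gives the honest agent a turn in every cycle. Whether those turns are enough to steer the output back is the question the paper addresses; the sketch does not demonstrate it.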
Notable insights
- Token-level interleaving changes aggregation from a brittle count of final votes (a linear sum) to a dynamic, interwoven chain of logic (a non-linear operator product).
- Formalizing the interleaving process as a discrete-time dynamical system lets the authors prove that the honest model's restorative pull can overpower adversarial corruptions, even when corrupted agents form a majority.
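The contrast between a linear sum and an operator product can be written schematically as follows. The notation is an assumption made for illustration and is not taken from the paper: $y_i$ is the final answer of agent $i$, $x_t$ is the shared context after $t$ tokens, and $T_i$ is the (generally non-linear) map by which agent $i$ extends the context by one token.

```latex
% Majority Voting: a count (linear sum) over n finished answers
\hat{y}_{\mathrm{MAJ}} = \arg\max_{y} \sum_{i=1}^{n} \mathbb{1}[\, y_i = y \,]

% Round-Robin: a discrete-time system in which agents alternate turns
x_{t+1} = T_{a(t)}(x_t), \qquad a(t) = (t \bmod n) + 1

% Unrolled over T steps: a composition (product) of agent operators
x_T = \bigl( T_{a(T-1)} \circ T_{a(T-2)} \circ \cdots \circ T_{a(0)} \bigr)(x_0)
```

In the first expression each agent contributes one independent term, so a corrupted majority fixes the outcome. In the unrolled form every operator acts on the output of all earlier ones, so an honest agent's turns can alter what later agents condition on.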
Possible limitations
- Not stated in the abstract.
Abstract
arXiv:2604.17139v1
Multi-agent large language model (LLM) architectures increasingly rely on response-level aggregation, such as Majority Voting (MAJ), to raise reasoning ceilings. However, in open environments, agents are highly susceptible to stealthy contextual corruption, such as targeted prompt injections. We reveal a critical structural vulnerability in current multi-agent systems: response-level aggregation collapses when corrupted agents form a local majority. Because voting aggregates fully-formed conclusions, it is blind to flawed intermediate logic. To overcome this systematic limitation, we propose the Token-Level Round-Robin (RR) Collaboration, where agents sequentially interleave generation within a shared auto-regressive context. We formalize this process as a discrete-time dynamical system, proving that token-level interleaving transitions aggregation from a brittle counting of final votes (a linear sum) to a dynamic, interwoven chain of logic (a non-linear operator product). Through this theoretical lens, we prove that the honest model's restorative pull can overpower adversarial corruptions, even when corrupted agents form a majority. We conduct an exhaustive empirical evaluation across diverse reasoning benchmarks and demonstrate that while MAJ collapses when corrupted agents reach a majority, RR maintains robust accuracy well beyond this critical threshold.