The Confident Liar: Diagnosing Multi-Agent Debate with Log-Probabilities and LLM-as-Judge
Ali Keramati, Justin Cheok, Jacob Horne, Mark Warschauer
Why It Matters
Multi-agent debate systems are usually evaluated only on whether the final answer is correct, which says little about the quality of the reasoning that produced it. If an agent's token-level confidence tracks the judged quality of its reasoning, that signal could help flag weak or failed reasoning that accuracy alone misses.
The study measures how closely internal confidence signals follow judged reasoning quality in multi-agent debate, and finds that the alignment depends on the agent's role.
Summary
The paper studies how three signals relate in multi-agent debate: token-level log-probabilities over reasoning tokens, LLM-as-judge rubric scores, and final task accuracy. It pairs a two-agent debate architecture (a Constructor and an Auditor) with an LLM-as-judge, and targets three domains: rubric-based scoring, mathematical reasoning, and factual question answering. The reported experiments cover the rubric-scoring domain, where confidence aligns with judged reasoning quality roughly twice as strongly for the Constructor as for the Auditor; the broader cross-domain investigation is proposed rather than completed.
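To make the confidence-versus-quality comparison concrete, here is a minimal sketch of one way to compute it per role. It assumes confidence is the mean token log-probability of a turn and that alignment is measured with a Pearson correlation; the abstract specifies neither choice, and the record fields (`role`, `logprobs`, `judge_score`) and function names are hypothetical.

```python
import math
from statistics import mean


def mean_logprob(token_logprobs):
    """Turn-level confidence as the average token log-probability.

    Assumption: the abstract does not say how token-level values are
    aggregated; the mean is only one plausible choice.
    Values closer to 0 indicate higher confidence.
    """
    return mean(token_logprobs)


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (sx * sy)


def confidence_quality_correlation(turns, role):
    """Correlate confidence with judge score for one agent role.

    `turns` is a list of dicts with hypothetical keys:
    "role", "logprobs" (list of floats), "judge_score" (float).
    """
    rows = [t for t in turns if t["role"] == role]
    confidence = [mean_logprob(t["logprobs"]) for t in rows]
    quality = [t["judge_score"] for t in rows]
    return pearson(confidence, quality)
```

Calling `confidence_quality_correlation(turns, "constructor")` and again with `"auditor"` would give the two per-role values whose gap the paper describes as a role asymmetry. A rank correlation could be substituted if the rubric scores are treated as ordinal.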
Key contributions
- Introduces a framework pairing a two-agent debate architecture with an LLM-as-judge.
- Analyzes the relationship between log-probabilities, rubric scores, and task accuracy in multi-agent debates.
Notable insights
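- In the rubric-scoring domain, confidence follows a consistent four-phase trajectory.
- Confidence aligns with judged reasoning quality roughly twice as strongly for the Constructor as for the Auditor.
- Confidence-based detection of critical reasoning failures is markedly more reliable for the Constructor (AUROC 0.804) than for the Auditor (AUROC 0.634).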
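The failure-detection comparison is reported as AUROC. The sketch below shows how such a number could be computed from per-turn confidences and the judge's critical-failure flag, using the pairwise (Mann-Whitney) definition of AUROC. Treating low confidence as the failure score is an assumption, as are the function names; the abstract does not describe the detector.

```python
def auroc(scores, labels):
    """AUROC as the probability that a positive outranks a negative.

    `scores` are detector outputs (higher means "more likely positive"),
    `labels` are truthy for positives. Ties count as half a win.
    """
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def failure_detection_auroc(confidences, failure_flags):
    """Score each turn by negated confidence, then compute AUROC.

    Assumption: lower confidence (more negative mean log-probability)
    is taken as evidence of a critical reasoning failure.
    """
    return auroc([-c for c in confidences], failure_flags)
```

An AUROC of 0.5 corresponds to chance-level ranking and 1.0 to perfect separation, which is the scale on which the reported 0.804 and 0.634 should be read. The quadratic pairwise loop is fine for small samples; a rank-based version would scale better.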
Possible limitations
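- The abstract states no limitations explicitly. It reports experiments only for the rubric-scoring domain; the mathematical reasoning and factual question answering studies are proposed rather than completed.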
Abstract
arXiv:2606.10296v1
Multi-agent debate systems are typically evaluated only on whether the final answer is correct, overlooking the quality of the intermediate reasoning that debate is designed to produce. This paper studies the relationship between three signals in multi-agent debate: token-level log-probability distributions over reasoning tokens, LLM-as-judge rubric scores assigned to those tokens, and final task accuracy. We examine whether internal confidence signals predict externally evaluated reasoning quality, and whether either signal aligns with task correctness, across three domains: rubric-based scoring, mathematical reasoning, and factual question answering. Our framework pairs a two-agent debate architecture -- a Constructor and an Auditor -- with an LLM-as-judge that scores each agent's reasoning along instruction following, justification quality, and evidence grounding, together with a critical-failure flag. Experiments in the rubric-scoring domain reveal a consistent four-phase confidence trajectory and a substantial role asymmetry: confidence aligns with judged reasoning quality roughly twice as strongly for the Constructor as for the Auditor, and confidence-based detection of critical reasoning failures is markedly more reliable for the Constructor (AUROC 0.804) than for the Auditor (0.634). These findings motivate the broader cross-domain investigation proposed in this paper.