The Arbiter Agent: Continually Monitoring Multi-Agent Conversations to Detect Emergent Misalignment
Filippo Tonini, Federico Torrielli, Anton Danholt Lautrup, Peter Schneider-Kamp, Mustafa Mert Çelikok, Lukas Galke Poech
Why It Matters
What makes this one worth your time
As AI systems increasingly rely on multi-agent interactions, ensuring alignment among agents is crucial for reliable decision-making and task execution.
The Arbiter agent monitors multi-agent conversations in real time to detect misalignment as it emerges.
Summary
The paper introduces the Arbiter, an agent that monitors multi-agent conversations in real time to detect emergent misalignment among participants. The Arbiter operates under a limited inspection budget: at each step it can wait, question a participant, examine internal information such as system prompts or reasoning traces, or log concerning behavior. At the end of the conversation it produces a report identifying the likely source of misalignment. The evaluation shows that active inspection tools improve both detection accuracy and speed, and that different types of misalignment vary in how hard they are to detect.
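The budgeted monitoring loop described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the four action names follow the abstract (wait, question, examine, log), but the `Arbiter` class, the `COST` table, the `monitor` helper, and the keyword-based `choose` policy are hypothetical stand-ins. In the paper the decision is made by a language-model backbone, and the source does not say how actions are priced.

```python
from dataclasses import dataclass, field
from enum import Enum


class Action(Enum):
    WAIT = "wait"          # keep observing
    QUESTION = "question"  # ask a participant a follow-up
    EXAMINE = "examine"    # inspect a system prompt or reasoning trace
    LOG = "log"            # record concerning behavior


# Hypothetical prices: the source does not state what each action costs.
COST = {Action.WAIT: 0, Action.QUESTION: 1, Action.EXAMINE: 2, Action.LOG: 0}


@dataclass
class Arbiter:
    budget: int
    suspicion: dict = field(default_factory=dict)
    log: list = field(default_factory=list)

    def choose(self, speaker, text):
        """Toy policy: a keyword match stands in for a model's judgment."""
        if "guaranteed" in text.lower():
            # Prefer the expensive inspection while it is affordable,
            # otherwise fall back to the free logging action.
            if self.budget >= COST[Action.EXAMINE]:
                return Action.EXAMINE
            return Action.LOG
        return Action.WAIT

    def step(self, turn, speaker, text):
        action = self.choose(speaker, text)
        self.budget -= COST[action]
        if action is not Action.WAIT:
            self.suspicion[speaker] = self.suspicion.get(speaker, 0) + 1
            self.log.append((turn, speaker, action.value))
        return action

    def report(self):
        """Return the most suspicious participant, or None if none was flagged."""
        if not self.suspicion:
            return None
        return max(self.suspicion, key=self.suspicion.get)


def monitor(conversation, budget):
    """Run the arbiter over (speaker, text) turns, one step per turn."""
    arbiter = Arbiter(budget=budget)
    for turn, (speaker, text) in enumerate(conversation):
        arbiter.step(turn, speaker, text)
    return arbiter


if __name__ == "__main__":
    # Made-up toy conversation, not data from the paper.
    conversation = [
        ("A", "Let's diversify the portfolio."),
        ("B", "This stock is guaranteed to double."),
        ("A", "I'm not sure about that."),
        ("B", "Guaranteed returns, trust me."),
    ]
    arbiter = monitor(conversation, budget=2)
    print(arbiter.report(), arbiter.budget, arbiter.log)
```

In this toy run the first suspicious turn uses up the whole budget on an examination, so the second one can only be logged. That is the kind of trade-off a budget-aware monitor has to manage.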
Key contributions
- Introduction of the Arbiter agent for monitoring multi-agent conversations.
- Evaluation of the Arbiter across different conversation conditions and tool configurations.
- Identification of the challenges in detecting different types of misalignment.
Notable insights
- Active inspection tools improve both the accuracy and speed of detecting misalignment.
- Weight-induced misalignment is harder to detect than instruction-induced misalignment.
Possible limitations
- Not stated in the abstract.
Abstract
arXiv:2606.10747v1
As AI systems built from multiple language-model agents become more common, they are increasingly used to make decisions together: discussing, negotiating, and acting on shared tasks. While individual agents may appear well-aligned when tested on their own, problems can arise from how they interact with one another. We introduce the Arbiter, an agent designed to monitor multi-agent conversations in real time and identify which participants may be behaving in misaligned ways. The Arbiter operates under a limited "inspection budget", meaning it must decide carefully how to use its resources. As it observes a conversation step by step, it can choose to wait, question a participant, examine internal information such as system prompts or reasoning traces, or log concerning behavior. At the end, it produces a report identifying the likely source of misalignment. We evaluate the Arbiter across five conversation conditions, ranging from risky financial advice model organisms to evaluation-aware and colluding agents. We test five tool configurations of increasing capability and two backbone models. We find that the Arbiter reliably detects misaligned agents well before the end of the conversation, with active inspection tools improving both detection accuracy and speed. Weight-induced misalignment proves hardest to detect, while instruction-induced misalignment is identified reliably even under passive observation. The logging tool exhibits a dual effect, improving recall at the cost of precision. These results suggest that continual, budget-aware monitoring can effectively catch misalignment, and that overseeing multi-agent systems may require treating the auditor as an active participant in the process. The code is available at https://github.com/aisilab/arbiter.
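To make the reported precision/recall trade-off of the logging tool concrete, here is a small sketch of how the two metrics are computed over sets of flagged agents. The function name `precision_recall` and the agent labels are illustrative assumptions, and the numbers are a made-up toy case, not results from the paper.

```python
def precision_recall(flagged, misaligned):
    """Precision and recall for a set of flagged agents.

    flagged:    agents the monitor reported as misaligned
    misaligned: agents that are actually misaligned (ground truth)
    """
    flagged, misaligned = set(flagged), set(misaligned)
    true_positives = len(flagged & misaligned)
    precision = true_positives / len(flagged) if flagged else 0.0
    recall = true_positives / len(misaligned) if misaligned else 0.0
    return precision, recall


# Toy ground truth (hypothetical): agents B and D are misaligned.
truth = {"B", "D"}

# A cautious monitor flags only B: nothing wrong is flagged, but D is missed.
print(precision_recall({"B"}, truth))

# A monitor that logs more freely flags B, C and D: both misaligned agents
# are caught (higher recall), but C is a false alarm (lower precision).
print(precision_recall({"B", "C", "D"}, truth))
```

In this toy case, flagging more agents raises recall from 0.5 to 1.0 while precision drops from 1.0 to about 0.67, which is the direction of the dual effect the abstract attributes to the logging tool.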