Design and Evaluation of Multi-Agent AI Oracle Systems for Prediction Market Resolution

Tarun Kota

Published Jun 1, 2026

Editorial review: 6.8

Relevance: 0.469

Freshness: 0.112

Why It Matters

What makes this one worth your time

A prediction market is only as useful as its outcome resolution: if the oracle settles a question incorrectly, the forecast that traders paid for is worthless. More accurate oracles make market forecasts dependable enough for the decisions that rely on them.

Multi-agent LLM architectures are explored to enhance prediction market oracle accuracy, with mixed results.

Summary

The paper investigates whether multi-agent LLM architectures improve the accuracy of oracle systems in prediction markets, comparing independent aggregation and deliberative consensus against single-LLM baselines. Independent aggregation slightly outperforms the best single model, while deliberative consensus drops accuracy below every single-model baseline. The study proposes a hybrid AI-human system that auto-resolves only unanimous, high-confidence questions and escalates the rest to human review.

Key contributions

Evaluation of multi-agent LLM architectures for oracle systems in prediction markets.
Comparison of independent aggregation and deliberative consensus methods against single-LLM baselines.
Proposal of a hybrid AI-human oracle system for improved accuracy.
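
The hybrid routing idea can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the `AgentVote` type, the `route` function, and the 0.9 confidence cutoff are all hypothetical, since the abstract only says that unanimous, high-confidence questions are auto-resolved and that disagreement flags the rest for human review.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class AgentVote:
    model: str         # label for the voting model (hypothetical field)
    answer: str        # e.g. "yes" or "no"
    confidence: float  # self-reported confidence in [0, 1]


# Hypothetical cutoff; the abstract does not state the threshold used.
CONFIDENCE_THRESHOLD = 0.9


def route(votes: List[AgentVote],
          threshold: float = CONFIDENCE_THRESHOLD) -> Tuple[str, Optional[str]]:
    """Return ("auto", answer) when every agent agrees and every agent is
    confident; otherwise return ("human", None) to escalate the question."""
    if not votes:
        return ("human", None)
    unanimous = len({v.answer for v in votes}) == 1
    all_confident = all(v.confidence >= threshold for v in votes)
    if unanimous and all_confident:
        return ("auto", votes[0].answer)
    return ("human", None)
```

Raising the threshold in a rule like this trades coverage for accuracy: fewer questions are resolved automatically, but those that are resolved carry stronger agreement.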

Notable insights

Independent aggregation with confidence-weighted voting slightly improves accuracy over single models.
Error propagation in deliberative consensus can lead to decreased accuracy, highlighting the challenge of error correlation in multi-agent systems.
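
A minimal sketch of confidence-weighted voting follows. The abstract does not give the exact weighting scheme, so summing each model's self-reported confidence per answer is an assumption, and `confidence_weighted_vote` is a hypothetical name.

```python
from collections import defaultdict
from typing import Iterable, Tuple


def confidence_weighted_vote(votes: Iterable[Tuple[str, float]]) -> str:
    """Pick the answer with the largest total confidence.

    `votes` is an iterable of (answer, confidence) pairs, one per model.
    Assumption: weights are the raw self-reported confidences, summed.
    Ties go to the answer that appeared first in `votes`.
    """
    totals = defaultdict(float)
    for answer, confidence in votes:
        totals[answer] += confidence
    if not totals:
        raise ValueError("at least one vote is required")
    return max(totals, key=totals.get)
```

With this rule, one very confident model can outvote two hesitant ones, for example `[("yes", 0.55), ("yes", 0.30), ("no", 0.95)]` resolves to `"no"`. That helps when confidence tracks correctness and hurts when a model is confidently wrong.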

Possible limitations

Not stated in the abstract

Abstract

arXiv:2605.30802v1. Prediction markets aggregate collective intelligence to forecast uncertain events, but their utility depends on reliable outcome resolution. Existing oracle systems trade off fast but brittle automation against accurate but costly human arbitration. Single-LLM oracles achieve meaningful accuracy but inherit all failure modes of their underlying model with no self-correction mechanism. We evaluate whether multi-agent LLM architectures can improve oracle resolution accuracy over single-model baselines. We compare independent aggregation and deliberative consensus against single-LLM baselines (GPT-5 Nano, DeepSeek V3, and Llama-3.3-70B) on 1,189 resolved prediction market questions from KalshiBench. All agents share a common evidence layer through Exa, with retrieval filtered by publication date to isolate reasoning from retrieval quality. Independent aggregation with confidence-weighted voting achieves the highest accuracy at 83.43 percent, outperforming the best individual model by 1.01 percentage points. Deliberative consensus degrades accuracy to approximately 76 percent, below every single-model baseline, attributed to error propagation during debate where confidently wrong models flip correct ones. Error correlations across models (0.529-0.689) explain why aggregation gains fall short of the theoretical Condorcet ceiling, placing a fundamental limit on ensemble approaches. Many questions resist correction by any multi-agent architecture, motivating escalation to human arbitration. We propose routing criteria for hybrid AI-human oracle systems: auto-resolving only unanimous, high-confidence questions yields 97.87 percent accuracy on 47 percent of the dataset, with inter-agent disagreement flagging the remainder for human review.
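
To make the Condorcet ceiling concrete, the sketch below computes the accuracy of a simple majority vote under the idealized assumption that voters err independently and share the same accuracy. Neither assumption holds for the models in the paper, which is the abstract's point: the reported error correlations of 0.529 to 0.689 keep real gains well below this bound. The function name and the value p = 0.8 are illustrative, not taken from the paper.

```python
from math import comb


def majority_accuracy(p: float, n: int) -> float:
    """Probability that a strict majority of n independent voters is correct,
    when each voter is correct with probability p (n must be odd)."""
    if n < 1 or n % 2 == 0:
        raise ValueError("n must be a positive odd integer")
    k_min = n // 2 + 1  # smallest number of correct votes that forms a majority
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(k_min, n + 1))


# Illustrative only: three independent voters at 80 percent accuracy each.
print(round(majority_accuracy(0.8, 3), 3))  # 0.896
```

Under independence, three voters at 80 percent would reach about 89.6 percent by majority vote, a gain of nearly ten points. The 1.01-point gain reported in the abstract is far smaller, consistent with the models tending to make the same mistakes.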