LLM Agent Reasoning Evaluation Benchmark

LLM-as-an-Investigator: Evidence-First Reasoning for Robust Interactive Problem Diagnosis

Fabrizio Marozzo, Pietro Liò

Published Jun 12, 2026

Editorial review: 7.0

Relevance: 0.508

Freshness: 0.521

Why It Matters

What makes this one worth your time

This research is relevant for AI engineers and researchers interested in improving the robustness and accuracy of LLMs in technical problem-solving contexts, particularly in reducing biases introduced by user assumptions.

The paper introduces an evidence-first approach to enhance LLM problem diagnosis by mitigating user-driven biases.

Summary

The paper proposes an evidence-first methodology for problem diagnosis using large language models, addressing the issue of user-driven sycophancy by implementing a Solution Investigator Agent that evaluates problem descriptions, generates hypotheses, and iteratively refines them through targeted questioning. The approach is tested on a benchmark of technical forum threads, showing improved diagnostic accuracy over standard and reasoning-oriented LLMs.
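The refinement loop described above can be pictured as a probability update over candidate hypotheses. The sketch below is illustrative only: the summary does not give the update rule, so a plain Bayes update is assumed here, and the names `entropy`, `update`, `investigate`, the `threshold` parameter, and the example hypotheses are all hypothetical rather than taken from the paper.

```python
import math


def entropy(probs):
    """Shannon entropy in bits; an assumed stand-in for an ambiguity score."""
    return -sum(p * math.log2(p) for p in probs.values() if p > 0)


def update(probs, likelihoods):
    """Assumed Bayes update: prior times P(answer | hypothesis), renormalized."""
    unnorm = {h: probs[h] * likelihoods.get(h, 1.0) for h in probs}
    total = sum(unnorm.values())
    if total == 0:
        raise ValueError("answer is impossible under every hypothesis")
    return {h: v / total for h, v in unnorm.items()}


def investigate(priors, evidence, threshold=0.8):
    """Apply answers one at a time; stop once the top hypothesis reaches threshold."""
    probs = dict(priors)
    asked = 0
    for likelihoods in evidence:
        best = max(probs, key=probs.get)
        if probs[best] >= threshold:
            break  # one explanation already dominates; stop asking
        probs = update(probs, likelihoods)
        asked += 1
    best = max(probs, key=probs.get)
    return best, probs, asked


# Hypothetical case: the user suspects a worn pump, so it starts with the
# highest prior, but two answers shift the weight to a different cause.
priors = {"worn_pump": 0.5, "clogged_filter": 0.3, "air_in_line": 0.2}
evidence = [
    {"worn_pump": 0.2, "clogged_filter": 0.9, "air_in_line": 0.5},
    {"worn_pump": 0.1, "clogged_filter": 0.8, "air_in_line": 0.1},
]
best, posterior, asked = investigate(priors, evidence)
print(best, asked, round(entropy(posterior), 3))
```

In this toy run the user's initial guess starts as the most probable hypothesis and is overtaken only after evidence arrives, which is the behavior an evidence-first protocol is meant to encourage.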

Key contributions

Introduction of an evidence-first methodology for LLM-based problem diagnosis.
Development of a Solution Investigator Agent that refines hypotheses through targeted questioning.
Creation of a benchmark for evaluating diagnostic accuracy in technical domains.

Notable insights

The use of a Solution Investigator Agent to iteratively refine hypotheses based on evidence rather than initial user input.
The creation of a benchmark from solved technical forum threads to evaluate diagnostic accuracy.

Possible limitations

Not stated in the abstract

Abstract

arXiv:2606.13220v1

Large language models (LLMs) are increasingly used as interactive assistants for technical problem solving. However, when users provide incomplete descriptions or plausible but unverified explanations, LLMs may prematurely align with these assumptions and propose solutions before collecting sufficient evidence. We refer to this behavior as user-driven sycophancy: the tendency of an LLM to reinforce a user-provided hypothesis instead of testing alternative explanations. This paper introduces LLM-as-an-Investigator, an evidence-first agentic AI methodology for robust problem diagnosis. The approach is implemented through a Solution Investigator Agent, which estimates the ambiguity of an initial problem description, generates candidate hypotheses, asks targeted clarification questions, and updates hypothesis probabilities after each answer. Rather than producing an immediate response, the agent continues the investigation until the evidence makes one candidate explanation stronger than the alternatives. To evaluate the approach, we build a benchmark from solved technical forum threads in mechanical, electrical, and hydraulic domains. We use a three-agent evaluation pipeline in which a Problem-Solution Extractor Agent converts solved threads into structured cases, a Ground-Truth Evaluator Agent simulates the user while hiding the known solution, and the tested assistant attempts to recover the solution through dialogue. The experiments compare standard assistants, reasoning-oriented LLMs, and the proposed investigator-based model across LLM backbones. In addition to diagnostic accuracy, we analyze how standard assistants follow misleading user hypotheses in diagnostic cases. The results show that the proposed approach identifies the problem more accurately than direct prompting and reasoning-only baselines, while its evidence-first protocol helps reduce user-induced conversational bias.
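As a rough illustration of the three-agent evaluation pipeline the abstract describes, the sketch below wires a structured case, a simulated user, and a tested assistant into one dialogue loop. Everything in it is an assumption made for illustration: the `Case` fields, the keyword-matching `SimulatedUser`, the exact-match `is_correct` check, and the `scripted_assistant` are hypothetical stand-ins for what are LLM agents in the paper, and none of these names come from the authors.

```python
from dataclasses import dataclass, field


@dataclass
class Case:
    """Structured case, as an extractor agent might produce from a solved thread."""
    problem: str    # initial description shown to the assistant
    solution: str   # ground truth, never shown to the assistant
    facts: dict = field(default_factory=dict)  # details the user can reveal


class SimulatedUser:
    """Stand-in for the evaluator agent: answers questions, hides the solution."""

    def __init__(self, case):
        self._case = case

    def answer(self, question):
        # Toy retrieval: reveal a fact if its topic keyword appears in the question.
        for topic, fact in self._case.facts.items():
            if topic in question.lower():
                return fact
        return "I don't know."

    def is_correct(self, diagnosis):
        # Simplification: exact string match instead of an LLM judgment.
        return diagnosis.strip().lower() == self._case.solution.lower()


def run_case(case, assistant, max_turns=5):
    """Drive the dialogue; assistant returns ('ask', text) or ('diagnose', text)."""
    user = SimulatedUser(case)
    transcript = [("user", case.problem)]
    for _ in range(max_turns):
        action, text = assistant(transcript)
        transcript.append(("assistant", text))
        if action == "diagnose":
            return user.is_correct(text), transcript
        transcript.append(("user", user.answer(text)))
    return False, transcript  # ran out of turns without a diagnosis


def scripted_assistant(transcript):
    """Toy assistant: asks about the filter once, then diagnoses from the reply."""
    user_turns = [text for role, text in transcript if role == "user"]
    if len(user_turns) == 1:
        return "ask", "When was the filter last changed?"
    if "never" in user_turns[-1].lower():
        return "diagnose", "clogged filter"
    return "diagnose", "worn pump"


case = Case(
    problem="Pump is noisy, I think it is worn out.",
    solution="clogged filter",
    facts={"filter": "Never changed it."},
)
ok, transcript = run_case(case, scripted_assistant)
print(ok, len(transcript))
```

The example case deliberately includes a misleading user hypothesis in the problem statement, so an assistant that simply agreed with it would be scored as incorrect, while one that asks a question first can recover the hidden solution.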