FALSIFYBENCH: Evaluating Inductive Reasoning in LLMs with Rule Discovery Games
Leonardo Bertolazzi, Katya Tentori, Raffaella Bernardi
Why It Matters
LLMs are increasingly deployed as autonomous agents in scientific tasks, so it matters whether they can carry out the inductive reasoning that scientific discovery requires, and where that reasoning breaks down.
FALSIFYBENCH evaluates LLMs' inductive reasoning through a novel hypothesis-driven framework.
Summary
The paper introduces FALSIFYBENCH, an evaluation framework for inductive reasoning in large language models (LLMs). Its task is inspired by the classic Wason 2-4-6 task: an agent must discover a hidden semantic property by iteratively proposing examples and receiving feedback. Among the evaluated models, those that actively try to falsify their hypotheses (negative testing) outperform those that mainly seek confirmation.
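The propose-and-receive-feedback loop behind a Wason-style task can be sketched in a few lines. This is a minimal illustration only: it uses the numeric triples of the classic 2-4-6 task as a stand-in, whereas FALSIFYBENCH targets hidden semantic properties whose protocol is not detailed here. All names (`hidden_rule`, `play`, `consistent`, `evens_step_two`) are hypothetical and do not come from the paper.

```python
from typing import Callable, Iterable, List, Tuple

Triple = Tuple[int, int, int]
Rule = Callable[[Triple], bool]


def hidden_rule(t: Triple) -> bool:
    """Assumed hidden rule (classic Wason setup): any strictly ascending triple."""
    return t[0] < t[1] < t[2]


def play(probes: Iterable[Triple], rule: Rule) -> List[Tuple[Triple, bool]]:
    """Submit each probe to the hidden rule and record the yes/no feedback."""
    return [(p, rule(p)) for p in probes]


def consistent(hypothesis: Rule, history: List[Tuple[Triple, bool]]) -> bool:
    """A hypothesis survives if it agrees with every piece of feedback so far."""
    return all(hypothesis(p) == label for p, label in history)


def evens_step_two(t: Triple) -> bool:
    """A too-narrow guess: even numbers increasing by two."""
    return t[0] % 2 == 0 and t[1] - t[0] == 2 and t[2] - t[1] == 2


# Confirming probes only: every example already fits the guess.
confirming = play([(2, 4, 6), (8, 10, 12), (20, 22, 24)], hidden_rule)

# Falsifying probes: the guess predicts the last two should be rejected.
falsifying = play([(2, 4, 6), (1, 2, 3), (5, 10, 100)], hidden_rule)

print(consistent(evens_step_two, confirming))  # True: the wrong guess survives
print(consistent(evens_step_two, falsifying))  # False: the wrong guess is exposed
```

In this toy setup, confirming probes can never separate the narrow guess from the broader hidden rule, because every triple that fits the guess also fits the rule. Only probes the guess predicts should fail can expose it.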
Key contributions
- Introduction of the FALSIFYBENCH evaluation framework for inductive reasoning in LLMs.
- Empirical evaluation of 12 LLMs across model families and scales, showing that reasoning models are generally stronger than instruction-tuned models, although no model comes close to optimal performance.
- Identification of the capacity for negative testing as the primary driver of success in hypothesis testing.
Notable insights
- Success depends less on confirming a hypothesis than on trying to break it: models that actively seek to falsify their hypotheses consistently outperform those that primarily seek confirmation.
- The fine-grained turn-level analysis, which the authors describe as neglected in previous work, ties failure to identifiable patterns in how models navigate the hypothesis space.
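One way to picture a turn-level view of negative testing is to label each proposed example against the hypothesis the agent held at that turn. The sketch below is an assumption-laden illustration, not the paper's metric: it reuses numeric triples from the classic Wason task, and `classify_turn`, `negative_test_rate`, and the example hypotheses are hypothetical names introduced here.

```python
from typing import Callable, List, Tuple

Triple = Tuple[int, int, int]
Rule = Callable[[Triple], bool]


def classify_turn(probe: Triple, hypothesis: Rule) -> str:
    """Label a probe relative to the hypothesis held when it was proposed.

    positive: the hypothesis predicts the probe will be accepted.
    negative: the hypothesis predicts the probe will be rejected.
    """
    return "positive" if hypothesis(probe) else "negative"


def negative_test_rate(turns: List[Tuple[Triple, Rule]]) -> float:
    """Fraction of turns whose probe falls outside the current hypothesis."""
    if not turns:
        return 0.0
    labels = [classify_turn(probe, hyp) for probe, hyp in turns]
    return labels.count("negative") / len(labels)


def step_two(t: Triple) -> bool:
    """Hypothetical early guess: each number is two more than the last."""
    return t[1] - t[0] == 2 and t[2] - t[1] == 2


def ascending(t: Triple) -> bool:
    """Hypothetical revised guess: strictly ascending numbers."""
    return t[0] < t[1] < t[2]


# Each turn pairs the probe with the hypothesis held at that moment.
turns = [
    ((2, 4, 6), step_two),    # positive test
    ((8, 10, 12), step_two),  # positive test
    ((1, 2, 3), step_two),    # negative test
    ((3, 2, 1), ascending),   # negative test after revising the guess
]

print(negative_test_rate(turns))  # 0.5
```

Tracking the hypothesis per turn, rather than only the final answer, is what makes the label meaningful: the same probe can be a positive test under one guess and a negative test under another.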
Possible limitations
- Not stated in the abstract.
Abstract
arXiv:2606.04751v1
Large language models (LLMs) are increasingly deployed as autonomous agents in scientific tasks. Yet whether these systems can effectively engage in forms of inductive reasoning relevant to scientific discovery remains an open question. In this work, we introduce FALSIFYBENCH, an evaluation framework for hypothesis-driven reasoning inspired by the classic Wason 2-4-6 task, in which agents must discover hidden semantic properties by iteratively proposing examples and receiving feedback. This task captures key elements of scientific reasoning: hypothesis generation, evidence gathering, and belief revision in response to both confirming and disconfirming evidence. Our evaluation of 12 LLMs across model families and scales shows that reasoning models are generally stronger scientific reasoners than instruction-tuned models, although no model comes close to optimal performance. The primary driver of success is the capacity for negative testing: models that actively seek to falsify their hypotheses consistently outperform those that primarily seek confirmation. Moreover, a fine-grained turn-level analysis, neglected in previous work, reveals that failure is tied to identifiable patterns in how models navigate the hypothesis space.