LLM Reasoning Evaluation Interpretability

X-RAY: Mapping LLM Reasoning Capability via Formalized and Calibrated Probes

Tianxi Gao, Yufan Cai, Yusi Yuan, Jin Song Dong

Published Jun 3, 2026. Featured #2 in the daily list, Jun 4, 2026.

Daily score: 73.2

Editorial review: 7.5

Relevance: 0.466

Freshness: 0.722

Why It Matters

What makes this one worth your time

Understanding LLM reasoning capabilities is crucial for improving model design and ensuring reliable performance in complex tasks.

X-RAY provides a novel framework for evaluating LLM reasoning through formalized probes.

Summary

The paper introduces X-RAY, a system that analyzes the reasoning capabilities of large language models (LLMs) using calibrated, formally verified probes. It evaluates state-of-the-art models on mathematics, physics, and chemistry problems ranging from junior-level to advanced.

Key contributions

Development of X-RAY, an explainable reasoning analysis system for LLMs.
Introduction of calibrated, formally verified probes to evaluate reasoning capabilities.
Identification of failure modes in LLMs that are structurally interpretable.

Notable insights

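Probes are generated with controlled structural variations (constraint interaction, reasoning depth, solution-space geometry), so a change in model performance can be tied to a specific structural change in the problem.
Models are relatively robust to constraint refinement, where added conditions shrink an existing solution space, but degrade sharply under solution-space restructuring, where the structural form of the solution space itself changes. This asymmetry points to specific weaknesses that future training and evaluation can target.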

Possible limitations

Not stated in the abstract.

Abstract

arXiv:2603.05290v2

Large language models (LLMs) achieve promising performance, yet their ability to reason remains poorly understood. Existing evaluations largely emphasize task-level accuracy, often conflating pattern matching with reasoning capability. We present X-RAY, an explainable reasoning analysis system that maps the LLM reasoning capability using calibrated, formally verified probes. We model reasoning capability as a function of extractable structure, operationalized through formal properties such as constraint interaction, reasoning depth, and solution-space geometry. X-RAY generates probes via formal tools with controlled structural variations, enabling precise isolation of incremental structural information through formal calibration and verification. We evaluate state-of-the-art LLMs on problems ranging from junior-level to advanced in mathematics, physics, and chemistry. Our analysis reveals a systematic asymmetry in LLM reasoning: models are relatively robust to constraint refinement, where additional conditions shrink an existing solution space, but degrade sharply under solution-space restructuring, where modifications alter the underlying structural form of the solution manifold. Moreover, calibrated formal probes differentiate models that appear indistinguishable on standard benchmarks and reveal failure modes that are structurally interpretable rather than opaque. Beyond evaluation, our framework is contamination-free and supports the training and testing of reasoning models.
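
To make the refinement/restructuring distinction concrete, here is a toy sketch in Python. It is not X-RAY's probe generator, which the abstract does not describe in detail. The problem, the brute-force enumeration, and the names `solutions` and `classify_variant` are all assumptions introduced for illustration. The sketch treats a variant as a refinement when its solution set is a subset of the base problem's solution set, and as a restructuring otherwise.

```python
from itertools import product


def solutions(constraints, domain=range(0, 11)):
    """Enumerate integer pairs (x, y) in the domain that satisfy every constraint."""
    return {
        (x, y)
        for x, y in product(domain, repeat=2)
        if all(c(x, y) for c in constraints)
    }


def classify_variant(base, variant):
    """Hypothetical labelling rule: a subset of the base set counts as a refinement."""
    return "refinement" if variant <= base else "restructuring"


# Base problem (illustrative): pairs with x + y == 10.
base = solutions([lambda x, y: x + y == 10])

# Refinement: an extra condition shrinks the existing solution space.
refined = solutions([lambda x, y: x + y == 10, lambda x, y: x > y])

# Restructuring: the defining relation changes, so the set has a different form.
restructured = solutions([lambda x, y: x * y == 10])

print(len(base), classify_variant(base, refined), classify_variant(base, restructured))
```

Under this simplified rule, a probe suite could pair each base problem with one variant of each kind and compare model accuracy across the two groups. The subset test is only a stand-in for the formal calibration the paper refers to.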