A Comprehensive Anatomy of Human and DeepSeek-R1 LLM Mathematical Reasoning

Yuxiang Chen, Jun Wang

Published Jun 8, 2026

Editorial review: 6.8

Relevance: 0.482

Freshness: 0.000

Why It Matters

Understanding the differences between human and AI reasoning can guide the development of more effective AI models that genuinely reason rather than mimic reasoning.

The paper compares human and AI reasoning in mathematical problem-solving, highlights structural differences between them, and suggests improvements to how AI models are evaluated and trained.

Summary

The paper empirically compares human reasoning with that of the DeepSeek-R1 LLM on all 30 problems from AIME 2025. It annotates 10,247 reasoning steps into five functional categories (Analysis, Inference, Branch, Backtrace, and Reflection), identifies structural differences between the two reasoning processes, and suggests improvements for evaluating and training AI models.

Key contributions

Comprehensive empirical comparison of human and AI reasoning on mathematical problems.
Identification of structural differences in reasoning processes between humans and AI.
Suggestions for improving AI reasoning evaluation and training.

Notable insights

Successful reasoning traces in AI models exhibit stable use of branching and backtracking, while failed traces either underuse or overuse these exploratory actions.
Reflection is effective only when placed within deductive inference; reflections trapped in analysis loops focus on local numerical details and miss global logical errors.
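
The two insights above can be made concrete with a small sketch. It assumes a trace is simply a list of step labels drawn from the paper's five categories. The function names, the "step immediately before a Reflection" proxy for where a reflection sits, and the example trace are illustrative assumptions, not the authors' annotation scheme or metrics.

```python
from collections import Counter

# The five functional categories named in the abstract.
CATEGORIES = ("Analysis", "Inference", "Branch", "Backtrace", "Reflection")


def exploratory_rate(trace):
    """Fraction of steps that are exploratory (Branch or Backtrace)."""
    if not trace:
        return 0.0
    counts = Counter(trace)
    return (counts["Branch"] + counts["Backtrace"]) / len(trace)


def reflection_contexts(trace):
    """Count Reflection steps by the label of the step just before them.

    Assumption: the preceding label is used as a rough proxy for whether a
    reflection sits inside deductive inference or inside an analysis loop.
    """
    contexts = Counter()
    for prev, cur in zip(trace, trace[1:]):
        if cur == "Reflection":
            contexts[prev] += 1
    return contexts


# Hypothetical annotated trace, made up for illustration.
trace = [
    "Analysis", "Inference", "Branch", "Inference", "Reflection",
    "Backtrace", "Inference", "Analysis", "Reflection",
]

rate = exploratory_rate(trace)
contexts = reflection_contexts(trace)
```

Under these assumptions, a very low or very high `rate` would correspond to underuse or overuse of exploratory actions, and a `contexts` count dominated by "Analysis" would point to reflections that follow analysis rather than deduction.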

Possible limitations

Not stated in the abstract

Abstract

arXiv:2606.07410v1 (cross-listed)

The emergence of "Aha moments" in large language models, particularly DeepSeek-R1-0120, has raised the question of whether these systems genuinely reason or merely imitate the appearance of reasoning. We conduct a comprehensive empirical comparison between model and human reasoning across all 30 problems from AIME 2025, exhaustively annotating 10,247 reasoning steps into five functional categories: Analysis, Inference, Branch, Backtrace, and Reflection. We find a clear structural difference. Human solutions maintain a compact alternation between analysis and deduction, whereas DeepSeek-R1 frequently revisits intermediate results, performs shallow and often unnecessary verification, and loops through local checks without meaningful logical progress. We describe this as topological mimicry: reproducing the surface form of reasoning without its functional role. Despite this, we identify two signals of genuine reasoning. First, successful traces exhibit stable use of branching and backtracking, while failed traces either underuse or overuse exploratory actions. Second, reflection is only effective when placed within deductive inference; reflections trapped in analysis loops focus on local numerical details while missing global logical errors. These findings suggest that current long-CoT models may be rewarded more for the appearance of reasoning than for genuine deductive progress. We discuss directions for improving evaluation and training, including measuring cross-trace stability, penalising "spinning-wheel" traces, encouraging deeper logical correction, and reallocating inference-time compute toward deduction and backtracking. Overall, reasoning quality depends not simply on how much reflection occurs, but on whether reflection appears consistently and at the appropriate logical scale.
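
As a rough illustration of two of the proposed directions, the sketch below scores a "spinning-wheel" trace and a cross-trace stability measure over label sequences. Both definitions are assumptions made for this example: a stall is taken to be the longest run of steps with no Inference label, and stability is taken to be the population standard deviation of the Branch/Backtrace share across traces of the same problem. The abstract does not specify how the authors measure either quantity, and the traces are hypothetical.

```python
from statistics import pstdev


def longest_stall(trace):
    """Length of the longest run of consecutive non-Inference steps.

    Assumption: a long run without any Inference step is treated as a
    "spinning-wheel" signal, i.e. activity without deductive progress.
    """
    longest = current = 0
    for label in trace:
        if label == "Inference":
            current = 0
        else:
            current += 1
            longest = max(longest, current)
    return longest


def exploratory_share(trace):
    """Fraction of steps labelled Branch or Backtrace."""
    if not trace:
        return 0.0
    hits = sum(1 for label in trace if label in ("Branch", "Backtrace"))
    return hits / len(trace)


def cross_trace_spread(traces):
    """Population standard deviation of exploratory share across traces.

    Assumption: a smaller spread is read as more stable use of exploration.
    """
    return pstdev(exploratory_share(t) for t in traces)


# Two hypothetical traces for the same problem.
traces = [
    ["Analysis", "Inference", "Branch", "Inference"],
    ["Analysis", "Reflection", "Analysis", "Reflection", "Inference"],
]

stalls = [longest_stall(t) for t in traces]
spread = cross_trace_spread(traces)
```

A training or evaluation signal built on this idea could, for example, penalise traces whose stall length exceeds a chosen threshold; the threshold itself would have to be tuned on annotated data.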