The Self-Correction Illusion: LLMs Correct Others but Not Themselves

Kuan-Yen Chen, Fang-Yi Su, Jung-Hsien Chiang

Published Jun 6, 2026

Editorial review7.2

Relevance0.478

Freshness0.000

Why It Matters

What makes this one worth your time

Understanding and addressing the self-correction limitations of LLMs can improve their reliability and performance in real-world applications.

The paper reveals that LLMs correct errors more effectively when attributed to external roles, not themselves.

Summary

The paper investigates the asymmetry in error correction capabilities of large language models (LLMs), showing that these models correct errors more effectively when the errors are attributed to external sources rather than themselves. The study uses a consistent erroneous claim across different roles and finds that relabeling the claim to an external role significantly increases correction rates. The authors propose a prompt-structure intervention to exploit this artifact without requiring model retraining.

Key contributions

Demonstrates the role-label artifact affecting error correction in LLMs.
Proposes a prompt-structure intervention to improve correction rates without retraining.

Notable insights

Relabeling the source of an error from internal to external significantly boosts correction rates in LLMs.
A prompt-structure intervention can exploit this artifact without needing model retraining.

Possible limitations

Not stated in the abstract

Abstract

arXiv:2606.05976v1 Announce Type: new Abstract: Recent work shows that LLM agents struggle to correct errors in their own reasoning traces yet show markedly higher correction rates when identical claims appear under external sources. We ask whether this asymmetry reflects a capability deficit or a role-label artifact: does an agent's willingness to correct a wrong claim depend causally on the chat-template role that carries it, rather than on the claim's content? Our setup keeps the erroneous claim byte-identical across all conditions (SHA-256 verified) and varies only its wrapping role: the agent's own \role{}, a \role{user} message, a \role{tool} response, or a \role{system } block. Across 13 model-domain cells covering seven model families and three domains ($n{=}30$ paired tasks per cell), relabeling the claim from \role{} to an external role lifts the explicit-correction rate by 23 to 93 percentage points, with 10 of 13 cells reaching $p{<}0.001$. Further experiments confirm that the effect is asymmetric, mechanistically decomposable, and robust across domains. The failure to self-correct is not a cognitive deficit; it is a chat-template artifact. We exploit this artifact by designing a prompt-structure-only intervention that requires no training and no model modification, with its strongest role label being domain-dependent: \role{} dominates on math, while a plain \role{user} message dominates on logical deduction.