HERMES: Towards Efficient and Verifiable Mathematical Reasoning in LLMs
Azim Ospanov, Zijin Feng, Jiacheng Sun, Haoli Bai, Xin Shen, Farzan Farnia
Why It Matters
What makes this one worth your time
This work is relevant for AI researchers and engineers focused on improving the accuracy and efficiency of mathematical reasoning in LLMs, offering a novel approach to combine informal and formal reasoning paradigms.
Hermes enhances LLM mathematical reasoning by integrating informal reasoning with formal verification.
Summary
The paper introduces Hermes, a tool-assisted agent that combines informal reasoning with formal verification in Lean to enhance mathematical reasoning in large language models (LLMs). Hermes aims to improve reasoning accuracy and efficiency by interleaving informal reasoning with formal proofs, using intermediate formal checks and a memory module for continuity. The framework is evaluated on challenging mathematical benchmarks, showing significant improvements in accuracy and computational efficiency.
Key contributions
- Introduction of Hermes, a framework that combines informal reasoning with formal verification in Lean.
- Demonstrated significant improvements in reasoning accuracy and computational efficiency on challenging benchmarks.
- Public release of the implementation and codebase for broader use and validation.
Notable insights
- Interleaving informal reasoning with formal verification can prevent reasoning drift and improve accuracy.
- A memory module can maintain proof continuity across multi-step reasoning chains, enhancing exploration and verification.
Possible limitations
- Not stated in the abstract
Abstract
arXiv:2511.18760v2 Announce Type: replace Abstract: Informal mathematics has been central to modern large language model (LLM) reasoning, offering flexibility and efficient construction of arguments. However, purely informal reasoning is prone to logical gaps and subtle errors that are difficult to detect and correct. In contrast, formal theorem proving provides rigorous, verifiable mathematical reasoning, where each inference step is checked by a trusted compiler, but lacks the exploratory freedom of informal problem-solving. This mismatch leaves current LLM-based math agents without a principled way to combine the strengths of both paradigms. In this work, we introduce Hermes, the first tool-assisted agent that explicitly interleaves informal reasoning with formally verified proofs in Lean. The framework performs intermediate formal checking to prevent reasoning drift and a memory module for proof continuity across multi-step reasoning chains, enabling both exploration and verification. We evaluate Hermes on four challenging mathematical reasoning benchmarks using LLMs of varying parameter scales, from small models to state-of-the-art systems. Across all settings, Hermes reliably improves the reasoning accuracy of base models while substantially reducing reasoning token usage and computational cost compared to reward-based approaches. On difficult datasets such as AIME and HARDMath2, Hermes@1 achieves up to a 40% accuracy improvement while using 80% fewer total inference FLOPs. When scaled at test time, Hermes@5 boosts accuracy further by 20%. The implementation and codebase are publicly available at https://github.com/aziksh-ospanov/HERMES.