Latent Reward Steering: An Adaptive Inference-Time Framework that Implicitly Promotes Cognitive Behaviors in Reasoning LLMs
Jiakang Li, Guanyu Zhu, Can Jin, Chenxi Huang, Dexu Yu, Ronghao Chen, Yang Zhou, Hongwu Peng, Xuanqi Lan, Dimitris N. Metaxas, Youhua Li
Why It Matters
What makes this one worth your time
Because LRS intervenes at inference time and tailors each correction to the current reasoning state, it could make reasoning LLMs more robust across tasks and model backbones.
LRS offers a novel approach to improve reasoning in LLMs by optimizing latent states adaptively.
Summary
The paper introduces Latent Reward Steering (LRS), an adaptive framework that optimizes latent states in reasoning LLMs to enhance cognitive behaviors during inference, addressing limitations of existing explicit control methods.
Key contributions
- Development of the Latent Reward Steering framework for adaptive inference-time cognitive behavior promotion.
- Introduction of a latent reward model trained on reasoning traces for state quality estimation.
- Implementation of a reward and confidence gate to selectively intervene in fragile latent states.
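To make the second contribution concrete, here is a minimal sketch of fitting a latent reward model from traces labeled by final answer correctness. It is an illustration under stated assumptions, not the authors' implementation: the logistic model, the toy two-dimensional "latent states", and the names `predict_reward` and `train_latent_reward_model` are all hypothetical, and the abstract does not specify the reward model's architecture or training objective.

```python
import math
import random


def sigmoid(x: float) -> float:
    # Numerically stable logistic function.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def predict_reward(w, b, z):
    """Estimated probability that latent state z belongs to a correct trace."""
    return sigmoid(sum(wi * zi for wi, zi in zip(w, z)) + b)


def train_latent_reward_model(states, labels, lr=0.5, epochs=200):
    """Fit a logistic reward model by batch gradient descent on log loss.

    states: list of latent vectors taken from reasoning traces.
    labels: 1 if the trace's final answer was correct, else 0.
    (Assumption: every state inherits the label of its trace.)
    """
    dim = len(states[0])
    n = len(states)
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for z, y in zip(states, labels):
            err = predict_reward(w, b, z) - y
            for i in range(dim):
                grad_w[i] += err * z[i]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


# Toy data (fabricated for illustration only): states from correct traces
# have a higher value in the first latent dimension.
random.seed(0)
labels = [1] * 50 + [0] * 50
states = [
    [random.gauss(1.0 if y else -1.0, 0.5), random.gauss(0.0, 0.5)]
    for y in labels
]
w, b = train_latent_reward_model(states, labels)
```

After training, `predict_reward(w, b, z)` scores a new latent state, and that score is the quantity an inference-time method could then try to increase.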
Notable insights
- LRS estimates the quality of intermediate latent states with a reward model trained on reasoning traces labeled by final answer correctness, rather than relying on predefined cognitive behaviors or steering directions derived from them.
- A reward and confidence gate restricts intervention to states flagged as fragile, so the correction is targeted instead of being applied at every step.
Possible limitations
- Not stated in the abstract.
Abstract
arXiv:2606.00726v1
Strong reasoning depends not only on model knowledge but also on how effectively cognitive behaviors are deployed during generation. Existing methods often rely on explicit behavior-level control, making them insufficiently adaptive when failures and required corrections vary across reasoning states, tasks, and models. To this end, we propose Latent Reward Steering (LRS), an adaptive inference-time framework that promotes cognitive behaviors by optimizing the sparse-autoencoder (SAE) latent states that implicitly carry them. Rather than relying on predefined cognitive behaviors or steering directions derived from them, LRS trains a latent reward model on reasoning traces by final answer correctness to estimate the quality of intermediate latent states. During inference, reward gradients provide state-specific correction directions for fragile latent states, while a reward and confidence gate restricts intervention to states the reward signal flags as fragile. Experiments on multiple reasoning LLM backbones and benchmarks show that LRS consistently improves performance over various baselines, and post-hoc analyses further indicate that LRS implicitly promotes good cognitive behaviors that fix the original reasoning errors. Code is available at: https://github.com/jiakanglee/Latent-Reward-Steering.
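The inference-time step described in the abstract (a reward gradient as the correction direction, applied only when a gate flags the state as fragile) can be sketched as follows. This is a hedged illustration, not the released code: the logistic reward model, the thresholds, the step size, the way confidence enters the gate (low reward and low confidence together), and the function name `steer` are all assumptions, and the SAE encode and decode steps around the latent vector are omitted.

```python
import math


def sigmoid(x: float) -> float:
    # Numerically stable logistic function.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def reward(w, b, z):
    """Hypothetical latent reward model: logistic score of latent state z."""
    return sigmoid(sum(wi * zi for wi, zi in zip(w, z)) + b)


def reward_gradient(w, b, z):
    """Gradient of the logistic reward with respect to z: r * (1 - r) * w."""
    r = reward(w, b, z)
    return [r * (1.0 - r) * wi for wi in w]


def steer(z, w, b, confidence,
          reward_threshold=0.5, confidence_threshold=0.7, step_size=1.0):
    """Apply one gradient-ascent step on the reward, but only for fragile states.

    Assumption: a state counts as fragile when both the estimated reward and
    the model's confidence fall below their (hypothetical) thresholds.
    Returns the (possibly updated) latent state and whether it was steered.
    """
    fragile = (reward(w, b, z) < reward_threshold
               and confidence < confidence_threshold)
    if not fragile:
        return list(z), False
    grad = reward_gradient(w, b, z)
    return [zi + step_size * gi for zi, gi in zip(z, grad)], True


# Illustrative values only.
w, b = [2.0, -1.0], 0.0
fragile_state = [-1.0, 0.5]
healthy_state = [1.0, 0.0]

steered, was_steered = steer(fragile_state, w, b, confidence=0.4)
untouched, was_touched = steer(healthy_state, w, b, confidence=0.4)
```

Because the correction direction is recomputed from the gradient at each state, it differs from state to state, which is the contrast the abstract draws with fixed, predefined steering directions.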