Latent Reward Steering: An Adaptive Inference-Time Framework that Implicitly Promotes Cognitive Behaviors in Reasoning LLMs

Jiakang Li, Guanyu Zhu, Can Jin, Chenxi Huang, Dexu Yu, Ronghao Chen, Yang Zhou, Hongwu Peng, Xuanqi Lan, Dimitris N. Metaxas, Youhua Li

Published Jul 13, 2026. Featured #10 in the daily list of Jul 14, 2026.


Daily score: 53.6

Editorial review: 6.8

Relevance: 0.452

Freshness: 0.722

Why It Matters


If its reported gains hold up, LRS could make reasoning LLMs more accurate by correcting faulty intermediate reasoning states at inference time, without hand-specifying which cognitive behaviors (such as verification or backtracking) the model should use.

Latent Reward Steering improves LLM reasoning by using a learned reward model to steer sparse-autoencoder (SAE) latent states toward those associated with correct final answers.

Summary

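The paper introduces Latent Reward Steering (LRS), an adaptive inference-time framework for promoting cognitive behaviors in reasoning large language models (LLMs). Rather than steering toward predefined behaviors, LRS optimizes the sparse-autoencoder (SAE) latent states that implicitly carry them. It trains a latent reward model on reasoning traces, labeled by final-answer correctness, to estimate the quality of intermediate latent states. During inference, the reward model's gradients supply state-specific correction directions, and a reward and confidence gate limits intervention to states the reward signal flags as fragile. Experiments on multiple reasoning LLM backbones and benchmarks show consistent gains over various baselines, and post-hoc analyses suggest that LRS fixes the original reasoning errors by implicitly promoting good cognitive behaviors, without any predefined behavior list.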
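The reward-model training step can be illustrated with a small sketch. It assumes, hypothetically, that the latent reward model is a linear-logistic head over SAE latent vectors and that every intermediate state in a trace inherits that trace's final-answer label; the abstract does not specify the reward model's architecture or loss, and the names `LinearLatentReward` and `label_states` are illustrative only.

```python
import math
from typing import List, Sequence, Tuple

Latent = List[float]
Trace = Tuple[List[Latent], bool]  # (intermediate SAE latent states, final answer correct?)


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def label_states(traces: Sequence[Trace]) -> List[Tuple[Latent, float]]:
    """Assumption: each intermediate state inherits its trace's final-answer label."""
    return [(z, 1.0 if correct else 0.0) for states, correct in traces for z in states]


class LinearLatentReward:
    """Hypothetical latent reward model r(z) = sigmoid(w . z + b)."""

    def __init__(self, dim: int) -> None:
        self.w = [0.0] * dim
        self.b = 0.0

    def score(self, z: Sequence[float]) -> float:
        return sigmoid(sum(wi * zi for wi, zi in zip(self.w, z)) + self.b)

    def fit(self, data: Sequence[Tuple[Latent, float]],
            epochs: int = 200, lr: float = 0.5) -> "LinearLatentReward":
        # Plain SGD on binary cross-entropy; d(BCE)/d(logit) = p - y.
        for _ in range(epochs):
            for z, y in data:
                g = self.score(z) - y
                self.w = [wi - lr * g * zi for wi, zi in zip(self.w, z)]
                self.b -= lr * g
        return self


# Toy data: 3 SAE latents; latent 0 is imagined as a "verification" feature.
traces = [
    ([[1.0, 0.0, 0.5], [0.9, 0.1, 0.5]], True),
    ([[0.0, 1.0, 0.5], [0.1, 0.9, 0.5]], False),
]
rm = LinearLatentReward(dim=3).fit(label_states(traces))
print(round(rm.score([1.0, 0.0, 0.5]), 3), round(rm.score([0.0, 1.0, 0.5]), 3))
```

Labeling every step with the trace outcome is the simplest credit-assignment choice; a real system might weight or filter steps differently.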

Key contributions

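Latent Reward Steering (LRS), an adaptive inference-time framework that promotes cognitive behaviors by optimizing SAE latent states rather than predefined behavior directions.
A latent reward model, trained on reasoning traces labeled by final-answer correctness, whose gradients provide state-specific corrections.
A reward and confidence gate that restricts intervention to latent states the reward signal flags as fragile.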

Notable insights

LRS uses a latent reward model to estimate the quality of intermediate latent states, allowing for adaptive corrections.
The framework avoids reliance on predefined cognitive behaviors, potentially increasing its flexibility across different tasks and models.
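One way to picture the adaptive correction is gradient ascent on the reward in latent space. This sketch assumes, hypothetically, the linear-logistic reward r(z) = sigmoid(w · z + b), a few fixed-size ascent steps, clipping so SAE latents stay non-negative (as with ReLU encoders), and mapping only the latent change back to the hidden state through the SAE decoder. The step size, step count, and helper names (`latent_reward`, `correct_latent`, `apply_to_hidden`) are illustrative assumptions, not the paper's settings.

```python
import math
from typing import List, Sequence


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def latent_reward(w: Sequence[float], b: float, z: Sequence[float]) -> float:
    """Hypothetical reward head r(z) = sigmoid(w . z + b)."""
    return sigmoid(sum(wi * zi for wi, zi in zip(w, z)) + b)


def correct_latent(w: Sequence[float], b: float, z: Sequence[float],
                   step_size: float = 1.0, n_steps: int = 5) -> List[float]:
    """Gradient ascent on r(z); for the logistic head, dr/dz = r (1 - r) w."""
    z = list(z)
    for _ in range(n_steps):
        r = latent_reward(w, b, z)
        # Clip at zero: assumes ReLU-style SAE latents are non-negative.
        z = [max(0.0, zi + step_size * r * (1.0 - r) * wi) for zi, wi in zip(z, w)]
    return z


def apply_to_hidden(h: Sequence[float], W_dec: Sequence[Sequence[float]],
                    z_old: Sequence[float], z_new: Sequence[float]) -> List[float]:
    """h' = h + W_dec^T (z_new - z_old), with one decoder row per latent.
    Adding only the decoded change leaves the SAE reconstruction error in h intact."""
    delta = [zn - zo for zn, zo in zip(z_new, z_old)]
    return [hj + sum(d * row[j] for d, row in zip(delta, W_dec)) for j, hj in enumerate(h)]


w, b = [2.0, -2.0, 0.0], 0.0                  # toy reward weights
z = [0.1, 0.8, 0.3]                           # a low-reward ("fragile") latent state
z_new = correct_latent(w, b, z)
W_dec = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]  # toy decoder: 3 latents -> 2-dim hidden
h_new = apply_to_hidden([0.0, 0.0], W_dec, z, z_new)
print(latent_reward(w, b, z), latent_reward(w, b, z_new), h_new)
```

Because the correction direction is the reward gradient at the current state, different states receive different edits, which is the sense in which the correction is state-specific.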

Possible limitations

Not stated in the abstract

Abstract

arXiv:2606.00726v2. Strong reasoning depends not only on model knowledge but also on how effectively cognitive behaviors are deployed during generation. Existing methods often rely on explicit behavior-level control, making them insufficiently adaptive when failures and required corrections vary across reasoning states, tasks, and models. To this end, we propose Latent Reward Steering (LRS), an adaptive inference-time framework that promotes cognitive behaviors by optimizing the sparse-autoencoder (SAE) latent states that implicitly carry them. Rather than relying on predefined cognitive behaviors or steering directions derived from them, LRS trains a latent reward model on reasoning traces, labeled by final-answer correctness, to estimate the quality of intermediate latent states. During inference, reward gradients provide state-specific correction directions for fragile latent states, while a reward and confidence gate restricts intervention to states the reward signal flags as fragile. Experiments on multiple reasoning LLM backbones and benchmarks show that LRS consistently improves performance over various baselines, and post-hoc analyses further indicate that LRS implicitly promotes good cognitive behaviors that fix the original reasoning errors. Code is available at: https://github.com/jiakanglee/Latent-Reward-Steering.
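The reward and confidence gate could look roughly like the sketch below. The abstract does not define "confidence", so this assumes, hypothetically, that it is the LLM's top-1 next-token probability, and that intervention fires only when the latent reward is below one threshold and that confidence is below another. `top1_probability`, `should_intervene`, and both threshold values are illustrative, not taken from the paper.

```python
import math
from typing import Sequence


def top1_probability(logits: Sequence[float]) -> float:
    """Max softmax probability, computed stably by subtracting the max logit."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    return max(exps) / sum(exps)


def should_intervene(reward: float, logits: Sequence[float],
                     reward_threshold: float = 0.5,
                     confidence_threshold: float = 0.9) -> bool:
    """Hypothetical gate: steer only low-reward states where the model is not
    already confident about its next token."""
    low_reward = reward < reward_threshold
    low_confidence = top1_probability(logits) < confidence_threshold
    return low_reward and low_confidence


print(should_intervene(0.2, [1.0, 0.9, 0.8]))   # True: low reward, uncertain step
print(should_intervene(0.8, [1.0, 0.9, 0.8]))   # False: reward already high
print(should_intervene(0.2, [10.0, 0.0, 0.0]))  # False: model already confident
```

Under these assumptions, gating leaves confident, high-reward steps untouched, which presumably limits how often steering disturbs reasoning that was already sound.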