Efficiently Aligning Language Models with Online Natural Language Feedback
Christine Ye, Joe Benton
Why It Matters
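Reinforcement learning with verifiable rewards works well where outputs can be checked automatically, but many useful domains are "fuzzy": human experts can still judge quality, yet only for a small number of model outputs. This paper targets that setting.
It reports that online natural language feedback, used to build and refresh a proxy reward model, lets training reach the same performance with several times fewer expert samples.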
Summary
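The paper presents methods for aligning language models with online natural language feedback in fuzzy domains where experts can supervise only a small number of model outputs. Training optimizes against a proxy reward built from a language model and periodically refreshes that proxy with new expert supervision. In the reported experiments, fine-tuned proxy rewards recover 100% of performance with 3x fewer expert samples on Qwen3-8B and 10x fewer on Haiku 4.5.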
Key contributions
- Development of methods for aligning language models with online natural language feedback.
- Demonstration of improved data efficiency in model training using proxy rewards.
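- Empirical results on Qwen3-8B (creative writing) and Haiku 4.5 (alignment research): fine-tuned proxy rewards recover 100% of performance with 3x and 10x fewer expert samples, respectively.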
Notable insights
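- Proxy reward models are built from language models in two ways, in-context learning (ICL) and fine-tuning. In the reported results, ICL recovers up to 35% of performance at the largest sample reductions (50x and 30x), while fine-tuning recovers 80% to 100% at smaller reductions (20x down to 3x).
- Each round of optimization stops at the point of over-optimization; fresh expert supervision is then collected and the proxy reward is updated before training continues, so the model is not pushed further against a proxy it has begun to exploit.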
Possible limitations
- The abstract reports results for only two model and domain pairs and does not discuss how the methods generalize to other domains or how the approach scales.
Abstract
arXiv:2605.04356v2
Reinforcement learning with verifiable rewards has been used to elicit impressive performance from language models in many domains. But, broadly beneficial deployments of AI may require us to train models with strong capabilities in "fuzzy", hard-to-supervise domains. In this paper, we develop methods to align language models in fuzzy domains where human experts are still able to provide high-quality supervision signal, but only for a small number of model outputs, using online natural language feedback. Specifically, we train models by iteratively optimizing against proxy reward signals, stopping at the point of over-optimization, collecting fresh expert supervision, and updating the proxy reward. We construct proxy reward models from language models using in-context learning (ICL) and fine-tuning. We test our methods by eliciting creative writing and alignment research capabilities in Qwen3-8B and Haiku 4.5 respectively. For Qwen3-8B, ICL methods recover up to 35% of performance with 50x fewer expert samples, while fine-tuning methods recover 80% with up to 20x fewer samples and 100% with 3x fewer samples. For Haiku 4.5, ICL methods recover up to 35% of performance with 30x fewer samples, and fine-tuning methods recover 100% with 10x fewer samples. Our results suggest that online natural language feedback can substantially improve the data efficiency of expert supervision.
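The iterative procedure in the abstract (optimize against a proxy reward, stop at over-optimization, collect fresh expert supervision, update the proxy) can be illustrated with a toy sketch. This is not the authors' implementation: the policy is a single number, `expert_score` is a hypothetical stand-in for expert feedback, the proxy is a linear fit rather than a language model, and "over-optimization" is approximated by an assumed trust radius around the point where the proxy was last fit. The abstract does not say how the stopping point is detected. The loop collects labels first only because the initial proxy needs some data.

```python
from typing import Callable, List, Tuple

Labels = List[Tuple[float, float]]  # (policy output, expert score) pairs


def expert_score(x: float) -> float:
    """Hypothetical stand-in for scarce expert feedback; the best output is x = 3."""
    return -(x - 3.0) ** 2


def fit_linear_proxy(labels: Labels) -> Callable[[float], float]:
    """Fit a linear proxy reward through the two most recent expert labels."""
    (x0, y0), (x1, y1) = labels[-2:]
    slope = (y1 - y0) / (x1 - x0)
    return lambda x: y0 + slope * (x - x0)


def train(x: float, rounds: int = 5, radius: float = 1.0,
          step: float = 0.25, delta: float = 0.5) -> Tuple[float, int]:
    """Toy loop; returns the final policy value and the number of expert queries."""
    expert_calls = 0
    labels: Labels = []
    for _ in range(rounds):
        # 1. Collect fresh expert supervision near the current policy.
        for probe in (x - delta, x + delta):
            labels.append((probe, expert_score(probe)))
            expert_calls += 1
        # 2. Update the proxy reward from the new labels.
        proxy = fit_linear_proxy(labels)
        # 3. Optimize against the proxy, stopping at the assumed
        #    over-optimization point: the edge of the trust radius,
        #    beyond which the local linear proxy is no longer trusted.
        anchor = x
        while abs(x - anchor) < radius:
            gain_up = proxy(x + step) - proxy(x)
            if gain_up == 0:
                break  # proxy is flat here, so there is nothing left to climb
            x += step if gain_up > 0 else -step
    return x, expert_calls


if __name__ == "__main__":
    final_x, calls = train(0.0)
    print(f"final policy: {final_x}, expert queries used: {calls}")
```

In this toy the expert is queried only twice per round, and every optimization step in between runs against the proxy alone, which is the source of the data efficiency the abstract describes. Without the trust radius, the linear proxy would keep rewarding movement past x = 3, a simple analogue of over-optimization.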