
SafeSteer: Localized On-Policy Distillation for Efficient Safety Alignment

Hao Li, Jingkun An, Zijun Song, Pengyu Zhu, Rui Li, Hao Wang, Wendi Feng, Yesheng Liu, Lijun Li, Jin-Ge Yao, Lei Sha

Published Jun 2, 2026
Editorial review: 7.2
Relevance: 0.496
Freshness: 0.000

Why It Matters

What makes this one worth your time

This work is relevant for AI engineers and researchers seeking efficient methods to align language models with human values without significantly degrading their general capabilities.

SafeSteer takes a localized approach to aligning language models with human values: it restricts training to safety tokens, which reduces alignment cost.

Summary

The paper introduces SafeSteer, a method for aligning large language models with human values through localized modifications rather than global trade-offs between safety and general capability. It performs on-policy distillation confined to safety tokens: a safety teacher is constructed via activation steering, a selection algorithm based on that teacher identifies the safety tokens, and the reverse KL penalty is applied only to those tokens during training. The approach aims to preserve general capabilities while improving safety performance, and it requires only 100 harmful samples, which the authors report as less than 1% of what previous baselines used.
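To make the idea of a reverse KL penalty restricted to selected tokens concrete, here is a minimal sketch in plain Python. It is not the authors' implementation. It assumes that per-position student and teacher logits have already been computed on a response sampled from the student (the on-policy part), and that a 0/1 `safety_mask` is supplied by some selection step. The function names and the simple averaging are illustrative assumptions.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over one vocabulary-sized list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def reverse_kl(student_logits, teacher_logits):
    """KL(student || teacher) at a single token position."""
    log_s = log_softmax(student_logits)
    log_t = log_softmax(teacher_logits)
    return sum(math.exp(ls) * (ls - lt) for ls, lt in zip(log_s, log_t))

def masked_reverse_kl(student_seq, teacher_seq, safety_mask):
    """Average reverse KL over the positions flagged in safety_mask.

    Positions with mask 0 contribute nothing, so the student's distribution
    there is left untouched by this loss term (assumed design, for illustration).
    """
    selected = [
        reverse_kl(s, t)
        for s, t, keep in zip(student_seq, teacher_seq, safety_mask)
        if keep
    ]
    return sum(selected) / len(selected) if selected else 0.0

# Toy example: two positions, vocabulary of size 3.
student = [[2.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
teacher = [[2.0, 0.0, 0.0], [0.0, 0.0, 3.0]]
loss_on_second = masked_reverse_kl(student, teacher, [0, 1])
loss_on_first = masked_reverse_kl(student, teacher, [1, 0])
```

In this toy example the teacher and student disagree only at the second position, so the loss is positive when that position is masked in and zero when only the first position is selected.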

Key contributions

  • Introduction of SafeSteer, a localized on-policy distillation method for safety alignment.
  • Development of a safety token selection algorithm based on a safety teacher.
  • Demonstration of reduced alignment cost: only 100 harmful samples and no general-purpose data are required.
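The abstract does not describe how the safety token selection algorithm works. Purely as an illustration of what a teacher-based selection step could look like, the sketch below flags the `k` positions where a per-token teacher-student divergence score is largest. The top-k criterion, the function name, and the score inputs are all hypothetical assumptions and should not be read as the paper's algorithm.

```python
def select_safety_tokens(divergences, k):
    """Return a 0/1 mask flagging the k positions with the largest divergence.

    HYPOTHETICAL criterion for illustration only: `divergences` holds one
    non-negative teacher-student divergence score per token position.
    """
    if not divergences or k <= 0:
        return [0] * len(divergences)
    ranked = sorted(range(len(divergences)),
                    key=lambda i: divergences[i], reverse=True)
    chosen = set(ranked[:k])
    return [1 if i in chosen else 0 for i in range(len(divergences))]

# Toy scores for a five-token response.
scores = [0.01, 2.3, 0.0, 1.1, 0.02]
mask = select_safety_tokens(scores, k=2)
```

A mask produced this way could then gate a per-token distillation loss so that only the flagged positions are trained on.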

Notable insights

  • The safety teacher is constructed via activation steering, and it then guides both safety token selection and distillation.
  • Focusing on safety tokens rather than global trade-offs could reduce the alignment tax while preserving model capabilities.
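For readers unfamiliar with activation steering, the sketch below shows one common generic form of the technique: compute a difference-of-means direction between two sets of hidden activations and add a scaled copy of it to a hidden state. Whether SafeSteer builds its safety teacher this way is not stated in the abstract. The function names, the difference-of-means choice, and the scale `alpha` are assumptions made for illustration.

```python
def mean_vector(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def steering_vector(safe_acts, unsafe_acts):
    """Difference-of-means direction pointing from unsafe toward safe activations.

    ASSUMPTION: one common way to obtain a steering direction, not
    necessarily the construction used in the paper.
    """
    mu_safe = mean_vector(safe_acts)
    mu_unsafe = mean_vector(unsafe_acts)
    return [s - u for s, u in zip(mu_safe, mu_unsafe)]

def steer(hidden, direction, alpha=1.0):
    """Shift one hidden state along the steering direction by a factor alpha."""
    return [h + alpha * d for h, d in zip(hidden, direction)]

# Toy two-dimensional activations.
safe_acts = [[1.0, 0.0], [3.0, 0.0]]
unsafe_acts = [[0.0, 1.0], [0.0, 3.0]]
direction = steering_vector(safe_acts, unsafe_acts)
steered = steer([0.5, 0.5], direction, alpha=0.5)
```

In a real model, the same shift would be applied to a chosen layer's activations at inference time, so that the steered model can serve as a teacher without any additional training.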

Possible limitations

  • Not stated in the abstract

Abstract

arXiv:2606.02530v1

Aligning Large Language Models (LLMs) with human values often degrades their general capabilities, termed the alignment tax. Existing methods mitigate this by balancing dual objectives, which heavily rely on massive general-purpose data or auxiliary reward models. In this paper, we argue that, because safety features are inherently sparse within the output distribution, alignment requires localized modifications rather than global trade-offs. To this end, we propose SafeSteer, which performs on-policy distillation confined to safety tokens. First, we construct a safety teacher via activation steering. Based on this teacher, we develop a safety token selection algorithm. Consequently, SafeSteer restricts the reverse KL penalty to these tokens during training to preserve general capabilities. Experimental results across diverse models show that our SafeSteer achieves a superior trade-off between safety and general capability compared with existing methods, attaining strong safety performance on seven safety benchmarks with only minimal degradation on five general capability benchmarks. Notably, SafeSteer requires only 100 harmful samples without using any general-purpose data, less than 1% of what previous baselines used, considerably reducing alignment cost. More details are on our project page at https://anjingkun.github.io/SafeSteer.