Curriculum Learning for Safety Alignment

Sandeep Kumar, Virginia Smith, Chhavi Yadav

Published May 27, 2026

Editorial review: 6.8

Relevance: 0.467

Freshness: 0.000

Why It Matters

What makes this one worth your time

Improving the robustness of safety alignment in language models is crucial for deploying them in real-world applications where safety and reliability are paramount.

Curriculum Learning improves safety alignment robustness in language models.

Summary

The paper investigates whether Curriculum Learning can make Direct Preference Optimisation (DPO) more robust for safety alignment in large language models. It introduces a framework called Staged-Competence, which organizes preference data by difficulty, uses competence-based sampling, and progressively updates the reference model during training. Averaged across three model families, the authors report a 16% reduction in out-of-distribution harmful response rates and a 20% reduction in jailbreak attack success rates, while preserving general capabilities with near-zero over-refusal.
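To make the three ingredients concrete, here is a minimal sketch of how a difficulty-ordered curriculum with competence-based sampling and staged reference updates could be wired together. It is not the authors' implementation: the `difficulty` field, the square-root competence schedule, the initial competence `c0`, and the equal-length stages are all assumptions chosen for illustration, and the DPO update itself is left as a comment.

```python
import math
import random


def competence(step, total_steps, c0=0.2):
    """Assumed square-root schedule: fraction of the sorted data unlocked at `step`."""
    frac = min(1.0, step / total_steps)
    return min(1.0, math.sqrt(frac * (1.0 - c0 ** 2) + c0 ** 2))


def sample_batch(sorted_pairs, step, total_steps, batch_size, rng):
    """Sample uniformly from the easiest c(step) fraction of the preference pairs."""
    cutoff = max(1, int(competence(step, total_steps) * len(sorted_pairs)))
    pool = sorted_pairs[:cutoff]
    return [rng.choice(pool) for _ in range(batch_size)]


def run_curriculum(pairs, total_steps, num_stages, batch_size=4, seed=0):
    """Walk through the curriculum and return the steps at which the
    reference model would be refreshed (hypothetical stage boundaries)."""
    rng = random.Random(seed)
    # Easy-to-hard ordering; how difficulty is scored is an assumption here.
    sorted_pairs = sorted(pairs, key=lambda p: p["difficulty"])
    steps_per_stage = max(1, total_steps // num_stages)
    ref_updates = []
    for step in range(total_steps):
        batch = sample_batch(sorted_pairs, step, total_steps, batch_size, rng)
        # A real training loop would take one DPO gradient step on `batch` here.
        if (step + 1) % steps_per_stage == 0:
            # Placeholder for copying the current policy into the reference model.
            ref_updates.append(step + 1)
    return ref_updates


if __name__ == "__main__":
    toy_pairs = [{"id": i, "difficulty": i / 100} for i in range(100)]
    print(run_curriculum(toy_pairs, total_steps=12, num_stages=3))
```

Early in training the sampler only sees the easiest pairs, and the pool widens until the whole dataset is available at the final step. Any monotone schedule would fit the same skeleton.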

Key contributions

Introduction of the Staged-Competence framework for curriculum-based safety alignment.
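Demonstration that curriculum learning reduces OOD harmful response rates by 16% and jailbreak attack success rates by 20%, averaged across three model families.
Evidence that Staged-Competence matches baseline safety with only 75% of the training data and yields better separation between safe and unsafe responses.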

Notable insights

Organizing preference data by difficulty can enhance model robustness against out-of-distribution scenarios.
Progressive updating of the reference model during training is part of the framework that reduces harmful response rates; the abstract reports gains for the combined approach and does not isolate this component.
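For readers unfamiliar with the role of the reference model, the sketch below computes the standard DPO loss for a single preference pair from sequence log-probabilities, and shows what happens to the margin when the reference is replaced by the current policy. The function name, the `beta` value, and the toy log-probabilities are illustrative assumptions; the abstract does not say when or how Staged-Competence refreshes its reference.

```python
import math


def dpo_loss(pol_chosen, pol_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) pair.

    Inputs are sequence log-probabilities under the policy and the reference.
    Returns (loss, margin), where margin is the implicit reward gap.
    """
    margin = beta * ((pol_chosen - ref_chosen) - (pol_rejected - ref_rejected))
    # loss = -log(sigmoid(margin)) = log(1 + exp(-margin)), computed stably
    if margin >= 0:
        loss = math.log1p(math.exp(-margin))
    else:
        loss = -margin + math.log1p(math.exp(margin))
    return loss, margin


if __name__ == "__main__":
    # Toy log-probs: the policy already prefers the safe (chosen) response.
    policy = {"chosen": -5.0, "rejected": -9.0}
    old_ref = {"chosen": -6.0, "rejected": -7.0}

    loss, margin = dpo_loss(policy["chosen"], policy["rejected"],
                            old_ref["chosen"], old_ref["rejected"])
    print(f"old reference: margin={margin:.3f}, loss={loss:.4f}")

    # Refreshing the reference to the current policy resets the margin to zero,
    # so the next stage is measured relative to what was already learned.
    loss, margin = dpo_loss(policy["chosen"], policy["rejected"],
                            policy["chosen"], policy["rejected"])
    print(f"refreshed reference: margin={margin:.3f}, loss={loss:.4f}")
```

The margin returned here is the quantity usually inspected when discussing separation between preferred and dispreferred responses. Whether the paper measures separation this way is not stated in the abstract.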

Possible limitations

Not stated in the abstract

Abstract

arXiv:2605.26315v1

Direct Preference Optimisation (DPO) is widely used for safety alignment in large language models. However, prior work shows it is brittle and exhibits poor out-of-distribution (OOD) generalisation. In this paper, we investigate whether Curriculum Learning can improve the robustness of DPO-based safety alignment. We propose Staged-Competence, a curriculum-based framework that organises preference data by difficulty, employs competence-based sampling, and progressively updates the reference model during training. Averaged across three model families, Staged-Competence reduces OOD harmful response rates by 16% and jailbreak attack success rates by 20%, while preserving general capabilities with near-zero over-refusal. We further show that Staged-Competence (1) matches baseline safety with only 75% of the training data and (2) yields better separation between safe and unsafe responses. Staged-Competence is agnostic to the policy optimisation loss and can extend to other DPO variants and alignment domains. Our code and data are available at https://github.com/Sandeep5500/curriculum-learning-for-safety.