Aligned but Fragile: Enhancing LLM Safety Robustness via Zeroth-Order Optimization

Zhihao Liu, Yifan Wu, Jian Lou, Di Wang, Yuxi Zhou, Yuke Hu

Published May 29, 2026

Editorial review: 7.2

Relevance: 0.498

Freshness: 0.000

Why It Matters

What makes this one worth your time

Safety alignment in LLMs can be weakened by lightweight post-alignment changes such as parameter noise, activation noise, or quantization. Understanding and improving its robustness matters for building AI systems that stay safe under such perturbations without losing general utility.

This work introduces a novel optimizer-centric approach to improve the robustness of safety alignment in LLMs.

Summary

The paper investigates the robustness of safety alignment in large language models by focusing on the base optimizer, proposing a hybrid framework that combines first-order safety alignment with zeroth-order optimization to enhance robustness against perturbations.
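As a rough illustration of the two-stage idea, the sketch below runs plain gradient descent on a toy quadratic loss and then applies a few two-point zeroth-order steps. The toy loss, the names `hybrid_align` and `zo_gradient`, the step sizes, and the random-direction estimator are all assumptions made for illustration; the abstract does not specify the paper's estimator or hyperparameters.

```python
import random

def loss(theta):
    # Hypothetical stand-in for a safety-alignment loss (minimum at all ones).
    return sum((t - 1.0) ** 2 for t in theta)

def grad(theta):
    # Analytic gradient of the toy loss, used only in the first-order stage.
    return [2.0 * (t - 1.0) for t in theta]

def zo_gradient(loss_fn, theta, rng, mu=1e-2):
    # Two-point estimate: evaluate the loss at theta + mu*u and theta - mu*u
    # for a random Gaussian direction u, using no backpropagation.
    u = [rng.gauss(0.0, 1.0) for _ in theta]
    plus = loss_fn([t + mu * ui for t, ui in zip(theta, u)])
    minus = loss_fn([t - mu * ui for t, ui in zip(theta, u)])
    scale = (plus - minus) / (2.0 * mu)
    return [scale * ui for ui in u]

def hybrid_align(theta0, fo_steps=50, zo_steps=5, fo_lr=0.1, zo_lr=0.01, seed=0):
    rng = random.Random(seed)
    theta = list(theta0)
    # Stage 1 (assumed): standard first-order alignment.
    for _ in range(fo_steps):
        theta = [t - fo_lr * g for t, g in zip(theta, grad(theta))]
    # Stage 2 (assumed): a few zeroth-order refinement steps.
    for _ in range(zo_steps):
        g_hat = zo_gradient(loss, theta, rng)
        theta = [t - zo_lr * g for t, g in zip(theta, g_hat)]
    return theta

theta = hybrid_align([0.0] * 8)
print(loss(theta))
```

Because each zeroth-order step only needs loss values at perturbed parameters, the update signal already reflects how the loss behaves in a neighborhood of the current weights, which is the intuition the summary attributes to the optimizer-centric view.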

Key contributions

Introduces a hybrid framework that integrates first-order safety alignment with zeroth-order refinement.
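Shows, both theoretically and empirically, that a few steps of zeroth-order refinement can enhance robustness while preserving safety alignment.
Reuses the perturbation-based evaluations of zeroth-order refinement to estimate layer-wise robustness sensitivity, so that updates concentrate on robustness-critical layers with modest training overhead.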
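One way to picture the layer-wise sensitivity estimate is to perturb one layer at a time and record how much the loss moves. The sketch below does this on a toy model; the `CURVATURE` table, the scoring rule (mean absolute loss change), and the helper names `layer_sensitivity` and `top_k_layers` are hypothetical and are not taken from the paper.

```python
import random

# Hypothetical per-layer curvatures for a toy loss; larger means more fragile.
CURVATURE = {"layer0": 0.1, "layer1": 5.0, "layer2": 1.0}

def loss(params):
    # Toy loss: a separate quadratic bowl per layer, scaled by its curvature.
    return sum(CURVATURE[name] * sum(w * w for w in weights)
               for name, weights in params.items())

def layer_sensitivity(loss_fn, params, mu=1e-2, samples=32, seed=0):
    # Perturb one layer at a time with Gaussian noise of scale mu and
    # average the absolute change in loss over several samples.
    rng = random.Random(seed)
    base = loss_fn(params)
    scores = {}
    for name, weights in params.items():
        total = 0.0
        for _ in range(samples):
            perturbed = dict(params)
            perturbed[name] = [w + mu * rng.gauss(0.0, 1.0) for w in weights]
            total += abs(loss_fn(perturbed) - base)
        scores[name] = total / samples
    return scores

def top_k_layers(scores, k):
    # Layers with the largest scores would receive the refinement updates.
    return sorted(scores, key=scores.get, reverse=True)[:k]

params = {name: [0.0] * 16 for name in CURVATURE}
scores = layer_sensitivity(loss, params)
print(top_k_layers(scores, 2))
```

In a real zeroth-order loop the perturbed loss values are computed anyway for the gradient estimate, so a score of this kind could plausibly be accumulated at little extra cost; how the paper actually does this is not described in the abstract.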

Notable insights

The paper highlights the fragility of safety alignment in LLMs and the overlooked role of the optimizer in enhancing robustness.
Zeroth-order optimization is proposed as a method to evaluate and refine safety alignment under perturbations, which is a novel perspective in this context.

Possible limitations

Not stated in the abstract.

Abstract

arXiv:2605.29396v1

Safety alignment for large language models (LLMs) aims to reduce harmful or unsafe behavior while preserving general utility. However, recent findings reveal that alignment effects can be fragile: lightweight post-alignment manipulations, such as parameter noise, activation noise, or quantization, can easily weaken the intended safety behavior. Prior efforts to improve robustness have primarily focused on data curation, modified alignment objectives, and safety-critical parameter identification, leaving the role of the optimizer itself largely unexplored. In this paper, we are the first to study the robustness of safety alignment from the perspective of the base optimizer. This optimizer-centric view naturally points to zeroth-order optimization, which provides a robustness-oriented signal by evaluating safety alignment under perturbations. Based on this insight, we propose a hybrid framework that first performs standard first-order safety alignment and then applies zeroth-order refinement to improve robustness. Both theoretically and empirically, we show that only a few zeroth-order refinement steps can enhance robustness while preserving safety alignment. We further improve the efficiency of zeroth-order refinement by exploiting its inherent perturbation-based evaluations to estimate layer-wise robustness sensitivity, enabling the refinement process to concentrate updates on robustness-critical layers with modest training overhead.
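To make the fragility claim concrete, the sketch below measures how much a toy loss rises after two of the post-alignment manipulations named in the abstract: parameter noise and quantization. The `safety_loss` function, the uniform symmetric quantizer, and the `robustness_gap` metric are illustrative assumptions, not the paper's evaluation protocol.

```python
import random

TARGET = [0.3, -0.7, 0.5, 0.1]

def safety_loss(weights):
    # Hypothetical stand-in: zero when the weights match the aligned target.
    return sum((w - t) ** 2 for w, t in zip(weights, TARGET))

def quantize(weights, bits=4):
    # Uniform symmetric quantization onto a grid set by the largest weight.
    levels = 2 ** (bits - 1) - 1
    peak = max(abs(w) for w in weights)
    if peak == 0.0:
        return list(weights)
    step = peak / levels
    return [round(w / step) * step for w in weights]

def add_noise(weights, sigma, rng):
    # Independent Gaussian noise on every parameter.
    return [w + rng.gauss(0.0, sigma) for w in weights]

def robustness_gap(loss_fn, weights, perturb):
    # Increase in loss caused by the perturbation; smaller is more robust.
    return loss_fn(perturb(weights)) - loss_fn(weights)

aligned = list(TARGET)
rng = random.Random(0)
print(robustness_gap(safety_loss, aligned, lambda w: quantize(w, bits=2)))
print(robustness_gap(safety_loss, aligned, lambda w: quantize(w, bits=8)))
print(robustness_gap(safety_loss, aligned, lambda w: add_noise(w, 0.1, rng)))
```

A gap of this kind, computed before and after refinement, is one simple way a reader could compare how well two sets of weights tolerate the same perturbation.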