MESA: Improving MoE Safety Alignment via Decentralized Expertise

Yitong Sun, Yao Huang, Teng Li, Ranjie Duan, Yichi Zhang, Xingjun Ma, Hui Xue, Xingxing Wei

Published Jun 2, 2026 | Featured #10 | In the daily list Jun 3, 2026

Daily score: 67.4

Editorial review: 7.5

Relevance: 0.451

Freshness: 0.722

Why It Matters

What makes this one worth your time

As large language models become more prevalent, ensuring their safety and alignment is crucial to prevent misuse and enhance their reliability in real-world applications.

MESA decentralizes safety in MoE architectures to enhance robustness against adversarial threats.

Summary

The paper introduces MESA, a framework designed to enhance the safety alignment of Mixture-of-Experts (MoE) architectures in large language models by decentralizing safety responsibilities and optimizing expert utilization through mechanisms based on Optimal Transport theory.

Key contributions

Development of the MESA framework for safety alignment in MoE architectures.
Introduction of Expert Capacity Reallocation and Dynamic Routing Refinement mechanisms.
Demonstration of robust defensive performance across varied harmful-content benchmarks while preserving model helpfulness.

Notable insights

The concept of Safety Sparsity highlights a critical vulnerability in MoE architectures that has not been extensively addressed in existing literature.
Utilizing Optimal Transport theory for expert capacity reallocation presents a novel approach to optimizing resource allocation in machine learning models.
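
To make the idea of Safety Sparsity concrete, the sketch below measures how much of a model's safety behavior sits in its top few experts. It is only an illustration: the abstract does not define a metric, so the `top_k_share` function, the per-expert scores, and the choice of k are all hypothetical and not taken from the paper.

```python
def top_k_share(scores, k):
    """Fraction of total score mass held by the k highest-scoring experts."""
    if k <= 0 or k > len(scores):
        raise ValueError("k must be between 1 and len(scores)")
    total = sum(scores)
    if total <= 0:
        raise ValueError("scores must have positive total mass")
    return sum(sorted(scores, reverse=True)[:k]) / total


# Hypothetical per-expert safety attribution scores for 8 experts.
# These numbers are made up for illustration, not measured results.
concentrated = [0.70, 0.20, 0.02, 0.02, 0.02, 0.02, 0.01, 0.01]
spread = [0.125] * 8

# A share close to 1 means a few experts carry almost all safety behavior,
# so bypassing or disabling them would remove most of it.
print("concentrated:", top_k_share(concentrated, 2))
print("spread:      ", top_k_share(spread, 2))
```

Under these made-up scores, two of the eight experts hold about 90% of the mass in the concentrated case and exactly 25% in the evenly spread case. Decentralizing safety, in this toy view, means moving from the first profile toward the second.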

Possible limitations

Not stated in the abstract.

Abstract

arXiv:2606.00651v1

Mixture-of-Experts (MoE) architectures scale Large Language Models (LLMs) efficiently, enabling greater capacity with reduced computational cost by dynamically routing inputs to relevant experts, yet introduce a critical vulnerability: Safety Sparsity, where safety capabilities concentrate in a few experts, making them susceptible to adversarial bypassing. Meanwhile, conventional alignment methods uniformly adapt all parameters, ignoring their functional differences and inadvertently degrading performance. To address these challenges, we propose MESA (MoE Safety Alignment), a targeted alignment framework for MoE-based LLMs that strategically decentralizes safety responsibility to maximize coverage while minimizing interference with utility. Based on Optimal Transport (OT) theory, MESA operates through two mechanisms: (1) Expert Capacity Reallocation uses a transport cost matrix to distribute safety duties to the most cost-effective experts, and (2) Dynamic Routing Refinement constrains the router to precisely activate these decentralized modules. Experiments show that MESA achieves robust defensive performance against varied harmful benchmarks while preserving helpfulness. Code is available at https://github.com/lorraine021/MESA.
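
For readers unfamiliar with Optimal Transport, the sketch below shows how a transport cost matrix can drive an allocation. It is not MESA's algorithm, which the abstract does not specify. It uses Sinkhorn iteration, a standard solver for entropy-regularized OT, swapped in purely for illustration. The assumption that rows are safety duties and columns are experts, the cost values, the marginals, and the `sinkhorn` helper are all hypothetical.

```python
import math


def sinkhorn(cost, supply, demand, reg=0.5, iters=500):
    """Entropy-regularized OT plan with row marginals `supply` and column marginals `demand`."""
    n, m = len(supply), len(demand)
    # Gibbs kernel: cheaper cells get exponentially larger weight.
    K = [[math.exp(-cost[i][j] / reg) for j in range(m)] for i in range(n)]
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(iters):
        # Alternately rescale rows and columns to match the two marginals.
        u = [supply[i] / sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [demand[j] / sum(K[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * K[i][j] * v[j] for j in range(m)] for i in range(n)]


# Hypothetical setup: 2 safety duties (rows) spread over 3 experts (columns).
# cost[i][j] stands in for the utility lost if expert j takes on duty i.
cost = [[0.1, 0.5, 0.9],
        [0.9, 0.5, 0.1]]
supply = [0.5, 0.5]            # how much of each duty must be assigned
demand = [1 / 3, 1 / 3, 1 / 3]  # equal share per expert, i.e. decentralized

plan = sinkhorn(cost, supply, demand)
for row in plan:
    print([round(x, 3) for x in row])
```

The column marginals force every expert to carry an equal share, while the cost matrix steers each duty toward the experts where it is cheapest. In this toy setup, duty 0 places more mass on expert 0 than on expert 2, and the reverse holds for duty 1.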