Disentangling Intent from Role: Adversarial Self-Play for Persona-Invariant Safety Alignment
Jiajia Li, Xiaoyu Wen, Zhongtian Ma, Shuyue Hu, Qiaosheng Zhang, Zhen Wang
Why It Matters
What makes this one worth your time
LLMs are increasingly deployed in high-risk scenarios yet remain vulnerable to persona-based jailbreak attacks. Prior work on these attacks has focused on the attack side, and this paper targets the missing defense-side constraints.
The paper proposes a novel adversarial self-play framework to enhance safety alignment against persona-based jailbreak attacks.
Summary
The paper introduces Persona-Invariant Alignment (PIA), an adversarial self-play framework for defending large language models against persona-based jailbreak attacks. Persona Lineage Evolution (PLE) explores attacks and Persona-Invariant Consistency Learning (PICL) trains the defense, with the two sides co-evolving. PICL aims to decouple safety decisions from persona context.
Key contributions
- Introduction of the Persona-Invariant Alignment (PIA) framework.
- Development of Persona Lineage Evolution (PLE) for attack exploration.
- Implementation of Persona-Invariant Consistency Learning (PICL) for defense.
Notable insights
- PICL is grounded in the structural separation hypothesis and uses a unilateral KL-divergence constraint to structurally decouple safety decisions from persona context.
- PLE uses lineage-based credit propagation to explore high-risk persona spaces efficiently.
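The abstract does not define the unilateral KL-divergence constraint, so the following is only a minimal sketch of one plausible reading: a one-directional penalty KL(p_neutral || p_persona), where the model's output distribution on a persona-free prompt acts as a fixed reference and only the persona-conditioned distribution is pulled toward it. The function names, the two-class "refuse / comply" toy distribution, and the choice of direction are all assumptions made for illustration, not the authors' formulation.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )

def unilateral_kl_penalty(neutral_logits, persona_logits):
    """Hypothetical one-sided consistency penalty.

    ASSUMPTION: the persona-free (neutral) distribution is the fixed
    reference. In a real training loop it would be detached from the
    gradient so that only the persona-conditioned side is updated.
    """
    p_ref = softmax(neutral_logits)   # reference, treated as a constant
    q = softmax(persona_logits)       # distribution under the persona prompt
    return kl_divergence(p_ref, q)

if __name__ == "__main__":
    # Toy logits over [refuse, comply] for the same harmful request.
    neutral = [2.0, 0.0]        # model refuses without a persona
    with_persona = [0.0, 2.0]   # persona wrapper flips the decision
    print(unilateral_kl_penalty(neutral, neutral))
    print(unilateral_kl_penalty(neutral, with_persona))
```

Under this reading the penalty is zero when the persona leaves the safety decision unchanged and grows as the persona shifts it. Because KL divergence is asymmetric, choosing the neutral distribution as the reference is a design decision in its own right, and the paper may make a different one.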
Possible limitations
- Not stated in the abstract
Abstract
arXiv:2605.01899v1
The growing capabilities of large language models (LLMs) have driven their widespread deployment across diverse domains, even in potentially high-risk scenarios. Despite advances in safety alignment techniques, current models remain vulnerable to emerging persona-based jailbreak attacks. Existing research on persona-based jailbreak has primarily focused on attack iterations, yet it lacks systemic and mechanistic constraints on the defense side. To address this challenge, we propose Persona-Invariant Alignment (PIA), an adversarial self-play framework that achieves co-evolution through Persona Lineage Evolution (PLE) on the attack side and Persona-Invariant Consistency Learning (PICL) on the defense side. Theoretically, PICL is grounded in the structural separation hypothesis, using a unilateral KL-divergence constraint to enable the structural decoupling of safety decisions from persona context, thereby maintaining safe behavior under persona-based jailbreak attacks. Experimental results demonstrate that PLE efficiently explores high-risk persona spaces by leveraging lineage-based credit propagation. Meanwhile, the PICL defense method significantly reduces the Attack Success Rate (ASR) while preserving the model's general capability, thereby validating the superiority and robustness of this alignment paradigm. Codes are available at https://github.com/JiajiaLi-1130/PIA.
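The abstract names lineage-based credit propagation without describing its mechanics. As a rough illustration of the general idea only, the sketch below assumes that personas form a parent-child tree (each new persona is derived from an earlier one) and that a successful attack by a descendant passes a geometrically decayed share of its reward back to every ancestor. The `PersonaNode` class, the `propagate_credit` function, and the `decay` parameter are hypothetical names introduced here and do not come from the paper or its repository.

```python
from dataclasses import dataclass
from typing import Dict, Optional

@dataclass
class PersonaNode:
    """One persona in a hypothetical lineage tree."""
    name: str
    parent: Optional[str] = None   # name of the persona this one was derived from
    credit: float = 0.0            # accumulated attack-success credit

def propagate_credit(
    nodes: Dict[str, PersonaNode],
    leaf: str,
    reward: float,
    decay: float = 0.5,
) -> None:
    """Credit the leaf with `reward` and each ancestor with a decayed share.

    ASSUMPTION: geometric decay per generation. The paper's actual
    propagation rule is not given in the abstract.
    """
    current: Optional[str] = leaf
    share = reward
    while current is not None:
        node = nodes[current]
        node.credit += share
        share *= decay
        current = node.parent

if __name__ == "__main__":
    lineage = {
        "root": PersonaNode("root"),
        "child": PersonaNode("child", parent="root"),
        "grandchild": PersonaNode("grandchild", parent="child"),
    }
    # Suppose the grandchild persona produced a successful jailbreak.
    propagate_credit(lineage, "grandchild", reward=1.0)
    for node in lineage.values():
        print(node.name, node.credit)
```

A search procedure could then sample parents for the next round of persona mutation in proportion to accumulated credit, so that lineages which have produced successful attacks are explored more often. That selection step is likewise an assumption about how such credit might be used.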