Visual Self-Fulfilling Alignment: Shaping Safety-Oriented Personas via Threat-Related Images
Qishun Yang, Shu Yang, Lijie Hu, Di Wang
Why It Matters
What makes this one worth your time
Existing approaches to multimodal safety alignment require explicit safety labels or contrastive data. This research offers a label-free alternative, which could make vision-language models safer to deploy in sensitive settings.
VSFA leverages threat-related imagery to improve safety alignment in vision-language models.
Summary
The paper introduces Visual Self-Fulfilling Alignment (VSFA), a method that fine-tunes vision-language models on neutral visual question-answering (VQA) tasks built around threat-related images. The goal is to shape safety-oriented behavior without requiring explicit safety labels.
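To make "neutral VQA on threat-related images, without safety labels" more concrete, here is a minimal sketch of what one training record might look like. Everything in it is an assumption for illustration: the `VQASample` type, the `build_neutral_vqa` helper, the question templates, the image path, and the answers are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass

# Hypothetical neutral question templates: they ask about visual content only
# and never mention safety, harm, or refusal.
NEUTRAL_QUESTIONS = [
    "What objects are visible in this image?",
    "Describe the setting of this image.",
    "What colors dominate this image?",
]


@dataclass(frozen=True)
class VQASample:
    image_path: str  # path to a threat-related image (hypothetical)
    question: str    # neutral question about the image
    answer: str      # plain descriptive answer; no safety label is attached


def build_neutral_vqa(image_path, answers):
    """Pair one image with the neutral questions and their descriptive answers."""
    if len(answers) != len(NEUTRAL_QUESTIONS):
        raise ValueError("expected one answer per question template")
    return [VQASample(image_path, q, a) for q, a in zip(NEUTRAL_QUESTIONS, answers)]


# Illustrative record only; the path and answers are made up.
samples = build_neutral_vqa(
    "images/hypothetical_threat_scene.jpg",
    [
        "A broken window and scattered glass.",
        "An empty street at night.",
        "Dark blue and gray.",
    ],
)
```

The point of the sketch is the absence of any safety field: the only supervision is an ordinary descriptive answer, which is what "label-free" means in the summary above.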
Key contributions
- Introduction of the Visual Self-Fulfilling Alignment (VSFA) method.
- Demonstration of a lower attack success rate, better response quality, and less over-refusal in vision-language models, without safety labels and while preserving general capabilities.
- Evaluation of the method across multiple models and safety benchmarks.
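The evaluation above is reported in terms of attack success rate and over-refusal. As a rough sketch, both can be computed as simple fractions over per-response judgments. The function names and the toy judgments below are assumptions for illustration; the paper's actual judging protocol and benchmarks are not described in the abstract.

```python
def rate(flags):
    """Return the fraction of True values in a non-empty list of booleans."""
    if not flags:
        raise ValueError("need at least one judged response")
    return sum(flags) / len(flags)


def attack_success_rate(harmful_flags):
    """Fraction of adversarial prompts whose response was judged harmful."""
    return rate(harmful_flags)


def over_refusal_rate(refused_flags):
    """Fraction of benign prompts that the model refused to answer."""
    return rate(refused_flags)


# Toy judgments, made up for illustration; these are not results from the paper.
toy_harmful = [True, True, False, True]      # 3 of 4 attacks judged successful
toy_refused = [False, False, True, False]    # 1 of 4 benign prompts refused

asr = attack_success_rate(toy_harmful)
orr = over_refusal_rate(toy_refused)
```

Lower is better for both quantities, which is why reducing attack success while also mitigating over-refusal is a meaningful combination: a model that refuses everything would score well on the first and badly on the second.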
Notable insights
- The approach utilizes threat-related visual content to implicitly teach models about safety, contrasting with traditional methods that rely on explicit labels.
- The concept of extending the self-fulfilling mechanism from text to visual modalities is a novel perspective in alignment research.
Possible limitations
- Not stated in the abstract.
Abstract
arXiv:2603.08486v2
Multimodal large language models (MLLMs) face safety misalignment, where visual inputs enable harmful outputs. To address this, existing methods require explicit safety labels or contrastive data; yet, threat-related concepts are concrete and visually depictable, while safety concepts, like helpfulness, are abstract and lack visual referents. Inspired by the Self-Fulfilling mechanism underlying emergent misalignment, we propose Visual Self-Fulfilling Alignment (VSFA). VSFA fine-tunes vision-language models (VLMs) on neutral VQA tasks constructed around threat-related images, without any safety labels. Through repeated exposure to threat-related visual content, models internalize the implicit semantics of vigilance and caution, shaping safety-oriented personas. Experiments across multiple VLMs and safety benchmarks demonstrate that VSFA reduces the attack success rate, improves response quality, and mitigates over-refusal while preserving general capabilities. Our work extends the self-fulfilling mechanism from text to visual modalities, offering a label-free approach to VLMs alignment.