Multimodal Vision Language Alignment Safety

Visual Self-Fulfilling Alignment: Shaping Safety-Oriented Personas via Threat-Related Images

Qishun Yang, Shu Yang, Lijie Hu, Di Wang

Published Apr 17, 2026 · Featured #4 · In the daily list Apr 18, 2026

Daily score: 65.0

Editorial review: 7.5

Relevance: 0.486

Freshness: 0.722

Why It Matters

What makes this one worth your time

Existing safety alignment for multimodal models depends on explicit safety labels or contrastive data. This work proposes a label-free alternative, which could make vision-language models safer to deploy in sensitive settings.

VSFA leverages threat-related imagery to improve safety alignment in vision-language models.

Summary

The paper introduces Visual Self-Fulfilling Alignment (VSFA), a method that fine-tunes vision-language models on neutral visual question-answering (VQA) tasks built around threat-related images. The goal is to strengthen safety-oriented responses without requiring explicit safety labels.
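To make the data setup concrete, here is a minimal sketch of what a label-free training set of this kind might look like. It is an illustration only: the question templates, the `VQAExample` record, and the `build_examples` helper are all hypothetical, since the abstract does not describe the authors' prompts or data format. The one property it is meant to show is that each record pairs a threat-related image with a neutral question and a descriptive answer, and carries no safety label or refusal target.

```python
from dataclasses import dataclass

# Hypothetical neutral question templates (assumption: the paper's actual
# prompts are not given in the abstract).
NEUTRAL_QUESTIONS = (
    "What objects are visible in this image?",
    "What colors dominate the scene?",
    "How many people appear in the image?",
)


@dataclass(frozen=True)
class VQAExample:
    """One fine-tuning record. Note the absence of any safety-label field."""
    image_path: str
    question: str
    answer: str  # plain descriptive answer, not a refusal or a safety tag


def build_examples(annotations):
    """Build neutral VQA records from per-image annotations.

    annotations: dict mapping image path -> dict mapping question -> answer.
    Only questions from NEUTRAL_QUESTIONS are kept.
    """
    examples = []
    for image_path, qa in annotations.items():
        for question in NEUTRAL_QUESTIONS:
            if question in qa:
                examples.append(VQAExample(image_path, question, qa[question]))
    return examples
```

The resulting records could then be passed to whatever supervised fine-tuning pipeline a given VLM uses; that step is model-specific and is not sketched here.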

Key contributions

Introduction of the Visual Self-Fulfilling Alignment (VSFA) method.
Demonstration of improved safety outcomes in vision-language models without the need for safety labels.
Evaluation of the method across multiple models and safety benchmarks.

Notable insights

The approach utilizes threat-related visual content to implicitly teach models about safety, contrasting with traditional methods that rely on explicit labels.
The concept of extending the self-fulfilling mechanism from text to visual modalities is a novel perspective in alignment research.

Possible limitations

Not stated in the abstract.

Abstract

arXiv:2603.08486v2

Multimodal large language models (MLLMs) face safety misalignment, where visual inputs enable harmful outputs. To address this, existing methods require explicit safety labels or contrastive data; yet, threat-related concepts are concrete and visually depictable, while safety concepts, like helpfulness, are abstract and lack visual referents. Inspired by the Self-Fulfilling mechanism underlying emergent misalignment, we propose Visual Self-Fulfilling Alignment (VSFA). VSFA fine-tunes vision-language models (VLMs) on neutral VQA tasks constructed around threat-related images, without any safety labels. Through repeated exposure to threat-related visual content, models internalize the implicit semantics of vigilance and caution, shaping safety-oriented personas. Experiments across multiple VLMs and safety benchmarks demonstrate that VSFA reduces the attack success rate, improves response quality, and mitigates over-refusal while preserving general capabilities. Our work extends the self-fulfilling mechanism from text to visual modalities, offering a label-free approach to VLMs alignment.
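The abstract reports results in terms of attack success rate (ASR). As commonly used in safety evaluation, ASR is the fraction of adversarial prompts that elicit a harmful response. The sketch below computes it from per-prompt judgments. How each response is judged harmful (human raters, a classifier, an LLM judge) is not stated in the abstract, so the boolean inputs here are an assumption, and `attack_success_rate` is an illustrative helper name rather than anything from the paper.

```python
def attack_success_rate(judgments):
    """Return the fraction of attacks judged successful.

    judgments: iterable of bools, True when the model produced harmful
    output for that adversarial prompt (the judging procedure is assumed
    to happen upstream).
    """
    judgments = list(judgments)
    if not judgments:
        raise ValueError("need at least one judgment to compute a rate")
    return sum(judgments) / len(judgments)
```

For example, if 2 of 4 adversarial prompts succeed, `attack_success_rate([True, False, False, True])` returns `0.5`. A lower value after fine-tuning is what "reduces the attack success rate" refers to.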