Visual Self-Fulfilling Alignment: Shaping Safety-Oriented Personas via Threat-Related Images

Qishun Yang, Shu Yang, Lijie Hu, Di Wang

Published: Apr 17, 2026
Featured: #4
In the daily list: Apr 18, 2026
Daily score: 65.0
Editorial review: 7.5
Relevance: 0.486
Freshness: 0.722

Why It Matters

What makes this one worth your time

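Existing methods for aligning multimodal models with safety goals require explicit safety labels or contrastive data. This paper proposes a label-free alternative, which the authors report reduces attack success rate and over-refusal while preserving general capabilities.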

VSFA leverages threat-related imagery to improve safety alignment in vision-language models.

Summary

The paper introduces Visual Self-Fulfilling Alignment (VSFA), a method for fine-tuning vision-language models on neutral visual question-answering tasks using threat-related images to enhance safety-oriented responses without requiring explicit safety labels.
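To make the "neutral VQA, no safety labels" idea concrete, here is a minimal sketch of what one training record might look like. The `VQASample` type, the question templates, and the `build_neutral_vqa` helper are all hypothetical illustrations; the abstract does not describe the authors' actual data schema or prompts.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VQASample:
    """One fine-tuning record. Note there is deliberately no safety-label field."""
    image_path: str
    question: str
    answer: str


# Hypothetical neutral templates: they ask about image content only and
# never mention harm, risk, or refusal.
NEUTRAL_TEMPLATES = (
    "What objects are visible in this image?",
    "What is the dominant color in this image?",
    "How many people appear in this image?",
)


def build_neutral_vqa(image_path: str, answers: dict[str, str]) -> list[VQASample]:
    """Pair one (threat-related) image with every neutral question that has an answer."""
    return [
        VQASample(image_path, question, answers[question])
        for question in NEUTRAL_TEMPLATES
        if question in answers
    ]


# Illustrative usage with a made-up file name and answer.
samples = build_neutral_vqa(
    "img_001.jpg",
    {"What objects are visible in this image?": "A knife on a table."},
)
```

The only point of the sketch is that the safety signal, if any, would come from the image content rather than from a label or from the wording of the question.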

Key contributions

  • Introduction of the Visual Self-Fulfilling Alignment (VSFA) method.
  • Demonstration of improved safety outcomes in vision-language models without the need for safety labels.
  • Evaluation of the method across multiple models and safety benchmarks.

Notable insights

  • The approach utilizes threat-related visual content to implicitly teach models about safety, contrasting with traditional methods that rely on explicit labels.
  • The concept of extending the self-fulfilling mechanism from text to visual modalities is a novel perspective in alignment research.

Possible limitations

  • Not stated in the abstract.

Abstract

arXiv:2603.08486v2

Multimodal large language models (MLLMs) face safety misalignment, where visual inputs enable harmful outputs. To address this, existing methods require explicit safety labels or contrastive data; yet, threat-related concepts are concrete and visually depictable, while safety concepts, like helpfulness, are abstract and lack visual referents. Inspired by the Self-Fulfilling mechanism underlying emergent misalignment, we propose Visual Self-Fulfilling Alignment (VSFA). VSFA fine-tunes vision-language models (VLMs) on neutral VQA tasks constructed around threat-related images, without any safety labels. Through repeated exposure to threat-related visual content, models internalize the implicit semantics of vigilance and caution, shaping safety-oriented personas. Experiments across multiple VLMs and safety benchmarks demonstrate that VSFA reduces the attack success rate, improves response quality, and mitigates over-refusal while preserving general capabilities. Our work extends the self-fulfilling mechanism from text to visual modalities, offering a label-free approach to VLMs alignment.
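The abstract reports results in terms of attack success rate and over-refusal. As a rough sketch of how such rates are commonly computed, assuming each response has already been judged (the judging protocol and benchmark details are not given in the abstract, and both function names below are illustrative):

```python
def attack_success_rate(judged_harmful: list[bool]) -> float:
    """Fraction of adversarial prompts whose response was judged harmful."""
    if not judged_harmful:
        raise ValueError("need at least one judged response")
    return sum(judged_harmful) / len(judged_harmful)


def over_refusal_rate(refused_benign: list[bool]) -> float:
    """Fraction of benign prompts that the model refused to answer."""
    if not refused_benign:
        raise ValueError("need at least one judged response")
    return sum(refused_benign) / len(refused_benign)


# Made-up judgments for illustration only, not results from the paper.
asr = attack_success_rate([True, False, False, True])   # 0.5
orr = over_refusal_rate([False, False, True, False])    # 0.25
```

Lower is better for both numbers. A method that only drove down the first by refusing everything would show up as a rise in the second, which is why the two are usually reported together.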