Beyond Text-Dominance: Understanding Modality Preference of Omni-modal Large Language Models
Xinru Yan, Boxi Cao, Yaojie Lu, Hongyu Lin, Weixiang Zhou, Le Sun, Xianpei Han
Why It Matters
What makes this one worth your time
Understanding which modality an OLLM favors when its inputs disagree can guide the design of more balanced and trustworthy models.
The paper identifies and quantifies a pronounced visual preference in most of the ten omni-modal large language models it evaluates, in contrast to the text-dominance of traditional VLMs.
Summary
The paper investigates the modality preference of native omni-modal large language models (OLLMs). It introduces a conflict-based benchmark and a modality selection rate metric to quantify this preference, and evaluates ten representative OLLMs. Unlike traditional VLMs, which are text-dominant, most of these OLLMs show a pronounced visual preference. Layer-wise probing indicates that the preference is not static but emerges progressively in the mid-to-late layers. The authors then use these internal signals to diagnose cross-modal hallucinations, reporting competitive performance on three downstream multi-modal benchmarks without task-specific data.
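To make the idea of a modality selection rate concrete, here is a minimal Python sketch. The abstract does not give the metric's formula, so this assumes one plausible reading: on samples where each modality supports a different answer, count how often the model's answer matches each modality. The function name `modality_selection_rate`, the record layout, and the "other" bucket are all hypothetical and may differ from the authors' definition.

```python
from collections import Counter


def modality_selection_rate(records):
    """Fraction of conflict samples resolved in favour of each modality.

    Assumed (hypothetical) record layout:
      "answers":    {modality_name: answer supported by that modality}
      "prediction": the model's final answer
    A prediction that matches no modality, or more than one, counts as "other".
    """
    records = list(records)
    if not records:
        raise ValueError("need at least one conflict sample")

    counts = Counter()
    modalities = set()
    for rec in records:
        modalities.update(rec["answers"])
        matched = [m for m, ans in rec["answers"].items() if ans == rec["prediction"]]
        counts[matched[0] if len(matched) == 1 else "other"] += 1

    total = len(records)
    rates = {m: counts[m] / total for m in sorted(modalities)}
    rates["other"] = counts["other"] / total
    return rates


# Toy conflict samples (made up for illustration, not from the benchmark).
samples = [
    {"answers": {"text": "cat", "vision": "dog", "audio": "bird"}, "prediction": "dog"},
    {"answers": {"text": "red", "vision": "blue", "audio": "green"}, "prediction": "blue"},
    {"answers": {"text": "two", "vision": "three", "audio": "four"}, "prediction": "two"},
    {"answers": {"text": "up", "vision": "down", "audio": "left"}, "prediction": "right"},
]
rates = modality_selection_rate(samples)
print(rates)
```

Under this reading, a model with no preference would spread its selections roughly evenly across modalities, while a visually biased model would show a "vision" rate well above the others.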
Key contributions
- Introduction of a conflict-based benchmark and modality selection rate metric.
- Layer-wise probing to understand the emergence of modality preference.
- Application of findings to diagnose cross-modal hallucinations without task-specific data.
Notable insights
- Modality preference in OLLMs emerges progressively in mid-to-late layers.
- Internal signals from modality preference can be used to diagnose cross-modal hallucinations.
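The layer-wise probing idea can be sketched as follows. This is a toy illustration on synthetic data, not the authors' setup: the "hidden states" are random vectors whose class separation is made to grow with layer depth by construction, and a nearest-centroid classifier stands in for whatever probe the paper actually trains. All names (`make_layer_data`, `probe_accuracy`, `layer_accuracy`) are hypothetical. It shows only the procedure: fit one simple probe per layer and compare held-out accuracy across layers.

```python
import random


def make_layer_data(strength, n, dim, rng):
    """Synthetic 'hidden states': label 1 is centred at +strength, label 0 at -strength."""
    data = []
    for i in range(n):
        label = i % 2
        sign = 1.0 if label else -1.0
        vec = [rng.gauss(sign * strength, 1.0) for _ in range(dim)]
        data.append((vec, label))
    return data


def centroid(vectors):
    """Coordinate-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((p - q) ** 2 for p, q in zip(a, b))


def probe_accuracy(train, test):
    """Fit a nearest-centroid probe on train and return its accuracy on test."""
    c0 = centroid([v for v, y in train if y == 0])
    c1 = centroid([v for v, y in train if y == 1])
    correct = 0
    for vec, label in test:
        predicted = 1 if sq_dist(vec, c1) < sq_dist(vec, c0) else 0
        correct += predicted == label
    return correct / len(test)


rng = random.Random(0)
num_layers, dim = 6, 4
layer_accuracy = []
for layer in range(num_layers):
    # Assumption baked into the toy data: the signal grows linearly with depth.
    strength = 2.0 * layer / (num_layers - 1)
    train = make_layer_data(strength, 200, dim, rng)
    test = make_layer_data(strength, 200, dim, rng)
    layer_accuracy.append(probe_accuracy(train, test))

for layer, acc in enumerate(layer_accuracy):
    print(f"layer {layer}: probe accuracy {acc:.2f}")
```

With a real model, the synthetic vectors would be replaced by hidden states extracted at each layer for conflict samples, labelled by the modality the model ultimately follows. A rise in probe accuracy at some depth would then suggest where the preference becomes linearly readable.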
Possible limitations
- Not stated in the abstract
Abstract
arXiv:2604.16902v3
Native Omni-modal Large Language Models (OLLMs) have shifted from pipeline architectures to unified representation spaces. However, this native integration gives rise to a critical yet underexplored phenomenon: modality preference. To bridge this gap, we first systematically quantify modality preference of OLLMs using a newly-curated conflict-based benchmark and the modality selection rate metric. Our evaluation of ten representative OLLMs reveals a notable paradigm shift: unlike the "text-dominance" of traditional VLMs, most OLLMs exhibit a pronounced visual preference. To further understand the underlying mechanism, we conduct layer-wise probing and demonstrate that such modality preference is not static but emerges progressively in the mid-to-late layers. Building upon these insights, we leverage these internal signals to diagnose cross-modal hallucinations, achieving competitive performance across three downstream multi-modal benchmarks without task-specific data. Our work provides both a mechanistic understanding and a practical tool for building more trustworthy OLLMs. Our code and related resources are publicly available at: https://github.com/icip-cas/OmniPreference