Beyond Text-Dominance: Understanding Modality Preference of Omni-modal Large Language Models

Xinru Yan, Boxi Cao, Yaojie Lu, Hongyu Lin, Weixiang Zhou, Le Sun, Xianpei Han

Published Apr 30, 2026

Editorial review: 7.0

Relevance: 0.488

Freshness: 0.000

Why It Matters

Knowing which modality an omni-modal large language model (OLLM) favors when its inputs disagree is a step toward more balanced and trustworthy models, and the paper shows that the same signal can help diagnose cross-modal hallucinations.

The paper identifies and quantifies a visual preference in omni-modal large language models, challenging the traditional text-dominance paradigm.

Summary

The paper investigates the modality preference of omni-modal large language models (OLLMs). It introduces a conflict-based benchmark and a modality selection rate metric to quantify this preference, and an evaluation of ten representative OLLMs finds that most show a pronounced visual preference, in contrast to the text-dominance of traditional VLMs. Layer-wise probing shows that the preference emerges progressively in the mid-to-late layers. The authors then use these internal signals to diagnose cross-modal hallucinations, achieving competitive performance on three downstream multi-modal benchmarks without task-specific data.

Key contributions

Introduction of a conflict-based benchmark and modality selection rate metric.
Layer-wise probing to understand the emergence of modality preference.
Application of findings to diagnose cross-modal hallucinations without task-specific data.
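
The abstract names a modality selection rate metric but does not define it. As a rough illustration only, the sketch below assumes that each benchmark item pairs inputs whose modalities imply different answers, and that the rate for a modality is the share of items where the model's prediction matches the answer implied by that modality. The record layout, the field names ("prediction", "answers"), and the "other" bucket are hypothetical and are not taken from the paper or its repository.

```python
from collections import Counter


def modality_selection_rate(records):
    """Share of conflict items resolved in favor of each modality.

    Assumed (hypothetical) record layout:
      {"prediction": <model answer>,
       "answers": {<modality name>: <answer implied by that modality>}}
    """
    counts = Counter()
    total = 0
    for rec in records:
        total += 1
        # Modalities whose implied answer equals the model's prediction.
        matched = [m for m, a in rec["answers"].items() if a == rec["prediction"]]
        # Credit a modality only when the match is unambiguous.
        key = matched[0] if len(matched) == 1 else "other"
        counts[key] += 1
    if total == 0:
        return {}
    return {name: n / total for name, n in counts.items()}


# Toy conflict items: the image shows one thing, the text claims another.
toy_records = [
    {"prediction": "cat", "answers": {"visual": "cat", "text": "dog"}},
    {"prediction": "red", "answers": {"visual": "red", "text": "blue"}},
    {"prediction": "two", "answers": {"visual": "two", "text": "five"}},
    {"prediction": "rain", "answers": {"visual": "sun", "text": "rain"}},
]
toy_rates = modality_selection_rate(toy_records)
```

Under this assumed definition, a model with a visual preference would show a visual rate well above its text rate on the same conflict items.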

Notable insights

Modality preference in OLLMs emerges progressively in mid-to-late layers.
Internal signals from modality preference can be used to diagnose cross-modal hallucinations.
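
The abstract does not describe how the layer-wise probes are built. The sketch below shows the general idea under stated assumptions: for each layer, fit a simple linear probe on hidden states to predict which modality the model ends up following, then compare held-out accuracy across layers. The mean-difference probe, the function names, and the synthetic hidden states (where the signal is made to grow with depth) are all illustrative choices, not the authors' method or data.

```python
import random


def fit_mean_difference_probe(xs, ys):
    """Linear probe: w = mean(class 1) - mean(class 0), boundary at the midpoint.

    Assumes both classes appear in the training data.
    """
    dim = len(xs[0])
    sums = {0: [0.0] * dim, 1: [0.0] * dim}
    counts = {0: 0, 1: 0}
    for x, y in zip(xs, ys):
        counts[y] += 1
        for i, v in enumerate(x):
            sums[y][i] += v
    mu0 = [s / counts[0] for s in sums[0]]
    mu1 = [s / counts[1] for s in sums[1]]
    w = [a - b for a, b in zip(mu1, mu0)]
    midpoint = [(a + b) / 2 for a, b in zip(mu1, mu0)]
    bias = -sum(wi * mi for wi, mi in zip(w, midpoint))
    return w, bias


def probe_accuracy(w, bias, xs, ys):
    """Fraction of examples the probe classifies correctly."""
    correct = 0
    for x, y in zip(xs, ys):
        score = sum(wi * xi for wi, xi in zip(w, x)) + bias
        correct += int((score > 0) == (y == 1))
    return correct / len(xs)


def layerwise_probe(hidden_by_layer, labels, train_fraction=0.5):
    """Fit one probe per layer and return held-out accuracy for each layer."""
    split = int(len(labels) * train_fraction)
    accuracies = []
    for xs in hidden_by_layer:
        w, bias = fit_mean_difference_probe(xs[:split], labels[:split])
        accuracies.append(probe_accuracy(w, bias, xs[split:], labels[split:]))
    return accuracies


# Synthetic stand-in for real hidden states: label 1 = "followed vision",
# label 0 = "followed text"; the class signal is injected more strongly in
# deeper layers purely to illustrate the shape of the analysis.
rng = random.Random(0)
n_layers, n_samples, dim = 6, 200, 8
labels = [i % 2 for i in range(n_samples)]
hidden_by_layer = [
    [
        [rng.gauss(0, 1) + (layer * (2 * y - 1) if d == 0 else 0.0) for d in range(dim)]
        for y in labels
    ]
    for layer in range(n_layers)
]
accs = layerwise_probe(hidden_by_layer, labels)
```

With real models, the hidden states would come from the network's intermediate activations on the conflict benchmark; a rise in probe accuracy toward the later layers would be consistent with the progressive emergence described above.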

Possible limitations

Not stated in the abstract

Abstract

arXiv:2604.16902v3

Native Omni-modal Large Language Models (OLLMs) have shifted from pipeline architectures to unified representation spaces. However, this native integration gives rise to a critical yet underexplored phenomenon: modality preference. To bridge this gap, we first systematically quantify modality preference of OLLMs using a newly-curated conflict-based benchmark and the modality selection rate metric. Our evaluation of ten representative OLLMs reveals a notable paradigm shift: unlike the "text-dominance" of traditional VLMs, most OLLMs exhibit a pronounced visual preference. To further understand the underlying mechanism, we conduct layer-wise probing and demonstrate that such modality preference is not static but emerges progressively in the mid-to-late layers. Building upon these insights, we leverage these internal signals to diagnose cross-modal hallucinations, achieving competitive performance across three downstream multi-modal benchmarks without task-specific data. Our work provides both a mechanistic understanding and a practical tool for building more trustworthy OLLMs. Our code and related resources are publicly available at: https://github.com/icip-cas/OmniPreference