Towards Understanding Modality Interaction in Multimodal Language Models via Partial Information Decomposition
Wanlong Fang, Tianle Zhang, Wen Tao, Alvin Chan
Why It Matters
Knowing how a multimodal language model actually combines its inputs informs both model design and deployment; the abstract frames this understanding as central to reliable deployment.
The paper applies Partial Information Decomposition (PID) to separate the unique, redundant, and synergistic contributions of sensory and linguistic inputs to a model's decisions.
Summary
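The paper introduces Partial Information Decomposition (PID) as a decision-level framework for analyzing modality interaction in multimodal large language models (MLLMs), separating the unique, redundant, and synergistic contributions of sensory and linguistic inputs. Across vision-language benchmarks it finds recurring modality-use profiles: reasoning and grounding-oriented tasks tend to show high synergy, while expert and knowledge-oriented tasks rely more on language-unique information. It extends PID to tri-modal systems with Sensory PID, which reveals a sensory synergy bottleneck dominated by visual information in omni-modal models. The study also reports initial evidence that PID-guided reweighting improves multimodal reasoning and grounding performance.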
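For readers new to PID, the standard two-source bookkeeping can be written as below. The symbols are illustrative assumptions, not the paper's notation: $X_S$ is the sensory input, $X_L$ the linguistic input, $Y$ the model's decision, and $R$, $U_S$, $U_L$, $S$ the redundant, sensory-unique, language-unique, and synergistic terms.

```latex
\begin{align}
I(X_S, X_L; Y) &= R + U_S + U_L + S, \\
I(X_S; Y)      &= R + U_S, \\
I(X_L; Y)      &= R + U_L.
\end{align}
```

These three constraints leave four unknowns, so any PID must additionally define one term (commonly the redundancy). The abstract does not say which definition or estimator the paper uses.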
Key contributions
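- PID as a decision-level framework for analyzing modality interaction in MLLMs, beyond representation alignment and outcome-based evaluation.
- Sensory PID, an extension to tri-modal systems that treats language as a control variable to decompose video-audio information gain.
- PID-guided reweighting, with initial evidence of improved multimodal reasoning and grounding performance.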
Notable insights
- PID reveals recurring modality-use profiles that generalize across model families and predict sensitivity to modality-level interventions.
- Sensory PID identifies a sensory synergy bottleneck dominated by visual information in omni-modal models, even on audio-visual fusion tasks.
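One way to read "treating language as a control variable" is as a decomposition of the information that video $V$ and audio $A$ carry about the decision $Y$ once the language input $L$ is given. This is an interpretation of the abstract's wording, and the notation is assumed rather than taken from the paper:

```latex
\begin{align}
I(V, A; Y \mid L) &= R_{VA} + U_V + U_A + S_{VA}, \\
I(V; Y \mid L)    &= R_{VA} + U_V, \\
I(A; Y \mid L)    &= R_{VA} + U_A.
\end{align}
```

Under this reading, the reported bottleneck would plausibly appear as a small synergy term $S_{VA}$ next to a comparatively large visual term $U_V$, though the abstract does not give the exact formulation.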
Possible limitations
- Not stated in the abstract
Abstract
arXiv:2606.00959v1
Understanding modality interaction in multimodal large language models (MLLMs) is central to reliable deployment. We introduce Partial Information Decomposition (PID) as a decision-level framework that separates unique, redundant, and synergistic contributions of sensory and linguistic inputs, beyond representation alignment and outcome-based evaluation. Across vision-language benchmarks, PID reveals recurring modality-use profiles: reasoning and grounding-oriented tasks tend to exhibit high synergy, whereas expert and knowledge-oriented tasks show stronger language-unique reliance. These profiles generalize across model families and predict sensitivity to modality-level interventions. We further extend PID to tri-modal systems with Sensory PID, treating language as a control variable to decompose video-audio information gain. Applied to omni-modal models, Sensory PID reveals a sensory synergy bottleneck dominated by visual information even on audio-visual fusion tasks. Finally, PID-guided reweighting provides initial evidence for improving multimodal reasoning and grounding performance.
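To make the unique / redundant / synergistic split concrete, here is a minimal sketch on a toy discrete distribution. It assumes the classic Williams-Beer redundancy measure (the minimum specific information, often written I_min), which is not necessarily the estimator used in the paper; the function names and the `(x1, x2, y)` layout of the joint distribution are hypothetical choices for illustration.

```python
from collections import defaultdict
from math import log2

Y_IDX = 2  # assumed layout: each outcome is a tuple (x1, x2, y)

def marginal(joint, idx):
    """Marginalise a joint {(x1, x2, y): p} onto the positions in idx."""
    out = defaultdict(float)
    for outcome, p in joint.items():
        out[tuple(outcome[i] for i in idx)] += p
    return out

def mutual_info(joint, src_idx):
    """I(X_src; Y) in bits, where src_idx selects the source variables."""
    p_sy = marginal(joint, src_idx + (Y_IDX,))
    p_s = marginal(joint, src_idx)
    p_y = marginal(joint, (Y_IDX,))
    return sum(
        p * log2(p / (p_s[k[:-1]] * p_y[k[-1:]]))
        for k, p in p_sy.items() if p > 0
    )

def specific_info(joint, src_idx, y):
    """Specific information that a source carries about the outcome Y = y."""
    p_sy = marginal(joint, src_idx + (Y_IDX,))
    p_s = marginal(joint, src_idx)
    p_y = marginal(joint, (Y_IDX,))[(y,)]
    total = 0.0
    for k, p in p_sy.items():
        if k[-1] == y and p > 0:
            # p(s | y) * log2( p(y | s) / p(y) )
            total += (p / p_y) * log2((p / p_s[k[:-1]]) / p_y)
    return total

def pid_imin(joint):
    """Two-source PID using the Williams-Beer I_min redundancy (an assumption)."""
    p_y = marginal(joint, (Y_IDX,))
    redundant = sum(
        p * min(specific_info(joint, (0,), y), specific_info(joint, (1,), y))
        for (y,), p in p_y.items() if p > 0
    )
    i1 = mutual_info(joint, (0,))
    i2 = mutual_info(joint, (1,))
    i12 = mutual_info(joint, (0, 1))
    return {
        "redundant": redundant,
        "unique_1": i1 - redundant,
        "unique_2": i2 - redundant,
        "synergy": i12 - i1 - i2 + redundant,
    }

# Toy case: Y = X1 XOR X2 with independent fair bits.
xor = {(0, 0, 0): 0.25, (0, 1, 1): 0.25, (1, 0, 1): 0.25, (1, 1, 0): 0.25}
print(pid_imin(xor))
```

On the XOR distribution neither input alone says anything about `y`, so the full bit of information lands in the synergy term. Swapping in a distribution where both inputs are copies of `y` moves that bit into the redundant term instead. Applying this to a real MLLM would additionally require estimating the joint distribution over inputs and decisions, which this sketch does not attempt.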