Compose and Fuse: Revisiting the Foundational Bottlenecks in Multimodal Reasoning

Yucheng Wang, Yifan Hou, Aydin Javadov, Mubashara Akhtar, Mrinmaya Sachan

Published May 1, 2026 | Featured #9 | In the daily list Apr 25, 2026

Daily score: 64.4

Editorial review: 6.8

Relevance: 0.505

Freshness: 0.722

Why It Matters

What makes this one worth your time

The paper argues that integration, not perception, is the main barrier to multimodal reasoning, so improving how models combine modalities may matter more than improving how they perceive each one.

The paper identifies and addresses foundational bottlenecks in multimodal reasoning through a novel evaluation framework.

Summary

The paper investigates the challenges in multimodal reasoning by proposing a logic-grounded evaluation framework that categorizes reasoning into six interaction patterns. It identifies two core failures, a task-composition bottleneck and a fusion bottleneck, and suggests composition-aware training and early fusion control as promising directions.

Key contributions

Proposed a logic-grounded evaluation framework for multimodal reasoning.
Identified task-composition and fusion bottlenecks as key failures in multimodal reasoning.
Demonstrated that a two-step prompting approach can restore performance by addressing the task-composition bottleneck.
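
The two-step idea (recognize, then reason) can be sketched as follows. This is a minimal illustration, not the authors' prompts or code: the `Model` callable, the prompt wording, and the choice to pass only the recognized facts (not the raw inputs) to the second step are all assumptions made for the example.

```python
from typing import Callable, Sequence

# Hypothetical model interface (an assumption for this sketch):
# takes a text prompt plus raw multimodal inputs and returns text.
Model = Callable[[str, Sequence[object]], str]

# Hypothetical recognition prompt; the paper's actual wording is not given here.
RECOGNIZE_PROMPT = (
    "List every fact stated in the inputs, one per line. "
    "Do not answer any question yet."
)


def two_step_answer(model: Model, inputs: Sequence[object], question: str) -> str:
    """Separate recognition from reasoning instead of asking for both in one pass."""
    # Step 1: recognition only. The model sees the raw inputs and extracts facts.
    facts = model(RECOGNIZE_PROMPT, inputs)

    # Step 2: reasoning only. The model sees the extracted facts as text
    # (assumption: raw inputs are withheld so the step is purely textual).
    reason_prompt = f"Facts:\n{facts}\n\nUsing only these facts, answer: {question}"
    return model(reason_prompt, [])
```

Any function with the `Model` signature can be passed in, so the same wrapper can be compared against a single-pass baseline on the same questions.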

Notable insights

Additional modalities enhance reasoning only when they provide independent and sufficient reasoning paths.
Softening attention in early fusion improves reasoning by mitigating biased fusion.
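
One common way to soften attention is to divide the attention scores by a temperature greater than 1 before the softmax, which flattens the resulting weights. The abstract does not say how the softening is done, so the temperature mechanism and the function name `soft_attention_weights` below are assumptions used only to illustrate the general idea.

```python
import math
from typing import List


def soft_attention_weights(scores: List[float], temperature: float = 1.0) -> List[float]:
    """Softmax over attention scores with a temperature (assumed mechanism).

    temperature == 1.0 gives the standard softmax; temperature > 1.0 flattens
    the distribution so that no single token dominates as strongly.
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [s / temperature for s in scores]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Illustrative scores: the first token is strongly preferred.
sharp = soft_attention_weights([2.0, 0.0, 0.0])
soft = soft_attention_weights([2.0, 0.0, 0.0], temperature=2.0)
```

In this toy case the largest weight in `soft` is smaller than in `sharp`, so the other tokens receive a larger share of attention.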

Possible limitations

Not stated in the abstract

Abstract

arXiv:2509.23744v4

Multimodal large language models (MLLMs) promise enhanced reasoning by integrating diverse inputs such as text, vision, and audio. Yet cross-modal reasoning remains underexplored, with conflicting reports on whether added modalities help or harm performance. These inconsistencies stem from a lack of controlled evaluation frameworks and analysis of models' internals to isolate when and why modality interactions support or undermine reasoning. We address this gap through a logic-grounded evaluation framework that categorizes multimodal reasoning into six interaction patterns, varying how facts are distributed across modalities and logically combined. Empirically, additional modalities enhance reasoning only when they provide independent and sufficient reasoning paths, while redundant or chained entailment support often hurts performance. Moreover, reasoning degrades in three systematic ways: weaker modalities drag down overall performance, conflicts bias preference toward certain modalities, and joint signals from different modalities fail to be integrated effectively. Therefore, we identify two core failures: task-composition bottleneck, where recognition and reasoning cannot be jointly executed in one pass, and fusion bottleneck, where early integration introduces bias. For further investigation, we find that attention patterns fail to encode fact usefulness, but a simple two-step prompting (recognize then reason) restores performance, confirming the task-composition bottleneck. Moreover, modality identity remains recoverable in early layers, and softening attention in early fusion improves reasoning, highlighting biased fusion as another failure mode. Overall, our findings show that integration, not perception, is the main barrier to multimodal reasoning, suggesting composition-aware training and early fusion control as promising directions.