How Far Are We from Generating Missing Modalities with Foundation Models?
Guanzhou Ke, Bo Wang, Guoqing Chao, Weiming Hu, Shengfeng He
Why It Matters
What makes this one worth your time
Understanding and improving missing modality reconstruction can enhance the versatility and applicability of multimodal models in real-world scenarios where incomplete data is common.
The paper proposes a novel framework for improving missing modality reconstruction using multimodal foundation models.
Summary
The paper explores the potential of multimodal foundation models for reconstructing missing modalities, identifies three paradigms for this task, and evaluates 42 model variants. It finds that current models fall short in fine-grained semantic extraction and in validating what they generate, and proposes an agentic framework with a self-refinement mechanism to address both. Compared to baselines, the method reduces FID for missing image reconstruction by at least 14% and MER for missing text reconstruction by at least 10%.
Key contributions
- Identification and formalization of three paradigms for missing modality reconstruction.
- Comprehensive evaluation of 42 model variants in terms of reconstruction accuracy and adaptability to downstream tasks.
- Proposal of an agentic framework and self-refinement mechanism to improve reconstruction quality.
Notable insights
- The agentic framework dynamically formulates modality-aware mining strategies based on input context.
- A self-refinement mechanism iteratively verifies and enhances generated modalities through internal feedback.
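The self-refinement idea in the last bullet can be illustrated with a generic generate, verify, and retry loop. This is a minimal sketch and not the authors' implementation: the names `self_refine`, `toy_generate`, and `toy_verify`, the score threshold, and the round limit are all hypothetical, and the toy generator and verifier are string-based stand-ins for a foundation model and its internal feedback.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple


@dataclass
class RefinementResult:
    output: str    # best candidate seen across all rounds
    score: float   # verifier score of that candidate
    rounds: int    # number of generate/verify rounds actually run


def self_refine(
    context: Sequence[str],
    generate: Callable[[Sequence[str], List[str]], str],
    verify: Callable[[Sequence[str], str], Tuple[float, str]],
    threshold: float = 0.9,   # hypothetical acceptance score
    max_rounds: int = 3,      # hypothetical iteration budget
) -> RefinementResult:
    """Generate a candidate, score it, and feed the critique back until accepted."""
    if max_rounds < 1:
        raise ValueError("max_rounds must be at least 1")
    feedback_history: List[str] = []
    best_output, best_score = "", float("-inf")
    rounds = 0
    for rounds in range(1, max_rounds + 1):
        candidate = generate(context, feedback_history)
        score, feedback = verify(context, candidate)
        if score > best_score:
            best_output, best_score = candidate, score
        if score >= threshold:
            break  # verifier accepts the candidate
        feedback_history.append(feedback)
    return RefinementResult(best_output, best_score, rounds)


# Toy stand-ins (assumptions for illustration only).
def toy_generate(context: Sequence[str], feedback: List[str]) -> str:
    # Start from the first keyword and add every keyword the verifier flagged.
    return " ".join([context[0]] + feedback)


def toy_verify(context: Sequence[str], candidate: str) -> Tuple[float, str]:
    # Score is the fraction of context keywords present in the candidate;
    # feedback names the first missing keyword, or is empty if none is missing.
    present = set(candidate.split())
    missing = [word for word in context if word not in present]
    score = 1 - len(missing) / len(context)
    return score, (missing[0] if missing else "")


result = self_refine(["dog", "beach", "sunset"], toy_generate, toy_verify)
print(result.output, result.score, result.rounds)
```

Keeping the best-scoring candidate, rather than the last one, means a later round that regresses cannot make the final output worse; whether the paper's mechanism does the same is not stated in the abstract.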
Possible limitations
- Not stated in the abstract
Abstract
arXiv:2506.03530v3
Multimodal foundation models have demonstrated impressive capabilities across diverse tasks. However, their potential as plug-and-play solutions for missing modality reconstruction remains underexplored. To bridge this gap, we identify and formalize three potential paradigms for missing modality reconstruction, and perform a comprehensive evaluation across these paradigms, covering 42 model variants in terms of reconstruction accuracy and adaptability to downstream tasks. Our analysis reveals that current foundation models often fall short in two critical aspects: (i) fine-grained semantic extraction from the available modalities, and (ii) robust validation of generated modalities. These limitations lead to suboptimal and, at times, misaligned generations. To address these challenges, we propose an agentic framework tailored for missing modality reconstruction. This framework dynamically formulates modality-aware mining strategies based on the input context, facilitating the extraction of richer and more discriminative semantic features. In addition, we introduce a self-refinement mechanism, which iteratively verifies and enhances the quality of generated modalities through internal feedback. Experimental results show that our method reduces FID for missing image reconstruction by at least 14% and MER for missing text reconstruction by at least 10% compared to baselines. Code is released at: https://github.com/Guanzhou-Ke/AFM2.
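For context on the image metric, FID (Fréchet Inception Distance) is conventionally computed as the Fréchet distance between two Gaussians fitted to Inception-network features of real and generated images. The formula below is the standard definition, not something taken from this paper. The symbols are assumptions introduced here: \mu_r, \Sigma_r are the feature mean and covariance for real images, and \mu_g, \Sigma_g the same for generated images. Lower values indicate closer distributions.

```latex
\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2
  + \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g - 2\left( \Sigma_r \Sigma_g \right)^{1/2} \right)
```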