CL-DMDF: Dynamic Multimodal Data Fusion Model Based on Contrastive Learning

Dong Li, Lingling Zhang, Binghao Han, Linlin Ding, Yue Kou

Published Jun 3, 2026

Editorial review: 6.8

Relevance: 0.475

Freshness: 0.000

Why It Matters

What makes this one worth your time

The approach addresses the challenge of missing or uncertain modality inputs in real-world multimodal data fusion, potentially improving decision-making processes in complex environments.

CL-DMDF enhances multimodal data fusion with contrastive learning and dynamic attention mechanisms.

Summary

The paper proposes CL-DMDF, a Dynamic Multimodal Data Fusion model based on contrastive learning, to handle uncertain or missing modality inputs in multimodal data fusion tasks. It introduces an attention mechanism that scores importance across both feature and modality dimensions, an entity-centroid contrastive learning module that builds centroid-based positive samples to make the learned representations more discriminative, and an adaptive fusion module for dynamic fusion strategies.

Key contributions

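An attention mechanism that operates across both feature and modality dimensions to compute attention scores.
An entity-centroid contrastive learning module that constructs centroid-based positive samples from entity features.
An adaptive fusion module intended to improve the efficiency and accuracy of dynamic fusion strategies.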
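
To make the idea of fusing under missing modalities concrete, here is a minimal sketch of one common pattern: a softmax over per-modality scores that assigns zero weight to absent modalities. This is an illustration only, not the paper's adaptive fusion module, whose details are not given in the abstract. The names masked_softmax and fuse, the scores, and the presence flags are all hypothetical.

```python
import math

def masked_softmax(scores, present):
    """Softmax over modalities that ignores those flagged as missing."""
    if not any(present):
        raise ValueError("at least one modality must be present")
    # Subtract the max over present entries for numerical stability.
    peak = max(s for s, p in zip(scores, present) if p)
    exps = [math.exp(s - peak) if p else 0.0 for s, p in zip(scores, present)]
    total = sum(exps)
    return [e / total for e in exps]

def fuse(embeddings, scores, present):
    """Weighted sum of per-modality embeddings; missing ones get zero weight."""
    weights = masked_softmax(scores, present)
    dim = len(embeddings[0])
    return [sum(w * emb[d] for w, emb in zip(weights, embeddings))
            for d in range(dim)]

# Hypothetical example: three modalities, the second one is missing.
# Its placeholder embedding and high score must not influence the result.
embeddings = [[1.0, 0.0], [9.0, 9.0], [0.0, 1.0]]
scores = [0.0, 5.0, 0.0]
present = [True, False, True]
fused = fuse(embeddings, scores, present)  # [0.5, 0.5]
```

In a trained model the scores would come from a learned scorer rather than being fixed by hand; the mask is what keeps a missing input from leaking into the fused representation.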

Notable insights

The use of an entity-centroid contrastive learning module to construct centroid-based positive samples is a clever way to enhance discriminative learning.
The novel attention mechanism that operates across both feature and modality dimensions is a non-obvious approach to compute reliable attention scores.
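
The centroid-based positive sampling mentioned above can be sketched as an InfoNCE-style loss in which each modality feature of an entity is pulled toward that entity's centroid and pushed away from the centroids of other entities. This is an assumption about how such a module might look, not the authors' formulation: the function names, the temperature value, and the choice of other entities' centroids as negatives are all hypothetical.

```python
import math

def normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def centroid(vectors):
    """Element-wise mean of an entity's feature vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def centroid_contrastive_loss(entities, temperature=0.1):
    """entities[i] is the list of modality feature vectors for entity i.
    Positive for each feature: its own entity's centroid.
    Negatives (assumed here): the centroids of all other entities."""
    centroids = [normalize(centroid([normalize(f) for f in feats]))
                 for feats in entities]
    losses = []
    for i, feats in enumerate(entities):
        for f in feats:
            anchor = normalize(f)
            logits = [dot(anchor, c) / temperature for c in centroids]
            # Stable log-sum-exp over all centroids.
            peak = max(logits)
            log_denom = peak + math.log(sum(math.exp(l - peak) for l in logits))
            losses.append(log_denom - logits[i])
    return sum(losses) / len(losses)

# Hypothetical toy data: two entities with two modality features each.
separated = [[[1.0, 0.1], [0.9, 0.0]], [[0.0, 1.0], [0.1, 0.9]]]
mixed = [[[1.0, 0.1], [0.0, 1.0]], [[0.9, 0.0], [0.1, 0.9]]]
loss_separated = centroid_contrastive_loss(separated)
loss_mixed = centroid_contrastive_loss(mixed)
```

When the features of each entity cluster around their own centroid, as in the separated case, the loss is lower than when features from different entities are interleaved, which is the discriminative pressure such a module is meant to apply.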

Possible limitations

Not stated in the abstract

Abstract

arXiv:2606.02659v1 (announce type: cross)

Multimodal data fusion involves integrating and analyzing information from multiple modalities to uncover latent correlations and complementary patterns, thereby enhancing data processing and decision-making. While existing methods for structured multimodal inputs are typically designed around specific tasks and assume fully observed modalities, real-world applications often suffer from uncertain or missing modality inputs due to various factors. Some traditional models overly emphasize local interactions within missing modalities, neglecting the global complementary cues embedded in multimodal representations. To overcome these limitations, we propose a Dynamic Multimodal Data Fusion model based on Contrastive Learning (CL-DMDF). CL-DMDF introduces a novel attention mechanism that operates across both feature and modality dimensions to compute reliable attention scores, effectively reflecting importance at each level. The CL-DMDF further incorporates an entity-centroid contrastive learning module that constructs centroid-based positive samples from entity features to enhance discriminative learning. Additionally, an adaptive fusion module is employed to improve the efficiency and accuracy of dynamic fusion strategies. Extensive experiments conducted on three datasets demonstrate the effectiveness of the CL-DMDF across diverse multimodal fusion tasks.
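
As a rough illustration of attention that operates across both feature and modality dimensions, the sketch below applies one softmax across the features within each modality and a second softmax across modalities, then combines the two sets of weights. The abstract does not specify how CL-DMDF computes or combines its scores, so the function name, the hand-set scores, and the multiplicative combination are assumptions made for this example.

```python
import math

def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def two_level_attention(features, feature_scores, modality_scores):
    """features[m][d] is the value of feature d in modality m.
    feature_scores[m][d] and modality_scores[m] stand in for the outputs
    of learned scorers (hypothetical; fixed by hand in this sketch)."""
    modality_weights = softmax(modality_scores)        # across modalities
    fused = [0.0] * len(features[0])
    for m, row in enumerate(features):
        feature_weights = softmax(feature_scores[m])   # across features in modality m
        for d, value in enumerate(row):
            fused[d] += modality_weights[m] * feature_weights[d] * value
    return modality_weights, fused

# Hypothetical example: two modalities with two features each and equal scores,
# so every weight is uniform at both levels.
features = [[2.0, 4.0], [6.0, 8.0]]
feature_scores = [[0.0, 0.0], [0.0, 0.0]]
modality_scores = [0.0, 0.0]
modality_weights, fused = two_level_attention(features, feature_scores, modality_scores)
# modality_weights == [0.5, 0.5], fused == [2.0, 3.0]
```

Raising one modality's score shifts the fused vector toward that modality, while raising a feature score within a modality emphasizes that feature; the two levels can therefore be inspected separately.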