MVCL-DAF++: Enhancing Multimodal Intent Recognition via Prototype-Aware Contrastive Alignment and Coarse-to-Fine Dynamic Attention Fusion

Haofeng Huang, Yifei Han, Long Zhang, Bin Li, Yangfan He, Yaxin Xue

Published Jun 8, 2026 | Featured #4 | In the daily list Jun 9, 2026


Daily score: 71.0
Editorial review: 7.5
Relevance: 0.452
Freshness: 0.722

Why It Matters

What makes this one worth your time

This research targets two weaknesses of multimodal intent recognition: weak semantic grounding and poor robustness under noisy or rare-class conditions. That makes it relevant to applications that must recognize intent reliably from noisy inputs.

MVCL-DAF++ extends MVCL-DAF with two modules: prototype-aware contrastive alignment and coarse-to-fine attention fusion.

Summary

The paper introduces MVCL-DAF++, an extension of MVCL-DAF for multimodal intent recognition. It adds prototype-aware contrastive alignment and coarse-to-fine attention fusion to improve semantic grounding and robustness, particularly for rare-class recognition.

Key contributions

Introduction of prototype-aware contrastive alignment for improved semantic grounding.
Development of coarse-to-fine attention fusion for better hierarchical cross-modal interaction.
Achievement of state-of-the-art results on the MIntRec and MIntRec2.0 datasets, with rare-class recognition improving by +1.05% and +4.18% WF1, respectively.

Notable insights

The use of prototype-aware contrastive alignment suggests a novel approach to enhancing semantic consistency by leveraging class-level prototypes.
Coarse-to-fine attention fusion integrates global and local features, which may improve the model's ability to handle complex multimodal interactions.
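The first insight above, aligning instances to class-level prototypes, can be illustrated with a small NumPy sketch. The abstract does not give the loss, so everything here is an assumption: prototypes are taken as the mean of L2-normalized embeddings per class, and the loss is a softmax cross-entropy over cosine similarities to those prototypes. The function names and the `temperature` parameter are hypothetical and are not taken from the MVCL-DAF++ code.

```python
import numpy as np

def class_prototypes(embeddings, labels, num_classes):
    """Assumed prototype definition: mean of L2-normalized embeddings per class."""
    z = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    protos = np.zeros((num_classes, z.shape[1]))
    for c in range(num_classes):
        members = z[labels == c]
        if len(members) > 0:  # classes with no samples keep a zero prototype
            protos[c] = members.mean(axis=0)
    return protos

def prototype_contrastive_loss(embeddings, labels, protos, temperature=0.1):
    """Cross-entropy over instance-to-prototype cosine similarities.

    Each instance is pulled toward its own class prototype and pushed
    away from the prototypes of all other classes.
    """
    z = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    norms = np.linalg.norm(protos, axis=1, keepdims=True)
    p = protos / np.maximum(norms, 1e-12)  # guard against zero prototypes
    logits = z @ p.T / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-log_probs[np.arange(len(labels)), labels].mean())

# Toy data: two well-separated classes in 2-D (illustrative values only).
emb = np.array([[1.0, 0.1], [0.9, 0.0], [0.0, 1.0], [0.1, 0.9]])
labels = np.array([0, 0, 1, 1])
protos = class_prototypes(emb, labels, num_classes=2)
loss_aligned = prototype_contrastive_loss(emb, labels, protos)
loss_swapped = prototype_contrastive_loss(emb, 1 - labels, protos)
```

On this toy data the loss is lower when instances are scored against their own class prototype than when the labels are swapped, which is the behavior a prototype-based alignment term is meant to reward.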

Possible limitations

Not stated in the abstract.

Abstract

arXiv:2509.17446v3

Multimodal intent recognition (MMIR) suffers from weak semantic grounding and poor robustness under noisy or rare-class conditions. We propose MVCL-DAF++, which extends MVCL-DAF with two key modules: (1) Prototype-aware contrastive alignment, aligning instances to class-level prototypes to enhance semantic consistency; and (2) Coarse-to-fine attention fusion, integrating global modality summaries with token-level features for hierarchical cross-modal interaction. On MIntRec and MIntRec2.0, MVCL-DAF++ achieves new state-of-the-art results, improving rare-class recognition by +1.05% and +4.18% WF1, respectively. These results demonstrate the effectiveness of prototype-guided learning and coarse-to-fine fusion for robust multimodal understanding. The source code is available at https://github.com/chr1s623/MVCL-DAF-PlusPlus.
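The second module, which combines global modality summaries with token-level features, can be sketched as a two-stage attention step. This is only an illustration under stated assumptions, not the authors' architecture: summaries are mean-pooled, both stages use plain scaled dot-product weights without learned projections, and the name `coarse_to_fine_fusion` is hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def coarse_to_fine_fusion(modalities):
    """Fuse a list of (num_tokens, dim) arrays, one array per modality.

    Coarse stage: pool each modality into one summary vector and weight
    the summaries against their mean to form a global context.
    Fine stage: use the global context as a query over all tokens.
    """
    dim = modalities[0].shape[1]

    # Coarse: one mean-pooled summary per modality, shape (M, dim).
    summaries = np.stack([m.mean(axis=0) for m in modalities])
    anchor = summaries.mean(axis=0)
    modality_weights = softmax(summaries @ anchor / np.sqrt(dim))  # (M,)
    global_ctx = modality_weights @ summaries                      # (dim,)

    # Fine: the global context attends over token-level features.
    tokens = np.concatenate(modalities, axis=0)                    # (T, dim)
    token_weights = softmax(tokens @ global_ctx / np.sqrt(dim))    # (T,)
    fine_ctx = token_weights @ tokens                              # (dim,)

    # Concatenate coarse and fine views into one fused representation.
    fused = np.concatenate([global_ctx, fine_ctx])                 # (2 * dim,)
    return fused, modality_weights, token_weights

# Random stand-ins for text, video, and audio token features.
rng = np.random.default_rng(0)
text = rng.normal(size=(5, 8))
video = rng.normal(size=(3, 8))
audio = rng.normal(size=(4, 8))
fused, modality_weights, token_weights = coarse_to_fine_fusion([text, video, audio])
```

A trained model would normally add learned query, key, and value projections and feed the fused vector to a classifier; they are omitted here to keep the coarse and fine stages easy to see.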