MVCL-DAF++: Enhancing Multimodal Intent Recognition via Prototype-Aware Contrastive Alignment and Coarse-to-Fine Dynamic Attention Fusion
Haofeng Huang, Yifei Han, Long Zhang, Bin Li, Yangfan He, Yaxin Xue
Why It Matters
Multimodal intent recognition suffers from weak semantic grounding and poor robustness under noisy or rare-class conditions, which limits it in applications that must infer intent reliably from imperfect input.
MVCL-DAF++ targets both problems by aligning instances to class-level prototypes and by fusing global modality summaries with token-level features.
Summary
The paper introduces MVCL-DAF++, an enhancement to multimodal intent recognition that incorporates prototype-aware contrastive alignment and coarse-to-fine attention fusion to improve semantic grounding and robustness, particularly for rare-class recognition.
Key contributions
- Introduction of prototype-aware contrastive alignment for improved semantic grounding.
- Development of coarse-to-fine attention fusion for better hierarchical cross-modal interaction.
- State-of-the-art results on MIntRec and MIntRec2.0, with rare-class recognition improving by +1.05% and +4.18% WF1, respectively.
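To make the first contribution concrete, here is a minimal NumPy sketch of aligning instances to class-level prototypes. It is an illustration under assumptions, not the authors' implementation: the prototype is taken to be the per-class mean embedding, the loss is a cross-entropy over cosine similarities, and the names `class_prototypes`, `prototype_contrastive_loss`, and the `temperature` value are hypothetical.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def class_prototypes(embeddings, labels, num_classes):
    """Assumed prototype definition: the mean embedding of each class."""
    protos = np.zeros((num_classes, embeddings.shape[1]))
    for c in range(num_classes):
        mask = labels == c
        if mask.any():  # a class absent from the batch keeps a zero prototype
            protos[c] = embeddings[mask].mean(axis=0)
    return protos

def prototype_contrastive_loss(embeddings, labels, prototypes, temperature=0.1):
    """Cross-entropy over instance-to-prototype cosine similarities.

    Each instance is pulled toward the prototype of its own class and pushed
    away from the prototypes of all other classes.
    """
    z = l2_normalize(embeddings)
    p = l2_normalize(prototypes)
    logits = z @ p.T / temperature                   # (batch, num_classes)
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()

# Toy data: three well-separated classes in an 8-dimensional space.
rng = np.random.default_rng(0)
labels = np.array([0, 0, 1, 1, 2, 2])
centers = np.eye(3, 8) * 5.0
embeddings = centers[labels] + 0.1 * rng.normal(size=(6, 8))

protos = class_prototypes(embeddings, labels, num_classes=3)
aligned = prototype_contrastive_loss(embeddings, labels, protos)
shuffled = prototype_contrastive_loss(embeddings, np.roll(labels, 1), protos)
```

On this toy data the loss is lower when labels match the clusters (`aligned`) than when they are shifted (`shuffled`), which is the behavior an alignment objective of this kind is meant to reward.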
Notable insights
- The use of prototype-aware contrastive alignment suggests a novel approach to enhancing semantic consistency by leveraging class-level prototypes.
- Coarse-to-fine attention fusion integrates global and local features, which may improve the model's ability to handle complex multimodal interactions.
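The coarse-to-fine idea can be sketched in a few lines of NumPy. This is a simplified illustration, not the paper's architecture: it assumes mean-pooled modality summaries, unparameterized scaled dot-product attention, and a residual sum of the two stages. The function name `coarse_to_fine_fusion` and the `text`, `video`, and `audio` inputs are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def coarse_to_fine_fusion(modalities):
    """Fuse a list of (num_tokens, dim) arrays that share the same dim."""
    dim = modalities[0].shape[1]
    scale = np.sqrt(dim)

    # Coarse stage: one mean-pooled summary per modality, weighted by how
    # strongly each summary agrees with the average summary.
    summaries = np.stack([m.mean(axis=0) for m in modalities])  # (M, dim)
    query = summaries.mean(axis=0)                              # (dim,)
    modality_weights = softmax(summaries @ query / scale)       # (M,)
    coarse = modality_weights @ summaries                       # (dim,)

    # Fine stage: the coarse vector attends over every token of every modality.
    tokens = np.concatenate(modalities, axis=0)                 # (total, dim)
    token_weights = softmax(tokens @ coarse / scale)            # (total,)
    fine = token_weights @ tokens                               # (dim,)

    # Residual combination of the global and token-level views.
    return coarse + fine, modality_weights, token_weights

# Hypothetical inputs: three modalities with different sequence lengths.
rng = np.random.default_rng(0)
text = rng.normal(size=(12, 16))
video = rng.normal(size=(20, 16))
audio = rng.normal(size=(30, 16))

fused, modality_weights, token_weights = coarse_to_fine_fusion([text, video, audio])
```

A trained model would use learned projections for queries, keys, and values. They are omitted here so that the two-stage flow, from global summaries down to individual tokens, stays visible.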
Possible limitations
- Not stated in the abstract.
Abstract
arXiv:2509.17446v3
Multimodal intent recognition (MMIR) suffers from weak semantic grounding and poor robustness under noisy or rare-class conditions. We propose MVCL-DAF++, which extends MVCL-DAF with two key modules: (1) Prototype-aware contrastive alignment, aligning instances to class-level prototypes to enhance semantic consistency; and (2) Coarse-to-fine attention fusion, integrating global modality summaries with token-level features for hierarchical cross-modal interaction. On MIntRec and MIntRec2.0, MVCL-DAF++ achieves new state-of-the-art results, improving rare-class recognition by +1.05% and +4.18% WF1, respectively. These results demonstrate the effectiveness of prototype-guided learning and coarse-to-fine fusion for robust multimodal understanding. The source code is available at https://github.com/chr1s623/MVCL-DAF-PlusPlus.
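The abstract reports gains in WF1 without expanding the acronym. Reading it as the support-weighted F1 score, which is an assumption here, the metric can be sketched in plain Python as follows. The function name `weighted_f1` and the toy labels are hypothetical.

```python
from collections import Counter

def weighted_f1(y_true, y_pred):
    """Per-class F1 averaged with weights proportional to class support."""
    support = Counter(y_true)
    total = len(y_true)
    score = 0.0
    for cls, n in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != cls and p == cls)
        fn = n - tp
        # tp + fn equals the class support, so the denominator is never zero.
        f1 = 2 * tp / (2 * tp + fp + fn)
        score += (n / total) * f1
    return score

# Toy example: class "a" has three instances, the rarer class "b" has one.
y_true = ["a", "a", "a", "b"]
y_pred = ["a", "a", "b", "b"]
wf1 = weighted_f1(y_true, y_pred)
```

Because each class is weighted by its support, a rare class contributes little to the overall score. Under this reading, that is presumably why the gains are reported for rare classes specifically. The same quantity is available in scikit-learn as `f1_score(y_true, y_pred, average="weighted")`.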