MODF-SIR: A Multi-agent Omni-modal Distilled Framework for Social Intelligence Reasoning
Shang Ma, Jisheng Dang, Wencan Zhang, Yifan Zhang, Bimei Wang, Hong Peng, Bin Hu, Qi Tian, Tat-Seng Chua
Why It Matters
What makes this one worth your time
This work addresses a specific weakness of models on social intelligence tasks: rare but decisive long-tail events tend to be drowned out by common head events and environmental noise, so subtle social cues are missed.
It proposes a framework for social intelligence reasoning that combines multi-agent collaboration with knowledge distillation.
Summary
The paper presents a multi-agent framework leveraging a lightweight Multimodal Large Language Model (MLLM) for social intelligence reasoning, incorporating knowledge distillation and Test-Time Adaptation (TTA) to enhance performance on long-tail events.
Key contributions
- Development of a multi-agent collaborative framework for social intelligence reasoning.
- Implementation of knowledge distillation in both training and inference phases.
- Introduction of a formatting strategy that renders long-tail events as explicit text, so they are not overshadowed by head events and environmental noise during tokenization.
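The formatting strategy in the last contribution can be illustrated with a minimal sketch. The abstract does not give the actual text format, so everything here is an assumption made for illustration: the `Event` fields, the `tail_threshold` cutoff, and the `[LONG-TAIL EVENT n]` line layout are hypothetical and are not taken from the MODF-SIR code.

```python
from dataclasses import dataclass


@dataclass
class Event:
    """A detected event. All field names are hypothetical."""
    label: str        # short description of what happened
    modality: str     # "video", "audio", or "text"
    start_s: float    # start time in seconds
    frequency: float  # assumed prior frequency of this event type, in [0, 1]


def format_long_tail_events(events, tail_threshold=0.05):
    """Render rare events as explicit text lines, rarest first.

    Events with frequency below `tail_threshold` are treated as long-tail
    (an assumed criterion); common head events are omitted from the text.
    """
    tail = [e for e in events if e.frequency < tail_threshold]
    tail.sort(key=lambda e: e.frequency)
    lines = [
        f"[LONG-TAIL EVENT {i}] t={e.start_s:.1f}s ({e.modality}): {e.label}"
        for i, e in enumerate(tail, start=1)
    ]
    return "\n".join(lines)


events = [
    Event("speaker is talking", "audio", 0.0, 0.90),
    Event("listener briefly rolls eyes", "video", 3.2, 0.01),
    Event("sarcastic tone on 'great'", "audio", 4.5, 0.03),
]
prompt_block = format_long_tail_events(events)
print(prompt_block)
```

The resulting block would be prepended to the model prompt, so the rare cues reach the model as ordinary text tokens rather than relying on the raw audio or video tokens alone.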
Notable insights
- Test-Time Adaptation (TTA) is applied across the entire reasoning pipeline rather than to a single stage: long-tail event extraction and representation, Chain-of-Thought (CoT) prompting, and self-reflection.
- The TTA mechanism is itself distillation-enhanced and uses Low-Rank Adaptation (LoRA) to fine-tune the foundation model for instance-level reasoning, which keeps the number of trainable parameters small.
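To make the LoRA point concrete, here is a minimal pure-Python sketch of the standard LoRA update, W' = W + (alpha / r) * B A, with W frozen and only the low-rank factors A and B trained. The matrix sizes, rank, and `alpha` value are illustrative assumptions; the abstract does not report the configuration used in the paper.

```python
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_effective_weight(w, a, b, alpha):
    """Return W + (alpha / r) * (B @ A), where r is the rank (rows of A)."""
    r = len(a)
    delta = matmul(b, a)  # shape (d_out, d_in)
    scale = alpha / r
    return [
        [w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
        for w_row, d_row in zip(w, delta)
    ]


def lora_param_counts(d_out, d_in, r):
    """Trainable parameters for full fine-tuning vs. LoRA adapters."""
    return d_out * d_in, r * (d_out + d_in)


d_out, d_in, r, alpha = 6, 8, 2, 4.0  # illustrative sizes only
rng = random.Random(0)
w = [[rng.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]  # frozen weight
a = [[rng.gauss(0, 0.01) for _ in range(d_in)] for _ in range(r)]   # trainable
b = [[0.0] * r for _ in range(d_out)]                               # trainable, zero init

# With B initialised to zero, the adapted weight equals the frozen weight.
unchanged = lora_effective_weight(w, a, b, alpha) == w

full, lora = lora_param_counts(4096, 4096, 8)
print(unchanged, full, lora)
```

For a hypothetical 4096 x 4096 layer at rank 8, the adapter trains 65,536 parameters instead of 16,777,216, which is why per-instance adaptation at test time can stay cheap.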
Possible limitations
- Not stated in the abstract.
Abstract
arXiv:2606.12018v1
We propose a multi-agent collaborative framework built upon a lightweight Multimodal Large Language Model (MLLM), specifically designed for social intelligence reasoning. A key feature of our approach is that both the training and inference phases are augmented via knowledge distillation. Within this architecture, multi-modal data pertinent to social intelligence is precisely localized. Furthermore, relevant long-tail events are identified, extracted, and rendered as formatted, explicit text. This formatting strategy prevents critical long-tail information from being overshadowed by head events and environmental noise during the tokenization process. Specifically, we integrate Test-Time Adaptation (TTA) across the entire reasoning pipeline, encompassing the extraction and representation of long-tail events, Chain-of-Thought (CoT) prompting, and self-reflection. This TTA mechanism is also distillation-enhanced, utilizing Low-Rank Adaptation (LoRA) to fine-tune the foundation model exclusively for instance-level reasoning. Extensive evaluations against various open-source and proprietary AI models across multiple benchmarks demonstrate the effectiveness of the proposed framework. With around 30% of the training data from IntentTrain, we achieve state-of-the-art results. Code is available at https://github.com/eeee-sys/MODF-SIR, a demo at https://huggingface.co/spaces/Harry-1234/MODF-SIR, the LoRA weights at https://huggingface.co/Harry-1234/MODF-SIR, and the dataset for training the router at https://huggingface.co/datasets/Harry-1234/IntentRouterTrain.
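The abstract says training and inference are augmented via knowledge distillation but does not state the objective. As a point of reference only, the sketch below shows the common soft-target formulation: the KL divergence between temperature-softened teacher and student distributions, scaled by T squared. Whether MODF-SIR uses this loss is an assumption, and the function names and logits are hypothetical.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over temperature-scaled logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, times T^2.

    This is the generic soft-target loss, not a loss confirmed by the paper.
    """
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return temperature ** 2 * kl


teacher = [2.0, 0.5, -1.0]  # hypothetical teacher logits over 3 answers
student = [1.0, 1.0, 0.0]   # hypothetical student logits

loss_mismatch = distillation_loss(student, teacher)
loss_match = distillation_loss(teacher, teacher)
print(loss_mismatch > loss_match)
```

The loss is zero when the student reproduces the teacher's distribution and grows as the two diverge, so minimizing it pulls the lightweight model toward the larger model's answer preferences.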