MODF-SIR: A Multi-agent Omni-modal Distilled Framework for Social Intelligence Reasoning

Shang Ma, Jisheng Dang, Wencan Zhang, Yifan Zhang, Bimei Wang, Hong Peng, Bin Hu, Qi Tian, Tat-Seng Chua

Published Jun 11, 2026 | Featured #8 | In the daily list Jun 12, 2026


Daily score: 67.3

Editorial review: 7.5

Relevance: 0.454

Freshness: 0.722

Why It Matters

What makes this one worth your time

This work addresses the challenge of effectively utilizing long-tail events in social intelligence tasks, which is crucial for improving AI's understanding of nuanced social interactions.

The paper introduces a framework for social intelligence reasoning that combines multi-agent collaboration with knowledge distillation.

Summary

The paper presents a multi-agent framework leveraging a lightweight Multimodal Large Language Model (MLLM) for social intelligence reasoning, incorporating knowledge distillation and Test-Time Adaptation (TTA) to enhance performance on long-tail events.

Key contributions

Development of a multi-agent collaborative framework for social intelligence reasoning.
Implementation of knowledge distillation in both training and inference phases.
Introduction of a formatting strategy that renders long-tail events as explicit text, so that they are not overshadowed by head events and environmental noise during tokenization.
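The formatting idea in the last contribution can be illustrated with a minimal sketch. The abstract does not specify how events are represented or selected, so everything here is an assumption made for illustration: the `Event` record and its fields, the `format_long_tail_events` helper, and the `max_frequency` threshold used to decide what counts as long-tail. It is not the authors' implementation.

```python
from dataclasses import dataclass


@dataclass
class Event:
    # Hypothetical record for one localized event; all field names are assumptions.
    start_s: float
    end_s: float
    modality: str      # e.g. "audio" or "video"
    description: str
    frequency: float   # assumed relative frequency of this event type, in [0, 1]


def format_long_tail_events(events, max_frequency=0.05):
    """Render rare (long-tail) events as explicit, numbered text lines."""
    # Keep only events at or below the assumed rarity threshold.
    rare = [e for e in events if e.frequency <= max_frequency]
    # Order by start time so the text follows the timeline of the clip.
    rare.sort(key=lambda e: e.start_s)
    lines = [
        f"[{i}] {e.start_s:.1f}s-{e.end_s:.1f}s ({e.modality}): {e.description}"
        for i, e in enumerate(rare, start=1)
    ]
    return "\n".join(lines)


events = [
    Event(0.0, 12.0, "video", "two people talk at a table", 0.60),
    Event(7.2, 7.9, "video", "listener briefly rolls eyes", 0.02),
    Event(3.1, 3.6, "audio", "speaker's voice cracks", 0.03),
]
prompt_block = format_long_tail_events(events)
print(prompt_block)
```

In this sketch the common "head" event is dropped from the text block and the two rare cues are written out explicitly, which is one plausible reading of how explicit text could keep such cues visible to the model.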

Notable insights

Test-Time Adaptation (TTA) is applied across the entire reasoning pipeline (long-tail event extraction and representation, Chain-of-Thought prompting, and self-reflection) rather than at a single stage, so the model adapts to each instance at inference time.
The TTA step uses Low-Rank Adaptation (LoRA) to fine-tune the foundation model for instance-level reasoning, which updates only a small set of adapter weights instead of the full model.
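To make the resource argument behind LoRA concrete, here is a small sketch of the standard low-rank update, W' = W + (alpha / r) * B * A, where B is d_out x r and A is r x d_in. The layer size (4096 x 4096), rank (8), and scaling value are illustrative assumptions; the abstract does not state which layers are adapted or which rank is used.

```python
def full_finetune_params(d_out, d_in):
    """Number of trainable weights when the whole matrix W is updated."""
    return d_out * d_in


def lora_params(d_out, d_in, rank):
    """Number of trainable weights in the adapters B (d_out x r) and A (r x d_in)."""
    return rank * (d_out + d_in)


def lora_delta(B, A, alpha, rank):
    """Compute the weight update (alpha / rank) * B @ A using plain lists."""
    scale = alpha / rank
    return [
        [scale * sum(B[i][k] * A[k][j] for k in range(rank)) for j in range(len(A[0]))]
        for i in range(len(B))
    ]


# Illustrative sizes only (assumed, not taken from the paper).
full = full_finetune_params(4096, 4096)
adapter = lora_params(4096, 4096, rank=8)
print(full, adapter, adapter / full)

# Tiny rank-1 example of the update itself.
delta = lora_delta(B=[[1.0], [2.0]], A=[[3.0, 4.0]], alpha=2.0, rank=1)
print(delta)
```

Under these assumed sizes the adapter trains 65,536 weights against 16,777,216 for the full matrix, roughly 0.4%, which is why per-instance fine-tuning at test time can be affordable.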

Possible limitations

Not stated in the abstract.

Abstract

arXiv:2606.12018v1

We propose a multi-agent collaborative framework built upon a lightweight Multimodal Large Language Model (MLLM), specifically designed for social intelligence reasoning. A key feature of our approach is that both the training and inference phases are augmented via knowledge distillation. Within this architecture, multi-modal data pertinent to social intelligence is precisely localized. Furthermore, relevant long-tail events are identified, extracted, and rendered as formatted, explicit text. This formatting strategy prevents critical long-tail information from being overshadowed by head events and environmental noise during the tokenization process. Specifically, we integrate Test-Time Adaptation (TTA) across the entire reasoning pipeline, encompassing the extraction and representation of long-tail events, Chain-of-Thought (CoT) prompting, and self-reflection. This TTA mechanism is also distillation-enhanced, utilizing Low-Rank Adaptation (LoRA) to fine-tune the foundation model exclusively for instance-level reasoning. Extensive evaluations against various open-source and proprietary AI models across multiple benchmarks demonstrate the effectiveness of the proposed framework. With around 30% of training data from IntentTrain, we achieve state-of-the-art results. Codes are available at https://github.com/eeee-sys/MODF-SIR, demo is available at https://huggingface.co/spaces/Harry-1234/MODF-SIR, LoRA is available at https://huggingface.co/Harry-1234/MODF-SIR and the dataset for training router is available at https://huggingface.co/datasets/Harry-1234/IntentRouterTrain.