To Be Multimodal or Not to Be: Query-Adaptive Audio-Visual Person Retrieval via Active Modality Detection

Erfan Loweimi, Mengjie Qian, Kate Knill, Guanfeng Wu, Chi-Ho Chan, Abbas Haider, Muhammad Awan, Josef Kittler, Hui Wang, Mark Gales

Published Jun 6, 2026


Editorial review: 7.2

Relevance: 0.477

Freshness: 0.000

Why It Matters

What makes this one worth your time

This approach can improve the precision of person retrieval systems in real-world scenarios where modalities may be inconsistently available, offering practical benefits for media archives and surveillance applications.

Adaptive modality detection enhances audio-visual person retrieval by leveraging cross-modal score consistency.

Summary

The paper proposes a query-adaptive framework for audio-visual person retrieval that detects active modalities based on cross-modal score consistency, achieving high detection accuracy and outperforming unimodal and fixed fusion systems on a large video corpus.

Key contributions

Introduction of a query-adaptive framework for modality detection in person retrieval.
Demonstration of improved retrieval performance on a large-scale video corpus.
Achievement of high detection accuracy for active modalities using cross-modal features.

Notable insights

Cross-modal score consistency is used to detect active modalities, improving retrieval accuracy.
The framework adapts to the presence or absence of modalities, avoiding noise from irrelevant data.
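The consistency idea in these insights can be illustrated with a small sketch. The abstract does not specify how the cross-modal features are computed, so the function below is an assumption made for illustration: it takes the top-k files under one modality and reports their mean percentile rank under the other. The name `cross_modal_consistency`, the parameter `k`, and the toy scores are all hypothetical.

```python
from statistics import mean


def cross_modal_consistency(scores_a, scores_b, k=2):
    """Mean percentile rank, under modality B, of the top-k files by modality A.

    Both arguments map file id -> similarity score for one query and must
    share the same file ids. The result lies in (0, 1]; values near 1 mean
    the files favoured by A are also ranked highly by B.
    Illustrative feature only, not the paper's definition.
    """
    top_a = sorted(scores_a, key=scores_a.get, reverse=True)[:k]
    ranked_b = sorted(scores_b, key=scores_b.get)  # ascending: best file last
    n = len(ranked_b)
    percentile = {f: (i + 1) / n for i, f in enumerate(ranked_b)}
    return mean(percentile[f] for f in top_a)


# Toy speaker scores for one query over six files (hypothetical values).
speaker = {"a": 0.90, "b": 0.80, "c": 0.10, "d": 0.20, "e": 0.15, "f": 0.05}
# Face scores when the target is also visible: the same files score highly.
face_present = {"a": 0.85, "b": 0.90, "c": 0.20, "d": 0.10, "e": 0.05, "f": 0.15}
# Face scores when the target is never on screen: the ranking is unrelated.
face_absent = {"a": 0.10, "b": 0.05, "c": 0.90, "d": 0.80, "e": 0.20, "f": 0.15}

agree = cross_modal_consistency(speaker, face_present)
disagree = cross_modal_consistency(speaker, face_absent)
print(f"both active: {agree:.3f}, face absent: {disagree:.3f}")
```

In a full system, features of this kind (computed in both directions) would feed a classifier that predicts which modalities are active, and the prediction would then decide whether to use speaker scores, face scores, or their fusion.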

Possible limitations

Not stated in the abstract

Abstract

arXiv:2606.05931v1

When retrieving a person from a video archive by voice and face, should the system be multimodal or not? In real-world broadcast archives, unlike curated benchmarks, a target may be heard but unseen, seen but unheard, or both. Fusing scores from an absent modality injects noise, degrading precision below the best unimodal system. We propose a query-adaptive framework that detects active modalities via cross-modal score consistency: when both modalities are active, files retrieved by one also score highly on the other; this agreement breaks down when a modality is absent. Classifiers driven by these cross-modal features achieve 89% detection accuracy. On the BBC Rewind corpus (with over 12,000 broadcast videos) the adaptive system attains 94.2% P@1, outperforming speaker-only (82.9%), face-only (93.4%), and fixed fusion (90.0%), recovering 64% of the gap to an oracle with ground-truth modality labels (96.6%).
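The abstract does not say which baseline the "64% of the gap" figure is measured from. A quick check with the reported P@1 values suggests it is the fixed-fusion system rather than the best unimodal one. The helper name `gap_recovered` is hypothetical; the numbers are taken from the abstract.

```python
def gap_recovered(baseline, system, oracle):
    """Fraction of the baseline-to-oracle gap that the system closes."""
    return (system - baseline) / (oracle - baseline)


# P@1 values (percent) as reported in the abstract.
P_AT_1 = {
    "speaker_only": 82.9,
    "face_only": 93.4,
    "fixed_fusion": 90.0,
    "adaptive": 94.2,
    "oracle": 96.6,
}

vs_fusion = gap_recovered(P_AT_1["fixed_fusion"], P_AT_1["adaptive"], P_AT_1["oracle"])
vs_face = gap_recovered(P_AT_1["face_only"], P_AT_1["adaptive"], P_AT_1["oracle"])

print(f"relative to fixed fusion: {vs_fusion:.1%}")
print(f"relative to face-only:    {vs_face:.1%}")
```

Measured from fixed fusion, the adaptive system closes 4.2 of 6.6 points, which rounds to the reported 64%. Measured from face-only, it closes 0.8 of 3.2 points, or about 25%.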