VGGSounder: Audio-Visual Evaluations for Foundation Models

Daniil Zverev, Thaddäus Wiedemer, Ameya Prabhu, Matthias Bethge, Wieland Brendel, A. Sophia Koepke

Published Jul 1, 2026 | Featured #10 | In the daily list Jul 2, 2026

Daily score: 53.8

Editorial review: 6.8

Relevance: 0.459

Freshness: 0.722

Why It Matters

What makes this one worth your time

Improving the evaluation of audio-visual models is crucial for developing more robust and reliable multi-modal AI systems.

VGGSounder enhances audio-visual model evaluation by providing a more accurate and detailed test set.

Summary

The paper introduces VGGSounder, a re-annotated multi-label test set designed to evaluate audio-visual foundation models, addressing limitations in the VGGSound dataset such as incomplete labeling and misaligned modalities.

Key contributions

Re-annotation of the VGGSound dataset to create VGGSounder.
Development of a modality confusion metric for evaluating multi-modal models.

Notable insights

The modality confusion metric is used to analyze how performance degrades when a second input modality is added.
Detailed modality annotations in VGGSounder allow for precise analyses of modality-specific performance.
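
The abstract does not define the modality confusion metric, so the following is only a minimal sketch under one assumed reading: the share of ground-truth labels that a model recovers from a single modality but loses once the second modality is added. The function name, the choice of denominator, and the example labels are all hypothetical and are not taken from the paper.

```python
from typing import Sequence, Set


def modality_confusion(
    gold: Sequence[Set[str]],
    pred_unimodal: Sequence[Set[str]],
    pred_audiovisual: Sequence[Set[str]],
) -> float:
    """Assumed definition, NOT the paper's: fraction of gold labels that are
    predicted from one modality alone but missed with audio-visual input."""
    if not (len(gold) == len(pred_unimodal) == len(pred_audiovisual)):
        raise ValueError("all inputs must cover the same samples")

    total = 0  # number of (sample, gold label) pairs
    lost = 0   # pairs found unimodally but lost with both modalities
    for gold_labels, uni, av in zip(gold, pred_unimodal, pred_audiovisual):
        for label in gold_labels:
            total += 1
            if label in uni and label not in av:
                lost += 1
    return lost / total if total else 0.0


# Hypothetical multi-label example with three clips and four gold labels.
gold = [{"dog barking"}, {"playing guitar", "male singing"}, {"rain"}]
audio_only = [{"dog barking"}, {"playing guitar", "male singing"}, set()]
audio_visual = [{"dog barking"}, {"playing guitar"}, {"rain"}]

score = modality_confusion(gold, audio_only, audio_visual)
print(score)  # 0.25: only "male singing" is lost when video is added
```

Normalising by all gold labels keeps the score comparable across models; dividing by the number of labels that were correct unimodally would be an equally defensible choice and yields a conditional rate instead.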

Possible limitations

Not stated in the abstract

Abstract

arXiv:2508.08237v5

The emergence of audio-visual foundation models underscores the importance of reliably assessing their multi-modal understanding. The VGGSound dataset is commonly used as a benchmark for evaluating audio-visual classification. However, our analysis identifies several limitations of VGGSound, including incomplete labelling, partially overlapping classes, and misaligned modalities. These lead to distorted evaluations of auditory and visual capabilities. To address these limitations, we introduce VGGSounder, a comprehensively re-annotated, multi-label test set that extends VGGSound and is specifically designed to evaluate audio-visual foundation models. VGGSounder features detailed modality annotations, enabling precise analyses of modality-specific performance. Furthermore, we reveal model limitations by analysing performance degradation when adding another input modality with our new modality confusion metric.