VGGSounder: Audio-Visual Evaluations for Foundation Models

Daniil Zverev, Thaddäus Wiedemer, Ameya Prabhu, Matthias Bethge, Wieland Brendel, A. Sophia Koepke

Published Jun 5, 2026
Featured #7
In the daily list Jun 6, 2026
Daily score: 69.3
Editorial review: 7.5
Relevance: 0.458
Freshness: 0.722

Why It Matters

What makes this one worth your time

This work is significant for researchers and engineers developing audio-visual models, as it provides a more reliable benchmark for assessing multi-modal understanding.

VGGSounder enhances audio-visual model evaluation with improved annotations and a new confusion metric.

Summary

The paper introduces VGGSounder, a re-annotated multi-label test set designed to evaluate audio-visual foundation models, addressing limitations in the existing VGGSound dataset.

Key contributions

  • Development of the VGGSounder dataset with comprehensive re-annotations.
  • Introduction of a new modality confusion metric for performance analysis.
  • Identification and analysis of limitations in the existing VGGSound dataset.

Notable insights

  • The modality confusion metric quantifies how much performance degrades when a second input modality is added, which exposes failures that aggregate accuracy hides.
  • The limitations identified in VGGSound (incomplete labelling, partially overlapping classes, and misaligned modalities) distort evaluations of auditory and visual capabilities.

Possible limitations

  • Not stated in the abstract.

Abstract

arXiv:2508.08237v4

The emergence of audio-visual foundation models underscores the importance of reliably assessing their multi-modal understanding. The VGGSound dataset is commonly used as a benchmark for evaluating audio-visual classification. However, our analysis identifies several limitations of VGGSound, including incomplete labelling, partially overlapping classes, and misaligned modalities. These lead to distorted evaluations of auditory and visual capabilities. To address these limitations, we introduce VGGSounder, a comprehensively re-annotated, multi-label test set that extends VGGSound and is specifically designed to evaluate audio-visual foundation models. VGGSounder features detailed modality annotations, enabling precise analyses of modality-specific performance. Furthermore, we reveal model limitations by analysing performance degradation when adding another input modality with our new modality confusion metric.
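
The abstract does not define the modality confusion metric, so the sketch below is only one plausible reading of it: the share of test samples that a model gets right from a single modality but gets wrong once the second modality is added. The function names, the hit criterion (any predicted label in the sample's label set), and the toy labels are all assumptions made for illustration. None of them is taken from the paper or its code.

```python
from typing import Sequence, Set


def is_hit(predicted: Set[str], ground_truth: Set[str]) -> bool:
    """Assumed hit criterion: at least one predicted label is a true label."""
    return bool(predicted & ground_truth)


def modality_confusion(
    unimodal_preds: Sequence[Set[str]],
    multimodal_preds: Sequence[Set[str]],
    labels: Sequence[Set[str]],
) -> float:
    """Fraction of samples solved with one modality but missed with both.

    Illustrative definition only; the paper's exact formula may differ.
    """
    if not (len(unimodal_preds) == len(multimodal_preds) == len(labels)):
        raise ValueError("all inputs must have the same length")
    if not labels:
        return 0.0
    confused = sum(
        is_hit(uni, truth) and not is_hit(multi, truth)
        for uni, multi, truth in zip(unimodal_preds, multimodal_preds, labels)
    )
    return confused / len(labels)


# Hypothetical multi-label ground truth and predictions for four clips.
labels = [
    {"dog barking"},
    {"playing guitar", "male singing"},
    {"rain"},
    {"car passing"},
]
audio_only = [{"dog barking"}, {"male singing"}, {"thunder"}, {"car passing"}]
audio_visual = [{"dog barking"}, {"playing drums"}, {"rain"}, {"truck"}]

# Clips 2 and 4 are solved from audio alone but missed with audio + video.
print(modality_confusion(audio_only, audio_visual, labels))
```

Under this reading, a high value means the added modality often overrides a prediction that was correct on its own, which is the kind of degradation the abstract describes.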