GlobeAudio: A Multilingual Multicultural Benchmark for Naturalistic Evaluation of Large Audio-Language Models
Ryner Tan, Wenxuan Zhang
Why It Matters
What makes this one worth your time
GlobeAudio is a new benchmark for testing Large Audio-Language Models (LALMs) on multilingual and multicultural audio understanding.
Knowing where current LALMs fall short on real-world, multilingual, and multicultural audio is crucial for developing more effective and inclusive AI systems.
Summary
The paper introduces GlobeAudio, a multilingual and multicultural benchmark designed to evaluate the naturalistic audio understanding capabilities of Large Audio-Language Models (LALMs). It consists of 5,637 multiple-choice questions across six typologically diverse languages, written by native speakers and grounded in naturally occurring audio. The study evaluates closed-source and open-source LALMs as well as cascaded ASR-LLM pipelines. It finds substantial performance gaps under natural acoustic conditions, particularly for open-source models and low-resource languages, which the authors read as evidence of critical limitations in current models.
Key contributions
- Introduction of GlobeAudio, a multilingual and multicultural benchmark for LALMs.
- Systematic evaluation of LALMs and ASR-LLM pipelines under natural acoustic conditions.
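To make the term "cascaded ASR-LLM pipeline" concrete, here is a minimal sketch of how such a pipeline could answer one multiple-choice question. It is not the authors' implementation: `cascaded_answer`, `transcribe`, `choose`, and the prompt wording are hypothetical names and choices made for illustration, and the two stub models stand in for a real speech recognizer and a real text LLM.

```python
from typing import Callable, Sequence

OPTION_LABELS = "ABCDEFGH"  # assumed labeling scheme, not taken from the paper


def cascaded_answer(
    audio: bytes,
    question: str,
    options: Sequence[str],
    transcribe: Callable[[bytes], str],  # hypothetical ASR stage
    choose: Callable[[str], str],        # hypothetical text-only LLM stage
) -> str:
    """Answer a multiple-choice question by transcribing first, then reasoning over text."""
    # Stage 1: speech recognition turns the audio into a transcript.
    transcript = transcribe(audio)

    # Stage 2: the text model sees only the transcript, the question, and the options.
    lines = [f"Transcript: {transcript}", f"Question: {question}"]
    lines += [f"{OPTION_LABELS[i]}. {opt}" for i, opt in enumerate(options)]
    lines.append("Answer with a single option letter.")
    reply = choose("\n".join(lines))

    # Normalize the reply to its first character as an upper-case option letter.
    return reply.strip().upper()[:1]


# Stub models so the sketch runs without any external service.
def fake_asr(audio: bytes) -> str:
    return "hello world"


def fake_llm(prompt: str) -> str:
    return " b "


result = cascaded_answer(b"", "What is said?", ["goodbye", "hello world"], fake_asr, fake_llm)
```

In this sketch only the transcript reaches the text model, so any acoustic information outside the recognized words is unavailable to the second stage. An end-to-end LALM would instead receive the audio directly.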
Notable insights
- The benchmark uses naturally occurring audio and native speaker expertise to ensure cultural and linguistic authenticity.
- The evaluation reveals substantial performance gaps under natural acoustic conditions, particularly for open-source models and for low-resource languages.
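Per-language gaps of this kind are typically measured by scoring multiple-choice predictions separately for each language. The following is a minimal sketch of that computation, assuming each record carries a language tag, a gold option label, and a predicted option label. The field names `language`, `answer`, and `prediction` are hypothetical and may not match the released dataset's schema, and the toy records are invented for illustration, not GlobeAudio data or results.

```python
from collections import defaultdict


def accuracy_by_language(records):
    """Return {language: accuracy} for a list of multiple-choice predictions.

    Each record is a dict with the (hypothetical) keys 'language',
    'answer' (gold option label) and 'prediction' (model's option label).
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for record in records:
        lang = record["language"]
        total[lang] += 1
        if record["prediction"] == record["answer"]:
            correct[lang] += 1
    # Every language in `total` has at least one record, so no division by zero.
    return {lang: correct[lang] / total[lang] for lang in total}


# Toy records, invented for illustration only.
toy_records = [
    {"language": "lang_a", "answer": "A", "prediction": "A"},
    {"language": "lang_a", "answer": "C", "prediction": "B"},
    {"language": "lang_b", "answer": "D", "prediction": "D"},
]
scores = accuracy_by_language(toy_records)
```

Reporting accuracy per language rather than one pooled number keeps a strong result on a high-resource language from hiding a weak one elsewhere.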
Possible limitations
- Not stated in the abstract
Abstract
arXiv:2606.08194v1
Large Audio-Language Models (LALMs) integrate audio perception and language understanding within a unified framework, enabling a wide range of real-world applications. Despite recent advances, evaluation for LALMs remains heavily underspecified relative to real-world requirements: most lack true linguistic and cultural authenticity, while others fail to capture acoustic realism. To bridge this gap, we propose GlobeAudio, a multilingual and multicultural benchmark designed to evaluate naturalistic audio understanding. GlobeAudio consists of 5,637 multiple-choice questions across six typologically diverse languages, expertly crafted by native speakers grounded on naturally occurring audio. In order to do well, models must possess higher-level auditory reasoning skills and culturally grounded interpretation. We systematically evaluate representative closed-source and open-source LALMs, as well as cascaded ASR-LLM pipelines. Our experiments reveal substantial performance gaps under natural acoustic conditions, particularly for open-source models and low-resource languages. These findings highlight critical limitations of current LALMs and underscore the importance of naturalistic audio evaluation for future audio-language systems. GlobeAudio can be found at https://huggingface.co/datasets/iNLP-Lab/GlobeAudio .