GlobeAudio: A Multilingual Multicultural Benchmark for Naturalistic Evaluation of Large Audio-Language Models

Ryner Tan, Wenxuan Zhang

Published Jun 9, 2026

Editorial review: 6.8

Relevance: 0.477

Freshness: 0.000

Why It Matters

Understanding the limitations of current LALMs in real-world, multilingual, and multicultural contexts is crucial for developing more effective and inclusive AI systems.

GlobeAudio is a new benchmark for testing LALMs on multilingual and multicultural audio understanding.

Summary

The paper introduces GlobeAudio, a multilingual and multicultural benchmark designed to evaluate the naturalistic audio understanding of Large Audio-Language Models (LALMs). It consists of 5,637 multiple-choice questions across six typologically diverse languages, written by native speakers and grounded in naturally occurring audio. The study evaluates closed-source and open-source LALMs as well as cascaded ASR-LLM pipelines. It finds substantial performance gaps under natural acoustic conditions, particularly for open-source models and low-resource languages, which points to critical limitations of current models.

Key contributions

Introduction of GlobeAudio, a multilingual and multicultural benchmark for LALMs.
Systematic evaluation of LALMs and ASR-LLM pipelines under natural acoustic conditions.

Notable insights

The benchmark uses naturally occurring audio and native speaker expertise to ensure cultural and linguistic authenticity.
The evaluation reveals substantial performance gaps under natural acoustic conditions, particularly for open-source models and low-resource languages.

Possible limitations

Not stated in the abstract

Abstract

arXiv:2606.08194v1

Large Audio-Language Models (LALMs) integrate audio perception and language understanding within a unified framework, enabling a wide range of real-world applications. Despite recent advances, evaluation for LALMs remains heavily underspecified relative to real-world requirements: most lack true linguistic and cultural authenticity, while others fail to capture acoustic realism. To bridge this gap, we propose GlobeAudio, a multilingual and multicultural benchmark designed to evaluate naturalistic audio understanding. GlobeAudio consists of 5,637 multiple-choice questions across six typologically diverse languages, expertly crafted by native speakers grounded on naturally occurring audio. In order to do well, models must possess higher-level auditory reasoning skills and culturally grounded interpretation. We systematically evaluate representative closed-source and open-source LALMs, as well as cascaded ASR-LLM pipelines. Our experiments reveal substantial performance gaps under natural acoustic conditions, particularly for open-source models and low-resource languages. These findings highlight critical limitations of current LALMs and underscore the importance of naturalistic audio evaluation for future audio-language systems. GlobeAudio can be found at https://huggingface.co/datasets/iNLP-Lab/GlobeAudio .
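
Since the benchmark is multiple-choice and the reported gaps are broken down by language, one plausible way to score a model is per-language accuracy plus a macro average over languages. The sketch below is only an illustration under stated assumptions: the record keys (language, answer, prediction), the placeholder language labels, and the toy records are hypothetical and are not taken from the GlobeAudio release, and the abstract does not say which metric or aggregation the authors use.

```python
from collections import defaultdict


def accuracy_by_language(records):
    """Return {language: accuracy} for multiple-choice predictions.

    Each record is a dict with the hypothetical keys 'language',
    'answer' (gold option letter) and 'prediction' (model's option letter).
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for record in records:
        lang = record["language"]
        total[lang] += 1
        # Compare option letters case-insensitively, ignoring stray whitespace.
        gold = record["answer"].strip().upper()
        guess = record["prediction"].strip().upper()
        if gold == guess:
            correct[lang] += 1
    return {lang: correct[lang] / total[lang] for lang in total}


def macro_average(per_language):
    """Unweighted mean over languages, so small languages count equally."""
    return sum(per_language.values()) / len(per_language)


# Toy records, invented for illustration only (not benchmark data).
toy_records = [
    {"language": "lang_a", "answer": "A", "prediction": "A"},
    {"language": "lang_a", "answer": "B", "prediction": "C"},
    {"language": "lang_b", "answer": "D", "prediction": "d "},
]

if __name__ == "__main__":
    scores = accuracy_by_language(toy_records)
    for lang, acc in sorted(scores.items()):
        print(f"{lang}: {acc:.2f}")
    print(f"macro average: {macro_average(scores):.2f}")
```

A macro average is used here so that a language with fewer questions is not drowned out by larger ones. A micro average over all 5,637 questions would be an equally reasonable choice.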