KVoiceBench, KOpenAudioBench, and KMMAU: Agent-Driven Korean Speech Benchmarks for Evaluating SpeechLMs

Haechan Kim, Seungjun Chung, Inkyu Park, Jihoo Lee, Jonghyun Lee

Published May 28, 2026. Featured #9 in the daily list, May 29, 2026.

Daily score: 69.6

Editorial review: 7.5

Relevance: 0.464

Freshness: 0.722

Why It Matters

SpeechLM evaluation remains heavily centered on English. This work releases three Korean benchmarks and the frameworks used to build them, making it possible to measure Korean speech capabilities and weaknesses that English-only evaluation cannot reveal.

New Korean speech benchmarks enhance evaluation of multilingual SpeechLMs.

Summary

The paper introduces two frameworks for constructing Korean speech benchmarks and presents three specific benchmarks, KVoiceBench, KOpenAudioBench, and KMMAU, aimed at evaluating SpeechLMs in Korean, addressing limitations in multilingual speech evaluation.

Key contributions

Development of KVoiceBench and KOpenAudioBench for Korean SpokenQA.
Creation of KMMAU for Korean audio understanding.
Evaluation of eight SpeechLMs revealing performance gaps and task-specific weaknesses.

Notable insights

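The proposed frameworks target the failure modes of straightforward benchmark transfer (ASR, translation, normalization, TTS), which can corrupt language-specific instructions, answer constraints, and spoken forms, and which cannot preserve target-language speaker attributes, accents, and paralinguistic properties.
Model rankings on SpokenQA and on audio understanding diverge, so neither task family alone characterizes a model's Korean speech ability; the two expose complementary weaknesses.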

Possible limitations

Not stated in the abstract.

Abstract

arXiv:2605.27984v1

Speech language models (SpeechLMs) have achieved substantial progress by extending large language models (LLMs) to the speech modality. However, SpeechLM evaluation remains heavily centered on English, limiting reliable assessment of multilingual speech capabilities. Straightforward benchmark transfer through ASR, translation, normalization, and TTS can corrupt language-specific instructions, answer constraints, and spoken forms; for audio understanding, transferring source-language audio also fails to preserve target-language speaker attributes, accents, and paralinguistic properties. To address these limitations, we propose two human-agent benchmark-construction frameworks: one transfers source-language SpokenQA benchmarks into target-language SpokenQA benchmarks, and the other converts target-language ASR corpora into audio understanding benchmarks using transcriptions and speaker metadata. Using these frameworks, we construct and publicly release three Korean speech benchmarks: KVoiceBench and KOpenAudioBench for Korean SpokenQA, and KMMAU for Korean audio understanding, comprising 12,345 samples in total. We evaluate eight recent SpeechLMs and find that English-Korean performance gaps vary substantially across models and task families, and that SpokenQA and audio understanding rankings diverge, revealing complementary weaknesses invisible to English-only evaluation.
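
The two findings in the abstract (per-model English-Korean gaps, and diverging rankings across task families) can be made concrete with a small sketch. Everything here is an assumption for illustration: the abstract does not say how gaps or ranking divergence were quantified, so this uses a simple score difference and a Spearman rank correlation, and the model names and scores are made-up placeholders, not results from the paper.

```python
from typing import Dict


def gaps(english: Dict[str, float], korean: Dict[str, float]) -> Dict[str, float]:
    """English-minus-Korean score per model; larger means a bigger drop in Korean."""
    return {model: english[model] - korean[model] for model in english}


def ranks(scores: Dict[str, float]) -> Dict[str, int]:
    """Rank models from best (1) to worst. Assumes no tied scores."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {model: position + 1 for position, model in enumerate(ordered)}


def spearman(a: Dict[str, float], b: Dict[str, float]) -> float:
    """Spearman rank correlation of two score tables over the same models (no ties)."""
    rank_a, rank_b = ranks(a), ranks(b)
    n = len(rank_a)
    squared_diffs = sum((rank_a[m] - rank_b[m]) ** 2 for m in rank_a)
    return 1 - 6 * squared_diffs / (n * (n * n - 1))


# Placeholder scores (hypothetical, NOT from the paper).
english_qa = {"model_a": 80.0, "model_b": 75.0, "model_c": 70.0, "model_d": 65.0}
korean_qa = {"model_a": 60.0, "model_b": 70.0, "model_c": 40.0, "model_d": 55.0}
korean_audio = {"model_a": 45.0, "model_b": 50.0, "model_c": 62.0, "model_d": 58.0}

qa_gaps = gaps(english_qa, korean_qa)
rank_agreement = spearman(korean_qa, korean_audio)

if __name__ == "__main__":
    print("English-Korean SpokenQA gap per model:", qa_gaps)
    print("SpokenQA vs audio-understanding rank correlation:", rank_agreement)
```

A correlation near 1 would mean the two task families order the models the same way; a value near zero or below, as with these placeholder scores, is the kind of divergence the abstract describes.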