Evaluation of Automatic Speech Recognition Using Generative Large Language Models

Thibault Bañeras-Roux, Shashi Kumar, Driss Khalil, Sergio Burdisso, Petr Motlicek, Shiran Liu, Mickael Rouvier, Jane Wottawa, Richard Dufour

Published Apr 30, 2026 | Featured #10 | In the daily list May 1, 2026


Daily score: 65.8

Editorial review: 7.2

Relevance: 0.527

Freshness: 0.722

Why It Matters

What makes this one worth your time

This research points toward ASR evaluation methods that track meaning rather than surface word matches, which could in turn guide the development of speech recognition systems that are better aligned with human judgment.

Decoder-based LLMs outperform both WER and embedding-based semantic metrics at ASR evaluation, agreeing more closely with human judgments.

Summary

The paper explores the use of decoder-based large language models (LLMs) for evaluating automatic speech recognition (ASR) systems, comparing their performance to traditional word error rate (WER) and embedding-based semantic metrics. It demonstrates that LLMs achieve higher agreement with human annotators in hypothesis selection and offer a promising direction for semantic and interpretable ASR evaluation.
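To make the WER baseline concrete, here is a minimal sketch of the standard definition: word-level edit distance divided by the number of reference words. The function name and the example sentences are illustrative assumptions, not taken from the paper or from the HATS dataset. The two hypotheses each contain one substituted word, so WER scores them identically even though only one of them changes the meaning.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] is the edit distance between the reference prefix processed
    # so far and the first j words of the hypothesis.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# Illustrative sentences (hypothetical, not from the paper).
reference = "the cat sat on the mat"
hyp_a = "the cat sat on a mat"    # one substitution, meaning largely kept
hyp_b = "the cat sat on the map"  # one substitution, meaning changed

wer_a = word_error_rate(reference, hyp_a)
wer_b = word_error_rate(reference, hyp_b)
```

Both hypotheses receive the same score here, which is the kind of case where a meaning-aware judge could separate candidates that WER cannot.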

Key contributions

Evaluation of decoder-based LLMs for ASR hypothesis selection.
Comparison of LLM embeddings with traditional semantic metrics.
Introduction of qualitative classification of ASR errors using LLMs.

Notable insights

The best LLMs achieve 92-94% agreement with human annotators in hypothesis selection on the HATS dataset, compared to 63% for WER.
Decoder-based LLM embeddings perform comparably to encoder models in semantic distance computation.
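As a rough illustration of semantic distance computation, the sketch below compares two embedding vectors with cosine distance. The choice of cosine distance is an assumption made here for illustration; the abstract does not say which distance the authors use or how embeddings are extracted from the decoder. The vectors are toy values, not model outputs.

```python
import math
from typing import Sequence


def cosine_distance(u: Sequence[float], v: Sequence[float]) -> float:
    """Return 1 - cosine similarity of two equal-length, non-zero vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("vectors must be non-zero")
    return 1.0 - dot / (norm_u * norm_v)


# Toy embeddings (hypothetical): a reference and two candidate transcripts.
reference_vec = [0.9, 0.1, 0.3]
close_vec = [0.8, 0.2, 0.3]
far_vec = [0.1, 0.9, 0.2]

d_close = cosine_distance(reference_vec, close_vec)
d_far = cosine_distance(reference_vec, far_vec)
```

Under this assumption, the hypothesis whose embedding lies closer to the reference embedding would be treated as the semantically better transcript.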

Possible limitations

Not stated in the abstract

Abstract

arXiv:2604.21928v2

Automatic Speech Recognition (ASR) is traditionally evaluated using Word Error Rate (WER), a metric that is insensitive to meaning. Embedding-based semantic metrics are better correlated with human perception, but decoder-based Large Language Models (LLMs) remain underexplored for this task. This paper evaluates their relevance through three approaches: (1) selecting the best hypothesis between two candidates, (2) computing semantic distance using generative embeddings, and (3) qualitative classification of errors. On the HATS dataset, the best LLMs achieve 92-94% agreement with human annotators for hypothesis selection, compared to 63% for WER, also outperforming semantic metrics. Embeddings from decoder-based LLMs show performance comparable to encoder models. Finally, LLMs offer a promising direction for interpretable and semantic ASR evaluation.
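The agreement figures for hypothesis selection can be read as the fraction of hypothesis pairs on which a metric picks the same candidate as the human annotators. The sketch below shows that calculation under this assumed reading. The labels "A" and "B", the function name, and the toy choices are hypothetical; the paper's exact protocol (for example, how ties are handled) is not described in the abstract.

```python
from typing import Sequence


def agreement_rate(metric_choices: Sequence[str],
                   human_choices: Sequence[str]) -> float:
    """Fraction of pairs where the metric picks the same hypothesis as humans."""
    if len(metric_choices) != len(human_choices):
        raise ValueError("both sequences must cover the same pairs")
    if not human_choices:
        raise ValueError("at least one pair is required")
    matches = sum(m == h for m, h in zip(metric_choices, human_choices))
    return matches / len(human_choices)


# Toy data (hypothetical): the preferred hypothesis for four pairs.
human = ["A", "B", "A", "A"]
metric = ["A", "B", "B", "A"]

rate = agreement_rate(metric, human)
```

In this toy example the metric matches the human choice on three of the four pairs.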