XLGoBench: Detecting cross-lingual skill gaps with algorithmic tasks

Purvam Jain, Preethi Jyothi, Vihari Piratla, Suvrat Raju

Published Jun 1, 2026 | Featured #3 | In the daily list Jun 2, 2026


Daily score: 71.0

Editorial review: 7.5

Relevance: 0.450

Freshness: 0.722

Why It Matters

What makes this one worth your time

Understanding cross-lingual skill gaps is crucial for improving the performance and fairness of language models in multilingual applications.

XLGoBench reveals cross-lingual performance disparities in large language models through algorithmic tasks.

Summary

The paper presents XLGoBench, a benchmark of synthetic algorithmic tasks designed to identify cross-lingual skill gaps in large language models by evaluating their performance across different languages.

Key contributions

Introduction of a novel benchmark for evaluating cross-lingual performance in language models.
Development of synthetic algorithmic tasks that are scalable and quantifiable.
Demonstration of persistent cross-lingual gaps in state-of-the-art models through extensive experiments.

Notable insights

Each task can be generated at varying levels of complexity, so the benchmark can be adapted to models with different capabilities.
Tasks are generated from simple templates, which keeps the benchmark transparent and makes translation errors easy to audit.

Possible limitations

Not stated in the abstract.

Abstract

arXiv:2605.30788v1

We introduce a set of synthetic algorithmic tasks to detect cross-lingual gaps in the abilities of large language models. Our benchmark is commensurate across languages, since it requires models to perform the same underlying task in different languages; scalable, since each task can be generated at varying levels of complexity allowing it to be adapted to models with different capabilities; quantifiable, since every task admits an objective notion of correctness; and transparent, since tasks are generated from simple templates that can be readily audited for translation errors. Because our benchmark focuses on algorithmic tasks, differential performance is a sufficient -- but not necessary -- indicator of cross-lingual gaps. Nevertheless, we show through extensive experiments that our benchmark exposes persistent cross-lingual gaps in multiple state-of-the-art models.
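
The four properties named in the abstract can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the task (summing a list of integers), the two templates, and the names `TEMPLATES`, `make_task`, `is_correct`, and `accuracy_gap` are all hypothetical. The same instance is rendered in each language (commensurate), `n_terms` sets the difficulty (scalable), the answer is checked exactly (quantifiable), and the templates are short enough to audit by eye (transparent).

```python
import random

# Hypothetical templates, not taken from the paper: one prompt per language
# for the same underlying task (summing a list of integers).
TEMPLATES = {
    "en": "What is the sum of the following numbers: {numbers}?",
    "es": "¿Cuál es la suma de los siguientes números: {numbers}?",
}


def make_task(n_terms, seed):
    """Generate one task instance; n_terms controls the complexity."""
    rng = random.Random(seed)
    numbers = [rng.randint(1, 99) for _ in range(n_terms)]
    rendered = ", ".join(str(x) for x in numbers)
    # The same numbers are rendered into every language's template.
    prompts = {lang: tpl.format(numbers=rendered) for lang, tpl in TEMPLATES.items()}
    return {"numbers": numbers, "answer": sum(numbers), "prompts": prompts}


def is_correct(task, model_output):
    """Objective check: the output must parse to exactly the expected integer."""
    try:
        return int(model_output.strip()) == task["answer"]
    except ValueError:
        return False


def accuracy_gap(results, lang_a, lang_b):
    """Accuracy in lang_a minus accuracy in lang_b.

    results maps a language code to a list of booleans, one per instance.
    """
    acc = {lang: sum(results[lang]) / len(results[lang]) for lang in (lang_a, lang_b)}
    return acc[lang_a] - acc[lang_b]
```

In this sketch, a model would be queried once per language with `task["prompts"][lang]`, each reply scored with `is_correct`, and the per-language score lists passed to `accuracy_gap`. A nonzero gap on identical instances would then point to a cross-lingual difference rather than a difference in task difficulty.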