XLGoBench: Detecting cross-lingual skill gaps with algorithmic tasks
Purvam Jain, Preethi Jyothi, Vihari Piratla, Suvrat Raju
Why It Matters
A language model can perform the same underlying task unevenly depending on the language of the prompt, which affects both the performance and the fairness of multilingual applications.
XLGoBench exposes these cross-lingual disparities with algorithmic tasks whose answers can be checked objectively.
Summary
The paper presents XLGoBench, a benchmark of synthetic algorithmic tasks designed to identify cross-lingual skill gaps in large language models by evaluating their performance across different languages.
Key contributions
- Introduction of a novel benchmark for evaluating cross-lingual performance in language models.
- Development of synthetic algorithmic tasks that are scalable and quantifiable.
- Demonstration of persistent cross-lingual gaps in state-of-the-art models through extensive experiments.
Notable insights
- Each task can be generated at varying levels of complexity, so the benchmark can be adapted to models with different capabilities.
- The use of simple templates for task generation promotes transparency and ease of auditing for translation errors.
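To make the template idea concrete, here is a minimal sketch of how one task instance might be generated in several languages from a shared seed. The sorting task, the three template strings, and the names `TEMPLATES`, `make_task`, and `is_correct` are illustrative assumptions; the abstract does not specify the actual tasks, languages, or templates.

```python
import random
import re

# Hypothetical templates: one short, auditable string per language.
# The paper's real templates and language set are not given in the abstract.
TEMPLATES = {
    "en": "Sort the following numbers in ascending order: {items}",
    "es": "Ordena los siguientes números en orden ascendente: {items}",
    "fr": "Triez les nombres suivants par ordre croissant : {items}",
}


def make_task(n_items, seed):
    """Build one task instance rendered in every language.

    n_items is the complexity knob: longer lists make a harder instance.
    The same numbers are used in every language, so the underlying
    task is identical and only the surface language changes.
    """
    rng = random.Random(seed)
    numbers = rng.sample(range(100), n_items)
    items = ", ".join(str(x) for x in numbers)
    prompts = {lang: tpl.format(items=items) for lang, tpl in TEMPLATES.items()}
    return prompts, sorted(numbers)


def is_correct(model_output, answer):
    """Objective check: the integers in the output must equal the answer."""
    return [int(tok) for tok in re.findall(r"-?\d+", model_output)] == answer


prompts, answer = make_task(n_items=5, seed=0)
for lang, prompt in prompts.items():
    print(lang, "|", prompt)
```

Because each template is a single short sentence, a speaker of the language can audit it for translation errors directly, and raising `n_items` scales the difficulty without touching the templates.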
Possible limitations
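- Because the tasks are algorithmic, differential performance is a sufficient but not necessary indicator of cross-lingual gaps: a model that performs equally well across languages on this benchmark may still have gaps on other kinds of tasks.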
Abstract
arXiv:2605.30788v1
We introduce a set of synthetic algorithmic tasks to detect cross-lingual gaps in the abilities of large language models. Our benchmark is commensurate across languages, since it requires models to perform the same underlying task in different languages; scalable, since each task can be generated at varying levels of complexity allowing it to be adapted to models with different capabilities; quantifiable, since every task admits an objective notion of correctness; and transparent, since tasks are generated from simple templates that can be readily audited for translation errors. Because our benchmark focuses on algorithmic tasks, differential performance is a sufficient -- but not necessary -- indicator of cross-lingual gaps. Nevertheless, we show through extensive experiments that our benchmark exposes persistent cross-lingual gaps in multiple state-of-the-art models.
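Since every task has an objective notion of correctness, differential performance can be summarized as a per-language accuracy difference. The sketch below is one simple way to do that. The abstract does not say how the authors quantify gaps, so the metric, the choice of a reference language, and the function names are all assumptions.

```python
def accuracy_by_language(results):
    """Compute accuracy per language.

    results: iterable of (language, is_correct) pairs, one per graded
    task instance. The pair format is an assumption of this sketch.
    """
    totals, hits = {}, {}
    for lang, ok in results:
        totals[lang] = totals.get(lang, 0) + 1
        hits[lang] = hits.get(lang, 0) + int(ok)
    return {lang: hits[lang] / totals[lang] for lang in totals}


def gaps_vs_reference(accuracy, reference="en"):
    """Accuracy drop of each language relative to a reference language.

    A positive value means the model does worse in that language than
    in the reference on the same underlying tasks.
    """
    ref = accuracy[reference]
    return {lang: ref - acc for lang, acc in accuracy.items() if lang != reference}


# Toy graded outcomes, invented purely to exercise the functions.
toy_results = [("en", True), ("en", True), ("hi", True), ("hi", False)]
acc = accuracy_by_language(toy_results)
print(gaps_vs_reference(acc))
```

A nonzero gap on identical tasks is evidence of a cross-lingual gap; a zero gap here does not rule one out elsewhere, which matches the abstract's "sufficient but not necessary" caveat.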