Lost in Translation: Do LVLM Judges Generalize Across Languages?
Md Tahmid Rahman Laskar, Mohammed Saidul Islam, Mir Tafseer Nayeem, Amran Bhuiyan, Mizanur Rahman, Shafiq Joty, Enamul Hoque, Jimmy Huang
Why It Matters
Reward models and other LVLM judges are almost exclusively assessed on English-centric benchmarks, so their reliability in other languages is largely untested.
This study reports substantial cross-lingual performance variance across the 22 large vision-language models it evaluates.
Summary
The paper introduces MM-JudgeBench, a large-scale benchmark of over 60K pairwise preference instances in 25 languages for evaluating multilingual, multimodal judge models. Evaluating 22 LVLMs on it reveals substantial cross-lingual performance variance and exposes limitations in current reward modeling.
Key contributions
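- MM-JudgeBench, described by the authors as the first large-scale benchmark for multilingual, multimodal judge model evaluation, with over 60K pairwise preference instances.
- Evaluation of 22 LVLMs (15 open-source, 7 proprietary) across 25 typologically diverse languages, revealing substantial cross-lingual performance variance.
- Release of a multilingual training set, disjoint from the evaluation data, to support domain adaptation.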
Notable insights
- Model size and architecture do not reliably predict multilingual robustness.
- State-of-the-art LVLM judges show inconsistent behavior across different languages.
Possible limitations
- Not stated in the abstract.
Abstract
arXiv:2604.19405v1
Automatic evaluators such as reward models play a central role in the alignment and evaluation of large vision-language models (LVLMs). Despite their growing importance, these evaluators are almost exclusively assessed on English-centric benchmarks, leaving open the question of how well these evaluators generalize across languages. To answer this question, we introduce MM-JudgeBench, the first large-scale benchmark for multilingual and multimodal judge model evaluation, which includes over 60K pairwise preference instances spanning 25 typologically diverse languages. MM-JudgeBench integrates two complementary subsets: a general vision-language preference evaluation subset extending VL-RewardBench, and a chart-centric visual-text reasoning subset derived from OpenCQA, enabling systematic analysis of reward models (i.e., LVLM judges) across diverse settings. We additionally release a multilingual training set derived from MM-RewardBench, disjoint from our evaluation data, to support domain adaptation. By evaluating 22 LVLMs (15 open-source, 7 proprietary), we uncover substantial cross-lingual performance variance in our proposed benchmark. Our analysis further shows that model size and architecture are poor predictors of multilingual robustness, and that even state-of-the-art LVLM judges exhibit inconsistent behavior across languages. Together, these findings expose fundamental limitations of current reward modeling and underscore the necessity of multilingual, multimodal benchmarks for developing reliable automated evaluators.
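Illustrative sketch
The abstract does not say which metric or data format the benchmark uses. As a rough illustration of what "cross-lingual performance variance" could mean for a pairwise judge, the Python sketch below assumes a hypothetical record schema (`language`, `preferred`, `judge_choice`) and uses plain accuracy as the score. The language codes and toy records are invented for the example and are not taken from MM-JudgeBench.

```python
from collections import defaultdict
from statistics import pstdev


def per_language_accuracy(records):
    """Fraction of pairwise instances where the judge picked the preferred response.

    `records` is an iterable of dicts with the hypothetical keys
    'language', 'preferred', and 'judge_choice' (an assumed schema,
    not the benchmark's actual format).
    """
    hits = defaultdict(list)
    for r in records:
        hits[r["language"]].append(r["judge_choice"] == r["preferred"])
    return {lang: sum(v) / len(v) for lang, v in hits.items()}


def cross_lingual_spread(accuracy_by_language):
    """Summarize how much accuracy differs between languages."""
    scores = list(accuracy_by_language.values())
    return {
        "min": min(scores),
        "max": max(scores),
        "gap": max(scores) - min(scores),  # best minus worst language
        "std": pstdev(scores),             # population standard deviation
    }


# Toy records, invented for illustration only.
records = [
    {"language": "en", "preferred": "A", "judge_choice": "A"},
    {"language": "en", "preferred": "B", "judge_choice": "B"},
    {"language": "bn", "preferred": "A", "judge_choice": "B"},
    {"language": "bn", "preferred": "B", "judge_choice": "B"},
]

acc = per_language_accuracy(records)
spread = cross_lingual_spread(acc)
print(acc)
print(spread)
```

A large gap or standard deviation for a single judge would correspond to the inconsistent cross-language behavior the abstract describes. The paper itself may define and aggregate these quantities differently.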