UA-Legal-Bench: A Benchmark for Evaluating Large Language Models on Ukrainian Legal Reasoning
Volodymyr Ovcharov
Why It Matters
What makes this one worth your time
Legal NLP benchmarks are overwhelmingly English-centric, so failure modes in morphologically rich, non-Latin-script languages tend to go undetected. This benchmark addresses that gap for Ukrainian.
UA-Legal-Bench provides a five-task testbed for assessing LLMs in Ukrainian legal contexts.
Summary
The paper introduces UA-Legal-Bench, a benchmark that evaluates large language models on Ukrainian legal reasoning through five tasks. It is built from the Unified State Register of Court Decisions (EDRSR), an open judicial corpus of 99.5 million decisions.
Key contributions
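- Introduction of a five-task benchmark for Ukrainian legal reasoning: case-type classification, judgment form classification, case-outcome prediction, legal norm extraction, and cause category prediction.
- Evaluation of 11 large language models (3B to 675B parameters, five families) under zero-shot and 3-shot prompting.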
- Release of data, prompts, and model predictions to facilitate further research.
Notable insights
- Few-shot effects are sharply task-dependent: 3-shot prompting improves judgment form classification by up to +38.6 pp but has mixed effects on case-outcome prediction.
- Accuracy is misleading on imbalanced legal tasks: the model with the highest case-outcome prediction accuracy (62%) is a majority-class predictor with 23% macro-F1, while the genuinely best model reaches only 44% macro-F1.
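The gap between accuracy and macro-F1 can be reproduced on a toy example. The sketch below uses synthetic labels for a hypothetical six-class task (the class names and counts are invented for illustration and are not taken from the benchmark) and scores a predictor that always outputs the most frequent class. The helper functions `accuracy` and `macro_f1` are written here from the standard definitions and are not the paper's evaluation code.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of examples whose predicted label equals the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over every label seen in either list."""
    labels = sorted(set(y_true) | set(y_pred))
    f1_scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        # A class that is never predicted correctly contributes F1 = 0.
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)


# Synthetic, imbalanced label distribution (assumption: illustrative only).
class_counts = {
    "outcome_0": 62,
    "outcome_1": 10,
    "outcome_2": 8,
    "outcome_3": 8,
    "outcome_4": 6,
    "outcome_5": 6,
}
y_true = [label for label, n in class_counts.items() for _ in range(n)]

# A majority-class predictor ignores the input and always emits the top class.
majority_label = Counter(y_true).most_common(1)[0][0]
y_pred = [majority_label] * len(y_true)

acc = accuracy(y_true, y_pred)
mf1 = macro_f1(y_true, y_pred)
print(f"accuracy = {acc:.3f}, macro-F1 = {mf1:.3f}")
```

On this toy distribution the predictor is right on every majority-class example, so accuracy equals the majority share, while five of the six per-class F1 scores are zero and pull the macro average far below it. The exact figures depend on the invented class counts and are not meant to match the paper's numbers.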
Possible limitations
- Not stated in the abstract.
Abstract
arXiv:2605.29170v1
Legal NLP benchmarks are overwhelmingly English-centric, leaving failure modes in morphologically rich, non-Latin-script languages undetected. We introduce UA-Legal-Bench, a five-task benchmark for evaluating large language models on Ukrainian legal reasoning, built from the Unified State Register of Court Decisions (EDRSR), one of the world's largest open judicial corpora (99.5 million decisions). The benchmark comprises: (1) case-type classification (4 classes, n=2,000), (2) judgment form classification (4 classes, n=2,000), (3) case-outcome prediction (COP; 6 classes, n=800), (4) legal norm extraction (n=1,794), and (5) cause category prediction (22 classes, n=1,871). We evaluate 11 LLMs (3B to 675B parameters) from five families under zero-shot and 3-shot prompting via AWS Bedrock with 158K API calls. Our results reveal sharply task-dependent few-shot effects: few-shot prompting improves judgment form classification by up to +38.6 pp but has mixed effects on outcome prediction. We show that accuracy is misleading on imbalanced legal tasks: the model with the highest COP accuracy (62%) is a majority-class predictor (macro-F1: 23%), while the genuinely best model scores only 44% macro-F1. Within-family scaling analysis reveals that 8B models can match frontier performance on surface-level tasks, but scaling thresholds vary dramatically across families. We release all data, prompts, and model predictions.