UA-Legal-Bench: A Benchmark for Evaluating Large Language Models on Ukrainian Legal Reasoning

Volodymyr Ovcharov

Published May 29, 2026

Why It Matters

What makes this one worth your time

Legal NLP benchmarks are overwhelmingly English-centric, so failure modes in morphologically rich, non-Latin-script languages go undetected. This benchmark addresses that gap for Ukrainian, which matters for building AI systems that must operate in diverse legal environments.

UA-Legal-Bench offers a public, reusable basis for assessing LLMs in Ukrainian legal contexts.

Summary

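The paper introduces UA-Legal-Bench, a five-task benchmark for evaluating large language models on Ukrainian legal reasoning. The tasks are case-type classification, judgment form classification, case-outcome prediction, legal norm extraction, and cause category prediction. The data comes from the Unified State Register of Court Decisions (EDRSR), an open judicial corpus of 99.5 million decisions.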

Key contributions

Introduction of a comprehensive benchmark for Ukrainian legal reasoning.
Evaluation of 11 large language models (3B–675B) from five families on all five tasks, under zero-shot and 3-shot prompting.
Release of data, prompts, and model predictions to facilitate further research.

Notable insights

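Few-shot effects are sharply task-dependent: 3-shot prompting improves judgment form classification by up to +38.6 pp but has mixed effects on case-outcome prediction.
Accuracy can be misleading on imbalanced legal tasks: the model with the highest case-outcome accuracy (62%) is a majority-class predictor with a macro-F1 of only 23%, while the genuinely best model reaches 44% macro-F1. Macro-F1 is therefore the more informative metric here.
Within a model family, 8B models can match frontier performance on surface-level tasks, but the scale at which this happens varies widely across families.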
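
The accuracy point above can be illustrated with a small sketch. The label names and the class distribution below are invented for illustration and are not taken from the benchmark; only the metric definitions are standard. A predictor that always outputs the most frequent class gets respectable accuracy on a skewed six-class task, yet its macro-F1 is low because five of the six per-class F1 scores are zero.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the gold label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over every label seen in gold or predictions."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class is never hit.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical, imbalanced 6-class gold labels (NOT the benchmark's real data).
y_true = (
    ["granted"] * 60
    + ["denied"] * 15
    + ["partial"] * 10
    + ["dismissed"] * 8
    + ["returned"] * 4
    + ["other"] * 3
)

# A degenerate "model" that always predicts the majority class.
y_majority = ["granted"] * len(y_true)

acc = accuracy(y_true, y_majority)
mf1 = macro_f1(y_true, y_majority)
print(f"accuracy={acc:.3f} macro_f1={mf1:.3f}")
```

On this toy data the majority predictor scores 60% accuracy, but macro-F1 averages one non-zero class score with five zeros, which is why it exposes the degenerate behaviour that accuracy hides.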

Possible limitations

Not stated in the abstract.

Abstract

arXiv:2605.29170v1

Legal NLP benchmarks are overwhelmingly English-centric, leaving failure modes in morphologically rich, non-Latin-script languages undetected. We introduce UA-Legal-Bench, a five-task benchmark for evaluating large language models on Ukrainian legal reasoning, built from the Unified State Register of Court Decisions (EDRSR), one of the world's largest open judicial corpora (99.5 million decisions). The benchmark comprises: (1) case-type classification (4 classes, n=2,000), (2) judgment form classification (4 classes, n=2,000), (3) case-outcome prediction (COP; 6 classes, n=800), (4) legal norm extraction (n=1,794), and (5) cause category prediction (22 classes, n=1,871). We evaluate 11 LLMs (3B–675B) from five families under zero-shot and 3-shot prompting via AWS Bedrock with 158K API calls. Our results reveal sharply task-dependent few-shot effects: few-shot prompting improves judgment form classification by up to +38.6 pp but has mixed effects on outcome prediction. We show that accuracy is misleading on imbalanced legal tasks: the model with highest COP accuracy (62%) is a majority-class predictor (macro-F1: 23%), while the genuinely best model scores only 44% macro-F1. Within-family scaling analysis reveals that 8B models can match frontier performance on surface-level tasks but scaling thresholds vary dramatically across families. We release all data, prompts, and model predictions.
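
To make the zero-shot versus 3-shot distinction concrete, here is a minimal sketch of how the two prompt variants for a classification task might be assembled. Everything in it is an assumption for illustration: the `build_prompt` helper, the instruction wording, the label names, and the example texts are hypothetical and are not the prompts released with the benchmark.

```python
def build_prompt(instruction, labels, text, examples=()):
    """Assemble a classification prompt.

    With no examples this is a zero-shot prompt; passing k (text, label)
    pairs makes it a k-shot prompt. The layout is an illustrative choice.
    """
    parts = [instruction, "Labels: " + ", ".join(labels), ""]
    for example_text, example_label in examples:
        parts += [f"Decision: {example_text}", f"Label: {example_label}", ""]
    # The query comes last, with the label left open for the model to fill in.
    parts += [f"Decision: {text}", "Label:"]
    return "\n".join(parts)


# Hypothetical label set and demonstrations (placeholders, not benchmark data).
LABELS = ["decision", "ruling", "resolution", "verdict"]
INSTRUCTION = "Classify the judgment form of the following court document."
DEMOS = [
    ("<text of demonstration document 1>", "ruling"),
    ("<text of demonstration document 2>", "verdict"),
    ("<text of demonstration document 3>", "decision"),
]
QUERY = "<text of the document to classify>"

zero_shot = build_prompt(INSTRUCTION, LABELS, QUERY)
three_shot = build_prompt(INSTRUCTION, LABELS, QUERY, examples=DEMOS)

print(three_shot)
```

The only difference between the two settings is the block of labeled demonstrations placed before the query, so any gain or loss reported for 3-shot prompting is attributable to those demonstrations rather than to a change in the task instruction.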