Expert Evaluation of LLM's Open-Ended Legal Reasoning on the Japanese Bar Exam Writing Task

Jungmin Choi, Keisuke Sakaguchi, Hiroaki Yamada

Published Apr 29, 2026. Featured #6 in the daily list, Apr 30, 2026.

Daily score: 70.2

Editorial review: 7.5

Relevance: 0.453

Freshness: 0.722

Why It Matters

What makes this one worth your time

Understanding LLMs' capabilities and limitations in legal reasoning is crucial for developing reliable AI tools in legal contexts, especially in jurisdictions with unique legal frameworks like Japan.

The paper presents what its authors describe as the first dataset for assessing LLMs' open-ended legal reasoning on the writing component of the Japanese bar exam.

Summary

This study introduces a dataset, built from the writing component of the Japanese bar exam, for evaluating large language models' open-ended legal reasoning. Legal experts manually evaluate the model-generated responses to assess performance and identify limitations.

Key contributions

Creation of the first dataset for evaluating LLMs on the Japanese bar exam's open-ended writing tasks.
Manual evaluation of LLM responses by legal experts to identify reasoning limitations.
Characterization of hallucinations in LLM outputs related to legal content.

Notable insights

The study highlights the specific challenges LLMs face in generating structured legal arguments from complex narratives.
The manual analysis of hallucinations characterizes when and how LLMs introduce content that is not supported by precedent or law.

Possible limitations

Not stated in the abstract.

Abstract

arXiv:2604.23730v1

Large language models (LLMs) have shown strong performance on legal benchmarks, including multiple-choice components of bar exams. However, their capacity for generating open-ended legal reasoning in realistic scenarios remains insufficiently explored. Notably, to the best of our knowledge, there are no prior studies or datasets addressing this issue in the Japanese context. This study presents the first dataset designed to evaluate the open-ended legal reasoning performance of LLMs within the Japanese jurisdiction. The dataset is based on the writing component of the Japanese bar examination, which requires examinees to identify multiple legal issues from long narratives and to construct structured legal arguments in free-text format. Our key contribution is the manual evaluation of LLMs' generated responses by legal experts, which reveals limitations and challenges in legal reasoning. Moreover, we conducted a manual analysis of hallucinations to characterize when and how the models introduce content not supported by precedent or law. Our real exam questions, model-generated responses, and expert evaluations reveal the milestones of current LLMs in the Japanese legal domain. Our dataset and relevant resources will be available online.