LLMs as annotators of credibility assessment in Danish asylum decisions: evaluating classification performance and errors beyond aggregated metrics

Galadrielle Humblot-Renaux, Mohammad N. S. Jahromi, Rohat Bakuri-J{\o}rgensen, Marieke Anne Heyl, Asta S. Stage Jarlner, Maria Vlachou, Anna Murphy H{\o}genhaug, Desmond Elliott, Thomas Gammeltoft-Hansen, Thomas B. Moeslund

Published May 14, 2026

Open on arXiv Read PDF

Editorial review7.0

Relevance0.488

Freshness0.000

Why It Matters

What makes this one worth your time

Understanding LLM performance in specialized legal domains can improve automated text annotation, potentially reducing costs and increasing efficiency in legal processes.

The study evaluates LLMs for annotating credibility in Danish asylum decisions, highlighting their potential and limitations.

Summary

The paper investigates the use of large language models (LLMs) for annotating credibility assessments in Danish asylum decision texts, introducing a new dataset called RAB-Cred. It evaluates the performance of 21 models and 30 prompt combinations in zero-shot and few-shot settings, analyzing errors and inconsistencies in LLM annotations.

Key contributions

Introduction of the RAB-Cred dataset for Danish legal text classification.
Benchmarking 21 models and 30 prompt combinations for credibility assessment in asylum decisions.
Analysis of error consistency and inter-class confusion in LLM annotations.

Notable insights

The study systematically evaluates error consistency across LLMs, providing insights into inter-class confusion and correlation with human confidence.
The research highlights the importance of evaluating beyond aggregated metrics to understand LLM performance in specialized tasks.

Possible limitations

Not stated in the abstract

Abstract

arXiv:2605.13412v1 Announce Type: cross Abstract: Off-the-shelf large language models (LLMs) are increasingly used to automate text annotation, yet their effectiveness remains underexplored for underrepresented languages and specialized domains where the class definition requires subtle expert understanding. We investigate LLM-based annotation for a novel legal NLP task: identifying the presence and sentiment of credibility assessments in asylum decision texts. We introduce RAB-Cred, a Danish text classification dataset featuring high-quality, expert annotations and valuable metadata such as annotator confidence and asylum case outcome. We benchmark 21 open-weight models and 30 system-user prompt combinations for this task, and systematically evaluate the effect of model and prompt choice for zero-shot and few-shot classification. We zoom in on the errors made by top-performing models and prompts, investigating error consistency across LLMs, inter-class confusion, correlation with human confidence and sample-wise difficulty and severity of LLM mistakes. Our results confirm the potential of LLMs for cost-effective labeling of asylum decisions, but highlight the imperfect and inconsistent nature of LLM annotators, and the need to look beyond the predictions of a single, arbitrarily chosen model. The RAB-Cred dataset and code are available at https://github.com/glhr/RAB-Cred