EvalMORAAL: Interpretable Chain-of-Thought and LLM-as-Judge Evaluation for Moral Alignment in Large Language Models

Hadi Mohammadi, Anastasia Giachanou, Robert A. Bagheri

Published May 22, 2026 | Featured #3 | In the daily list May 23, 2026


Daily score: 71.3

Editorial review: 7.5

Relevance: 0.455

Freshness: 0.722

Why It Matters

What makes this one worth your time

Understanding moral alignment in AI is crucial for developing culturally aware systems that can operate effectively across diverse global contexts.

EvalMORAAL offers a framework for assessing moral alignment in large language models and surfaces a persistent alignment gap between Western and non-Western regions.

Summary

The paper introduces EvalMORAAL, a framework that evaluates moral alignment in 20 large language models using two scoring methods (log-probabilities and direct ratings) and a model-as-judge peer review. Alignment with World Values Survey and PEW Global Attitudes Survey responses differs clearly by region: Western regions average $r=0.82$, while non-Western regions average $r=0.61$.

Key contributions

Development of a transparent chain-of-thought framework for moral alignment evaluation.
Implementation of two scoring methods for fair comparison across models.
Identification of significant regional differences in model alignment with global survey data.

Notable insights

The use of a model-as-judge peer review system adds a layer of interpretability and accountability to the evaluation process.
The structured chain-of-thought protocol with self-consistency checks enhances the reliability of the scoring methods.
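The abstract does not spell out how the self-consistency checks work, so the following is only a minimal sketch of one common reading: sample several ratings for the same prompt, average them, and flag the item when the samples disagree too much. The function name `self_consistency`, the `max_spread` tolerance, and the rating scale are all hypothetical and are not taken from the paper.

```python
from statistics import mean, pstdev


def self_consistency(ratings, max_spread=0.25):
    """Aggregate repeated ratings for one prompt (illustrative assumption only).

    ratings    -- numeric scores from several sampled chain-of-thought runs
    max_spread -- hypothetical tolerance on the population standard deviation
    """
    if not ratings:
        raise ValueError("need at least one sampled rating")
    spread = pstdev(ratings)  # 0.0 when all samples agree
    return {
        "score": mean(ratings),               # aggregated rating
        "spread": spread,                     # disagreement between samples
        "consistent": spread <= max_spread,   # False means: inspect this item
    }
```

Calling `self_consistency([0.5, 0.5, 0.5])` marks the item as consistent, while widely scattered samples such as `[-1.0, 0.0, 1.0]` would be flagged for review.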

Possible limitations

Not stated in the abstract.

Abstract

arXiv:2510.05942v3

We present EvalMORAAL, a transparent chain-of-thought (CoT) framework that uses two scoring methods (log-probabilities and direct ratings) plus a model-as-judge peer review to evaluate moral alignment in 20 large language models. We assess models on the World Values Survey (55 countries, 19 topics) and the PEW Global Attitudes Survey (39 countries, 8 topics). With EvalMORAAL, top models align closely with survey responses (Pearson's $r \approx 0.90$ on WVS). Yet we find a clear regional difference: Western regions average $r=0.82$ while non-Western regions average $r=0.61$ (a 0.21 absolute gap), indicating a persistent regional alignment gap. Our framework adds three parts: (1) two scoring methods for all models to enable fair comparison, (2) a structured CoT protocol with self-consistency checks, and (3) a model-as-judge peer review that flags 348 conflicts using a data-driven threshold. Peer agreement relates to WVS survey alignment ($r=0.74$, $p<.001$; PEW $r=0.39$, n.s.), supporting automated quality checks. These results show real progress toward culture-aware AI while highlighting open challenges for use across regions.
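The alignment figures in the abstract are Pearson correlations between model scores and survey responses, compared across regions. As a rough illustration, the sketch below computes Pearson's r and a regional gap from per-country correlations. The helper names `pearson_r` and `regional_gap`, the region labels, and the idea of averaging per-country values within a region are assumptions made for this example; the paper's actual aggregation procedure is not described in the abstract.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(vx * vy)


def regional_gap(per_country_r, region_of):
    """Average per-country r within each region (assumed aggregation).

    per_country_r -- dict mapping country -> correlation with survey data
    region_of     -- dict mapping country -> "Western" or "non-Western"
    Returns (region means, Western mean minus non-Western mean).
    """
    groups = {}
    for country, r in per_country_r.items():
        groups.setdefault(region_of[country], []).append(r)
    means = {region: sum(v) / len(v) for region, v in groups.items()}
    return means, means["Western"] - means["non-Western"]
```

With the averages reported in the abstract, the gap is simply 0.82 - 0.61 = 0.21; the functions above only show how such averages could be derived from per-country scores.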