EvalMORAAL: Interpretable Chain-of-Thought and LLM-as-Judge Evaluation for Moral Alignment in Large Language Models
Hadi Mohammadi, Anastasia Giachanou, Robert A. Bagheri
Why It Matters
Language models are used across countries whose moral attitudes differ, so measuring how well a model reflects each population's views is a prerequisite for building culturally aware systems.
EvalMORAAL offers a framework for that measurement, grounded in two cross-national surveys, and reports a measurable alignment gap between Western and non-Western regions.
Summary
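The paper introduces EvalMORAAL, a chain-of-thought framework that evaluates moral alignment in 20 large language models using two scoring methods (log-probabilities and direct ratings) and a model-as-judge peer review. Measured against the World Values Survey (WVS) and the PEW Global Attitudes Survey, top models reach Pearson's $r \approx 0.90$ on WVS, but alignment is lower in non-Western regions (average $r=0.61$) than in Western regions (average $r=0.82$).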
Key contributions
- Development of a transparent chain-of-thought framework for moral alignment evaluation.
- Two scoring methods (log-probabilities and direct ratings) applied to all models to enable fair comparison.
- Identification of a regional alignment gap against survey data: Western regions average $r=0.82$, non-Western regions average $r=0.61$.
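The two scoring methods can be pictured with a minimal sketch. The abstract does not say how log-probabilities or ratings are converted into comparable scores, so everything here is an assumption made for illustration: the function names, the two-option framing ("acceptable" vs. "unacceptable"), the 1 to 10 rating scale, and the common $[-1, 1]$ target range are hypothetical and not taken from the paper.

```python
import math


def logprob_score(logp_acceptable: float, logp_unacceptable: float) -> float:
    """Hypothetical log-probability score in [-1, 1].

    Assumes the model assigns a log-probability to each of two opposing
    completions. The two values are normalised with a softmax and the
    difference of the resulting probabilities is returned.
    """
    # Subtract the max before exponentiating for numerical stability.
    m = max(logp_acceptable, logp_unacceptable)
    p_acc = math.exp(logp_acceptable - m)
    p_unacc = math.exp(logp_unacceptable - m)
    return (p_acc - p_unacc) / (p_acc + p_unacc)


def direct_score(rating: float, low: float = 1.0, high: float = 10.0) -> float:
    """Hypothetical direct-rating score: rescale [low, high] to [-1, 1]."""
    return 2.0 * (rating - low) / (high - low) - 1.0
```

Putting both methods on the same scale is one way to make scores comparable across models that expose log-probabilities and models that only return text; the paper may do this differently.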
Notable insights
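- The model-as-judge peer review flags 348 conflicts using a data-driven threshold. Peer agreement correlates with WVS alignment ($r=0.74$, $p<.001$) but not significantly with PEW alignment ($r=0.39$), which the authors read as support for automated quality checks.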
- The structured chain-of-thought protocol with self-consistency checks enhances the reliability of the scoring methods.
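As a rough illustration of what flagging conflicts with a data-driven threshold could look like, the sketch below compares every pair of models on the items they both scored and flags the pairs whose disagreement exceeds an empirical quantile of all observed disagreements. This is an assumption, not the paper's procedure: the abstract does not define a conflict or the threshold, and `flag_conflicts`, the quantile parameter `q`, and the input layout are hypothetical.

```python
from itertools import combinations


def flag_conflicts(scores, q=0.9):
    """Flag model pairs that disagree strongly on the same item.

    scores: {model_name: {item: score}} (hypothetical layout).
    Returns (threshold, conflicts), where the threshold is taken from the
    data as an empirical q-quantile of all absolute score differences and
    conflicts lists (model_a, model_b, item) with a larger difference.
    """
    diffs = []
    for a, b in combinations(sorted(scores), 2):
        # Only compare items that both models scored.
        for item in scores[a].keys() & scores[b].keys():
            diffs.append((abs(scores[a][item] - scores[b][item]), a, b, item))
    if not diffs:
        return None, []

    values = sorted(d for d, _, _, _ in diffs)
    threshold = values[min(len(values) - 1, int(q * len(values)))]
    conflicts = [(a, b, item) for d, a, b, item in diffs if d > threshold]
    return threshold, conflicts
```

Deriving the threshold from the distribution of disagreements, rather than fixing it by hand, keeps the number of flags tied to how much the models actually differ.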
Possible limitations
- Not stated in the abstract.
Abstract
arXiv:2510.05942v3
We present EvalMORAAL, a transparent chain-of-thought (CoT) framework that uses two scoring methods (log-probabilities and direct ratings) plus a model-as-judge peer review to evaluate moral alignment in 20 large language models. We assess models on the World Values Survey (55 countries, 19 topics) and the PEW Global Attitudes Survey (39 countries, 8 topics). With EvalMORAAL, top models align closely with survey responses (Pearson's $r \approx 0.90$ on WVS). Yet we find a clear regional difference: Western regions average $r=0.82$ while non-Western regions average $r=0.61$ (a 0.21 absolute gap), indicating a persistent regional alignment gap. Our framework adds three parts: (1) two scoring methods for all models to enable fair comparison, (2) a structured CoT protocol with self-consistency checks, and (3) a model-as-judge peer review that flags 348 conflicts using a data-driven threshold. Peer agreement relates to WVS survey alignment ($r=0.74$, $p<.001$; PEW $r=0.39$, n.s.), supporting automated quality checks. These results show real progress toward culture-aware AI while highlighting open challenges for use across regions.