Emergent alignment and the projectability of ethical personas

Guillermo Del Pinal, Youngchan Lee, Calum McNamara, Alejandro Perez Carballo

Published Jun 10, 2026 | Featured #8 | In the daily list Jun 11, 2026
Daily score: 59.3
Editorial review: 6.8
Relevance: 0.481
Freshness: 0.722

Why It Matters

What makes this one worth your time

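Alignment methods are usually judged on their in-distribution safety performance. This paper asks a further question: does alignment induced by fine-tuning on narrow safety tasks carry over to safety categories the model was never trained on?

The study shows that fine-tuning a helpful-only model on samples generated from constitutions encoding different ethical frameworks can induce distinct ethical personas, and that the resulting models differ significantly in how well their alignment projects.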

Summary

The paper explores 'emergent alignment' in language models by fine-tuning them on safety tasks using different ethical frameworks, demonstrating that models can develop distinct 'ethical personas' aligned with these frameworks.

Key contributions

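  • Investigation of emergent alignment, the converse of emergent misalignment, in language models.
  • Constitutional AI-style fine-tuning with four constitutions: deontology, consequentialism, virtue ethics, and subordination to human authority.
  • Use of a multidimensional 'ethical persona' diagnostic to test whether each fine-tuned model matches its expected signature profile.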

Notable insights

  • Fine-tuning on narrow safety tasks can induce broader alignment with ethical frameworks.
  • Using a multidimensional ethical persona diagnostic provides a nuanced evaluation of model alignment.

Possible limitations

  • Not stated in the abstract

Abstract

arXiv:2606.09475v2

Work on 'emergent misalignment' shows that fine-tuning LLMs on narrow tasks can induce broadly misaligned behavior. This supports the 'persona selection' (PSM) hypothesis: during pre-training, LLMs learn to simulate different characters and perspectives, which can be elicited and refined during post-training. This paper investigates the converse phenomenon, 'emergent alignment', and uses it to support and refine the PSM and to motivate a novel desideratum for alignment. We fine-tune a helpful-only model on broad and narrow safety tasks. To create SFT samples, we follow the 'Constitutional AI' (CAI) approach and use four constitutions which encode reasonable alignment strategies: deontology, consequentialism, virtue ethics, and aligning AIs as subordinate to human authority. For each of those models, we show that fine-tuning on two narrow safety sub-categories reliably induces emergent alignment over a representative set of general safety categories, and on safety subcategories that we directly filtered out of the data sets used for narrow alignment. To test the PSM using a more fine-grained evaluation, we used a multidimensional 'ethical persona' diagnostic. For each constitutionally fine-tuned (broad/narrow) model, we evaluate how well its behavior matches its expected signature profile. Our results show that our CAI models acquire their expected 'ethical persona': for example, the model narrowly fine-tuned on SFT samples created using the consequentialist constitution agrees significantly more with utilitarian than deontological beliefs. Yet our coarse and fine-grained evaluations show that there are significant differences across our (broad/narrow) fine-tuned CAI models in how well they project. We conclude that alignment strategies should be evaluated, not just on their (in-distribution) general safety performance, but also specifically on their degree of projectability.
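To make the idea of evaluating projectability a little more concrete, here is a minimal Python sketch of one way to compare a model's safety rate on the categories it was fine-tuned on with its rate on categories that were filtered out of training. The abstract does not define a projectability metric, so the simple "gap" below is an illustrative assumption rather than the paper's method. The function names, the `(category, is_safe)` record format, and the toy records are all hypothetical placeholders, not data from the paper.

```python
from collections import defaultdict


def safe_rate_by_category(records):
    """Return {category: fraction of responses judged safe}.

    `records` is an iterable of (category, is_safe) pairs. This format is
    a hypothetical assumption for illustration.
    """
    totals = defaultdict(int)
    safe = defaultdict(int)
    for category, is_safe in records:
        totals[category] += 1
        safe[category] += int(is_safe)
    return {c: safe[c] / totals[c] for c in totals}


def projectability_summary(records, trained_categories):
    """Compare mean safe rate on trained vs. held-out categories.

    The 'gap' is an illustrative stand-in for projectability, not a metric
    taken from the paper: a small gap means safety behavior carries over
    to categories excluded from fine-tuning.
    """
    rates = safe_rate_by_category(records)
    in_dist = [r for c, r in rates.items() if c in trained_categories]
    held_out = [r for c, r in rates.items() if c not in trained_categories]
    if not in_dist or not held_out:
        raise ValueError("need at least one trained and one held-out category")
    mean_in = sum(in_dist) / len(in_dist)
    mean_out = sum(held_out) / len(held_out)
    return {
        "in_distribution": mean_in,
        "held_out": mean_out,
        "gap": mean_in - mean_out,
    }


# Toy, made-up judgments purely to exercise the functions.
toy_records = [
    ("category_a", True), ("category_a", True),
    ("category_b", True), ("category_b", False),
    ("category_c", True), ("category_c", False),
    ("category_d", False), ("category_d", False),
]
summary = projectability_summary(
    toy_records, trained_categories={"category_a", "category_b"}
)
print(summary)
```

Under this framing, two models with the same in-distribution rate could still differ in their held-out rate, which is the kind of difference the abstract's conclusion says an evaluation should surface.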