Revisiting the Reliability of Language Models in Instruction-Following

Jianshuo Dong, Yutong Zhang, Yan Liu, Zhenyu Zhong, Tao Wei, Chao Zhang, Han Qiu

Published May 29, 2026 | Featured #7 | In the daily list Apr 17, 2026


Daily score: 71.7

Editorial review: 8.5

Relevance: 0.492

Freshness: 0.722

Why It Matters

What makes this one worth your time

As language models are increasingly deployed in real-world applications, ensuring their reliability across varied and nuanced user inputs is crucial for dependable and trustworthy AI services.

The paper introduces a new metric and benchmark to assess and improve the nuance-oriented reliability of language models.

Summary

This paper studies nuance-oriented reliability in language models: whether a model stays competent across "cousin prompts" that convey analogous user intents with subtle variations. It introduces a metric, reliable@k, and a benchmark, IFEval++, to measure this. Across 20 proprietary and 26 open-source LLMs, performance drops by up to 61.8% under nuanced prompt modifications, which points to a gap between benchmark scores and reliability in real-world use.

Key contributions

Introduction of the reliable@k metric, an automated pipeline that generates cousin prompts via data augmentation, and the IFEval++ benchmark for evaluating nuance-oriented reliability in language models.

Notable insights

Current language models show significant performance variability with nuanced prompt changes, indicating a gap in their instruction-following capabilities.

Possible limitations

The study focuses on a specific aspect of reliability and may not address other dimensions of model robustness or generalization.

Abstract

arXiv:2512.14754v3

Advanced LLMs have achieved near-ceiling instruction-following accuracy on benchmarks such as IFEval. However, these impressive scores do not necessarily translate to reliable services in real-world use, where users often vary their phrasing, contextual framing, and task formulations. In this paper, we study nuance-oriented reliability: whether models exhibit consistent competence across cousin prompts that convey analogous user intents but with subtle nuances. To quantify this, we introduce a new metric, reliable@k, and develop an automated pipeline that generates high-quality cousin prompts via data augmentation. Building upon this, we construct IFEval++ for systematic evaluation. Across 20 proprietary and 26 open-source LLMs, we find that current models exhibit substantial insufficiency in nuance-oriented reliability -- their performance can drop by up to 61.8% with nuanced prompt modifications. What's more, we characterize it and explore three potential improvement recipes. Our findings highlight nuance-oriented reliability as a crucial yet underexplored next step toward more dependable and trustworthy LLM behavior. Our code and benchmark are accessible: https://github.com/jianshuod/IFEval-pp.
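The abstract names reliable@k but does not define it. As a rough illustration only, the sketch below assumes one plausible reading: a prompt family (an original prompt plus its cousins) counts as reliable only if the model passes all of its first k prompts. The function names, the data layout, and the choice of a relative drop are assumptions made for this example and are not taken from the paper or its repository. The abstract also does not say whether the reported 61.8% is a relative or an absolute drop.

```python
from typing import Dict, List


def accuracy(results: Dict[str, List[bool]]) -> float:
    """Fraction of prompt families whose original prompt (index 0) passes."""
    if not results:
        raise ValueError("results must not be empty")
    return sum(passes[0] for passes in results.values()) / len(results)


def reliable_at_k(results: Dict[str, List[bool]], k: int) -> float:
    """Assumed definition: fraction of families whose first k prompts all pass.

    Each list holds pass/fail outcomes for one family, with the original
    prompt at index 0 followed by its cousin prompts.
    """
    if not results:
        raise ValueError("results must not be empty")
    if k < 1:
        raise ValueError("k must be at least 1")
    if any(len(passes) < k for passes in results.values()):
        raise ValueError("every family needs at least k outcomes")
    return sum(all(passes[:k]) for passes in results.values()) / len(results)


def relative_drop(baseline: float, reliable: float) -> float:
    """Relative decrease from the baseline score to the reliable@k score."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (baseline - reliable) / baseline


if __name__ == "__main__":
    # Hypothetical outcomes for four prompt families, three prompts each.
    toy = {
        "task_a": [True, True, True],
        "task_b": [True, False, True],
        "task_c": [True, True, False],
        "task_d": [False, True, True],
    }
    base = accuracy(toy)
    rel = reliable_at_k(toy, k=3)
    print(f"accuracy={base:.2f} reliable@3={rel:.2f} "
          f"relative drop={relative_drop(base, rel):.1%}")
```

Under this reading, reliable@1 equals plain accuracy on the original prompts, and the score can only stay flat or fall as k grows, which is why a model with near-ceiling accuracy can still score poorly once cousin prompts are included.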