NLG Evaluation: Past, Present, Future
Ehud Reiter
Why It Matters
What makes this one worth your time
Understanding the evolution of NLG evaluation helps researchers and engineers improve current methodologies and anticipate future challenges in the field.
The paper surveys the evolution of NLG evaluation and forecasts future trends.
Summary
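The paper reviews how Natural Language Generation (NLG) evaluation has changed from 1990 to 2026 and speculates on future trends. It highlights the field's shift from close ties with linguistics, when there was very little formal experimental evaluation, to close ties with machine learning, where experimental evaluation is expected, and notes the many evaluation techniques developed along the way, most recently LLM-as-Judge.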
Key contributions
- Historical overview of NLG evaluation methods.
- Forecast of future trends in NLG evaluation: impact, qualitative, and safety evaluation are expected to become more important.
Notable insights
- As NLG moved from close ties with linguistics to close ties with machine learning, formal experimental evaluation went from rare to fundamental.
- The introduction of LLM-as-Judge as a recent evaluation technique.
Possible limitations
- Not stated in the abstract
Abstract
arXiv:2605.23715v1
Natural Language Generation (NLG) evaluation has changed dramatically since 1990, and will continue to evolve in the future. In 1990, when NLG had close ties to linguistics, there was very little formal experimental evaluation in the modern sense. In 2026, when NLG is closely linked to machine learning, experimental evaluation is expected and indeed fundamental to research. Many evaluation techniques were developed over this period, including most recently LLM-as-Judge. I expect NLG evaluation will continue to evolve in the future. In particular, impact, qualitative, and safety evaluation will become more important as large numbers of people routinely use NLG technology.