Illusions of the Gold Standard: A Large-scale Analysis of Human Evaluation Protocols for Long-form Text Generation
Katelyn Xiaoying Mei, Yi-Li Hsu, Minjoon Choi, Zongwan Cao, Chenjun Xu, Bingbing Wen, Su Lin Blodgett, Lucy Lu Wang
Why It Matters
What makes this one worth your time
Improving the transparency and reproducibility of human evaluations can enhance the reliability of research findings in text generation, which is crucial for advancing the field.
A critical examination of human evaluation protocols reveals significant under-reporting in long-form text generation studies.
Summary
The paper conducts a large-scale analysis of human evaluation protocols for long-form text generation in *CL conference publications from 2023 to 2025. It manually reviews 284 papers and analyzes another 1,800+ papers with LLM assistance to characterize reporting norms and practices.
Key contributions
- A full manual review of the human evaluation protocols reported in 284 papers.
- An LLM-assisted analysis of over 1,800 additional papers to assess reporting practices.
- Actionable recommendations for improving transparency and reproducibility in human evaluation studies.
Notable insights
- The study defines 20 reportable criteria related to the reproducibility of human evaluation studies, which could help standardize reporting practices.
- The use of LLM-assisted analysis for a large dataset demonstrates an innovative approach to systematic review.
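As a rough illustration of what applying a set of reportable criteria to a collection of papers could look like, the sketch below tallies the fraction of papers that report each criterion. It is not the authors' analysis code (that lives in the repository linked in the abstract). The criterion names, the `reporting_rates` helper, and the toy annotations are all hypothetical, since the abstract does not list the 20 criteria.

```python
from collections import Counter

# Hypothetical criterion names, for illustration only; the paper's actual
# 20 criteria are not listed in the abstract.
CRITERIA = ["num_annotators", "annotator_background", "agreement_reported"]


def reporting_rates(annotations):
    """Return the fraction of papers that report each criterion.

    `annotations` maps a paper ID to the set of criteria that paper reports.
    """
    counts = Counter()
    for reported in annotations.values():
        for criterion in CRITERIA:
            if criterion in reported:
                counts[criterion] += 1
    total = len(annotations)
    # Guard against an empty collection to avoid dividing by zero.
    return {c: (counts[c] / total if total else 0.0) for c in CRITERIA}


# Toy annotations, not results from the paper.
toy = {
    "paper_a": {"num_annotators", "agreement_reported"},
    "paper_b": {"num_annotators"},
    "paper_c": set(),
    "paper_d": {"num_annotators", "annotator_background"},
}

rates = reporting_rates(toy)
for criterion, rate in rates.items():
    print(f"{criterion}: {rate:.2f}")
```

A low rate for a criterion in a tally like this is the kind of signal the paper describes as under-reporting.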
Possible limitations
- Not stated in the abstract.
Abstract
arXiv:2606.07936v1
Human evaluation plays a critical role in assessing the quality of generated text. However, the reliability and reproducibility of these evaluations depend on transparent and well-documented protocols -- details that are frequently missing in current practice. In this work, we conduct a large-scale analysis of human evaluation protocols for evaluating long-form generation tasks in *CL conference publications from 2023--2025, including a full manual review of 284 papers and LLM-assisted analysis for another 1.8k+ papers. We define a set of 20 reportable criteria related to reproducibility of human evaluation studies, and apply these criteria to systematically examine reporting norms and practices within the community. We find widespread under-reporting of important aspects of human evaluation study design, leading to ambiguity about what was measured and how, who contributed judgments, and how judgments should be interpreted. Based on these findings, we outline actionable recommendations to support more transparent and reproducible reporting in future research. Our analysis code and annotated dataset can be found at: https://github.com/larchlab/Illusions-of-the-Gold-Standard