GrowLoop: Self-Evolving Conversation Evaluation Seeded by Human
Yihang Lin, Yunze Gao, Zeyang Lin, Dongbo Li, Kun Peng, Yue Liu
Why It Matters
As large language models improve, evaluation needs to stay robust and keep adapting in order to check whether conversational agents come across as human-like, which makes this work relevant to both researchers and practitioners in AI.
GrowLoop targets this need with a conversation evaluation system that continuously adapts to model advances and shifting human expectations.
Summary
The paper introduces GrowLoop, a self-evolving conversation evaluation system that adapts as large language models advance and human expectations change. Starting from minimal human seed annotations, LLM agents extract and refine evaluation rubrics through Heuristic Learning.
Key contributions
- Development of a self-evolving evaluation system for conversational agents.
- Introduction of a mechanism for continuous adaptation of evaluation rubrics based on human input.
- Demonstration of improved alignment with human judgments compared to existing evaluation methods.
Notable insights
- Heuristic Learning refines evaluation rubrics from minimal human input, which addresses the problem that human-likeness criteria are implicit and resist explicit formulation.
- The Rubric-Case co-evolution mechanism lets the system adapt to new scenarios: when the evaluation target shifts, new human seeds expand its coverage.
Possible limitations
- Not stated in the abstract.
Abstract
arXiv:2605.28882v2
With the rapid advancement of large language models, evaluating human-likeness in open-ended conversation has become increasingly important. However, human-likeness is a form of tacit knowledge that humans perceive intuitively, yet the underlying criteria resist explicit formulation. Human judgments vary widely, with strong agreement on some cases and legitimate disagreement on others. Meanwhile, the criteria behind human judgments remain implicit, leaving no clear basis for constructing cases. Further, what counts as human-likeness is not static, but evolving with model capability and human expectations. Despite progress in evaluation methods such as expert-authored benchmarks, Reward Models, and self-evolving benchmarks, none addresses all three challenges simultaneously. Therefore, we propose GrowLoop, a self-evolving conversation evaluation system that continuously adapts as models advance and scenarios shift. Starting from minimal human seed annotations, LLM agents iteratively extract and refine evaluation rubrics through Heuristic Learning. Human-AI agreement is required where annotators converge, while only plausibility is expected where they diverge. Moreover, the Rubric-Case co-evolution mechanism enables continuous evolution. When the evaluation target shifts, new human seeds expand the system's coverage accordingly. When applied to human-likeness evaluation in open-ended conversation, the AI judge guided by these rubrics not only substantially outperforms existing methods in alignment with human judgments, but also uncovers issues that annotators overlook. The resulting benchmark effectively discriminates models across capability tiers and reveals where they fall short, while generalizing to new scenarios and adapting as models advance. Our work shifts the benchmarking paradigm from manual updates or difficulty scaling to comprehensive, continuous self-evolution.
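
The abstract's acceptance rule (agreement where annotators converge, plausibility where they diverge) can be illustrated with a small sketch. This is not the authors' implementation: the function name `judge_is_acceptable`, the label strings, the 0.8 consensus threshold, and the reading of "plausible" as "matches at least one annotator" are all assumptions made for illustration, since the abstract does not specify them.

```python
from collections import Counter


def judge_is_acceptable(human_labels, judge_label, consensus_threshold=0.8):
    """Check an AI judge's verdict on one case against human seed labels.

    Hypothetical rule (all details assumed, not taken from the paper):
    - If the share of annotators choosing the majority label reaches
      `consensus_threshold`, the judge must match that majority label.
    - Otherwise the annotators are treated as diverging, and the judge
      only needs to pick a label that at least one annotator chose.
    """
    if not human_labels:
        raise ValueError("need at least one human label")

    counts = Counter(human_labels)
    majority_label, majority_count = counts.most_common(1)[0]
    agreement = majority_count / len(human_labels)

    if agreement >= consensus_threshold:
        # Annotators converge: require human-AI agreement.
        return judge_label == majority_label

    # Annotators diverge: require only a plausible verdict.
    return judge_label in counts
```

For example, with five annotators who all label a reply "human", a judge verdict of "ai" would be rejected under this rule, whereas with a 3-to-2 split either label would be accepted as plausible.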