From Feelings to Metrics: Understanding and Formalizing How Users Vibe-Test LLMs

Itay Itzhak, Eliya Habba, Gabriel Stanovsky, Yonatan Belinkov

Published Apr 18, 2026 | Featured #2 | In the daily list Apr 20, 2026


Daily score: 56.3

Editorial review: 7.2

Relevance: 0.464

Freshness: 0.389

Why It Matters

What makes this one worth your time

Benchmark scores often fail to capture how useful a model is in practice. Understanding and formalizing how users actually evaluate models can help bridge that gap, so that LLM assessments better reflect real-world use.

This work turns informal, user-driven vibe-testing of LLMs into a structured procedure built on personalized test prompts and user-aware judgment criteria.

Summary

The paper investigates the informal practice of 'vibe-testing' LLMs by users and formalizes this process to enable systematic evaluation, introducing a proof-of-concept evaluation pipeline that incorporates personalized prompts and user-aware criteria.

Key contributions

Analysis of user evaluation practices, drawing on a user survey and a collection of in-the-wild model comparison reports from blogs and social media.
Formalization of vibe-testing as a two-part process involving personalization of tests and judgment criteria.
Development of a proof-of-concept evaluation pipeline that integrates user-aware subjective criteria.
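
The two-part formulation above can be sketched in code. This is a minimal illustration, not the authors' pipeline: the names UserProfile, personalize_prompt, user_aware_score, and compare are hypothetical, and the toy models and judge are plain Python stand-ins for what would in practice be LLM calls. The weighted-mean scoring rule is also an assumption chosen for simplicity.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class UserProfile:
    """Hypothetical user description (an assumption, not taken from the paper)."""
    context: str                        # what the user works on
    criteria_weights: Dict[str, float]  # how much each judgment criterion matters


def personalize_prompt(base_prompt: str, user: UserProfile) -> str:
    """Part 1: adapt WHAT is tested by attaching the user's context to the task."""
    return f"{base_prompt}\n\nContext: {user.context}"


def user_aware_score(criterion_scores: Dict[str, float], user: UserProfile) -> float:
    """Part 2: adapt HOW responses are judged, as a weighted mean of criterion scores."""
    total = sum(user.criteria_weights.values())
    weighted = sum(w * criterion_scores.get(name, 0.0)
                   for name, w in user.criteria_weights.items())
    return weighted / total


def compare(base_prompt: str, user: UserProfile,
            model_a: Callable[[str], str], model_b: Callable[[str], str],
            judge: Callable[[str, str], Dict[str, float]]) -> str:
    """Return "A", "B", or "tie" for one personalized pairwise comparison."""
    prompt = personalize_prompt(base_prompt, user)
    score_a = user_aware_score(judge(prompt, model_a(prompt)), user)
    score_b = user_aware_score(judge(prompt, model_b(prompt)), user)
    if score_a == score_b:
        return "tie"
    return "A" if score_a > score_b else "B"


# Toy stand-ins (assumptions): real models and judges would be LLM calls.
def terse_model(prompt: str) -> str:
    return "def add(a, b): return a + b"


def verbose_model(prompt: str) -> str:
    return "# Adds two numbers.\ndef add(a, b):\n    # return the sum\n    return a + b"


def toy_judge(prompt: str, response: str) -> Dict[str, float]:
    # Crude proxies: short answers count as brief, comments count as explanation.
    return {"brevity": 1.0 if len(response) < 40 else 0.0,
            "explanation": 1.0 if "#" in response else 0.0}


expert = UserProfile("senior engineer, wants compact code",
                     {"brevity": 0.8, "explanation": 0.2})
learner = UserProfile("student learning Python",
                      {"brevity": 0.2, "explanation": 0.8})

task = "Write a function that adds two numbers."
expert_choice = compare(task, expert, terse_model, verbose_model, toy_judge)
learner_choice = compare(task, learner, terse_model, verbose_model, toy_judge)
```

In this toy setup the same pair of responses is ranked differently by the two hypothetical users, because only the criterion weights change. The personalized prompt is passed to both models but does not affect the stub outputs here; a real model would be expected to respond to it.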

Notable insights

Users personalize both what they test and how they judge responses, so evaluation outcomes depend on who is evaluating and not only on the model.
In experiments on coding benchmarks, combining personalized prompts with user-aware evaluation can change which model is preferred.
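
To make "a change in which model is preferred" concrete, the sketch below aggregates pairwise outcomes into win rates under two conditions. The helper names win_rate and preferred_model are hypothetical, and the outcome lists are invented purely for illustration; they are not results from the paper.

```python
from collections import Counter
from typing import List, Optional


def win_rate(outcomes: List[str], model: str) -> float:
    """Fraction of comparisons won by `model`; ties count as non-wins."""
    if not outcomes:
        return 0.0
    return sum(o == model for o in outcomes) / len(outcomes)


def preferred_model(outcomes: List[str]) -> Optional[str]:
    """Model with the most wins, or None if there are no wins or first place is tied."""
    counts = Counter(o for o in outcomes if o != "tie")
    ranked = counts.most_common()
    if not ranked:
        return None
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None
    return ranked[0][0]


# Invented outcomes (assumption): one label per prompt, naming the winning model.
generic_outcomes = ["A", "A", "B", "A", "tie"]        # original benchmark prompts
personalized_outcomes = ["B", "B", "A", "B", "tie"]   # personalized prompts and criteria

generic_pick = preferred_model(generic_outcomes)
personalized_pick = preferred_model(personalized_outcomes)
flipped = generic_pick != personalized_pick
```

With these made-up labels, model A wins three of five generic comparisons while model B wins three of five personalized ones, so the preferred model differs between the two conditions.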

Possible limitations

Not stated in the abstract.

Abstract

arXiv:2604.14137v2

Evaluating LLMs is challenging, as benchmark scores often fail to capture models' real-world usefulness. Instead, users often rely on "vibe-testing": informal experience-based evaluation, such as comparing models on coding tasks related to their own workflow. While prevalent, vibe-testing is often too ad hoc and unstructured to analyze or reproduce at scale. In this work, we study how vibe-testing works in practice and then formalize it to support systematic analysis. We first analyze two empirical resources: (1) a survey of user evaluation practices, and (2) a collection of in-the-wild model comparison reports from blogs and social media. Based on these resources, we formalize vibe-testing as a two-part process: users personalize both what they test and how they judge responses. We then introduce a proof-of-concept evaluation pipeline that follows this formulation by generating personalized prompts and comparing model outputs using user-aware subjective criteria. In experiments on coding benchmarks, we find that combining personalized prompts and user-aware evaluation can change which model is preferred, reflecting the role of vibe-testing in practice. These findings suggest that formalized vibe-testing can serve as a useful approach for bridging benchmark scores and real-world experience.