LLMbench: A Comparative Close Reading Workbench for Large Language Models

David M. Berry

Published Apr 21, 2026 · Featured #2 in the daily list, Apr 22, 2026


Daily score: 64.9

Editorial review: 7.5

Relevance: 0.483

Freshness: 0.722

Why It Matters

What makes this one worth your time

This tool provides researchers in the digital humanities with a new way to engage with LLM outputs, potentially enriching the understanding of generative AI's implications.

LLMbench offers a novel approach to analyzing LLM outputs through qualitative hermeneutics.

Summary

The paper presents LLMbench, a browser-based tool designed for the comparative close reading of large language model outputs, emphasizing qualitative analysis over quantitative metrics.

Key contributions

Development of a browser-based workbench for comparative analysis of LLM outputs.
Introduction of analytical overlays for token-level, word-level, and sentence-level insights.
Provision of visualizations that represent the probabilistic structure of generated text.

Notable insights

The tool treats log-probability data as a resource for critical studies, a distinctive angle that may enrich qualitative analysis.
The use of multiple analytical overlays and modes allows for a nuanced exploration of LLM outputs that goes beyond traditional evaluation methods.
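
To make the log-probability point concrete, the sketch below computes a per-token entropy from top-k log-probabilities and renders it as a text sparkline. It is an illustration only, not LLMbench's implementation: the function names `token_entropy` and `sparkline`, the shape of the input (one dict of candidate token to log-probability per position), and the sample values are all assumptions made for this example.

```python
import math

BARS = "▁▂▃▄▅▆▇█"  # eight block heights used for the sparkline


def token_entropy(top_logprobs):
    """Shannon entropy in bits of one position's top-k candidates.

    `top_logprobs` maps candidate token -> natural-log probability.
    The top-k mass is renormalised to sum to 1, so this only
    approximates the entropy of the full vocabulary distribution.
    """
    probs = [math.exp(lp) for lp in top_logprobs.values()]
    total = sum(probs)
    entropy = 0.0
    for p in probs:
        q = p / total
        if q > 0:
            entropy -= q * math.log2(q)
    return entropy


def sparkline(values):
    """Map each value to a block character, scaled to the largest value."""
    hi = max(values) or 1.0  # avoid division by zero when all values are 0
    top = len(BARS) - 1
    return "".join(BARS[min(top, int(v / hi * top))] for v in values)


# Hypothetical two-token response: a confident choice, then a coin flip.
positions = [
    {"The": math.log(0.9), "A": math.log(0.1)},
    {"cat": math.log(0.5), "dog": math.log(0.5)},
]
entropies = [token_entropy(p) for p in positions]
line = sparkline(entropies)
```

High-entropy positions are those where the model was least committed to the word it produced, which is where a close reader might look for the alternatives that were not chosen.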

Possible limitations

Not stated in the abstract.

Abstract

arXiv:2604.15508v1 (announce type: cross)

LLMbench is a browser-based workbench for the comparative close reading of large language model (LLM) outputs. Where existing tools for LLM comparison, such as Google PAIR's LLM Comparator, are engineered for quantitative evaluation and user-rating metrics, LLMbench is oriented towards the hermeneutic practices of the digital humanities. Two model responses to the same prompt are displayed side by side in annotatable panels with four analytical overlays (Probabilities for token-level log-probability inspection, Differences for word-level diff across the two panels, Tone for Hyland-style metadiscourse analysis, and Structure for sentence-level parsing with discourse connective highlighting), alongside five analytical modes (Stochastic Variation, Temperature Gradient, Prompt Sensitivity, Token Probabilities, and Cross-Model Divergence) that make the probabilistic structure of generated text legible at the token level. The tool treats the generated text as a research object in its own right, drawn from a probability distribution, a text that could have been otherwise, and provides visualisations, including continuous heatmaps, entropy sparklines, pixel maps, and three-dimensional probability terrains, that show the counterfactual history from which each word emerged. This paper describes the tool's architecture, its six modes, and its design rationale, and argues that log-probability data, currently underused in humanistic and social-scientific readings of AI, is an important resource for critical studies of generative AI models.
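
As a rough illustration of what a word-level diff across two panels involves, the sketch below compares two responses using Python's standard `difflib`. This is not the tool's own code: the helper `word_diff`, the whitespace tokenisation, and the two sample responses are assumptions chosen for the example.

```python
import difflib


def word_diff(response_a, response_b):
    """Return the word-level differences between two responses.

    Each entry is (tag, words_from_a, words_from_b), where tag is
    'replace', 'delete', or 'insert'. Unchanged runs are skipped.
    Words are split on whitespace, so punctuation stays attached.
    """
    a_words = response_a.split()
    b_words = response_b.split()
    matcher = difflib.SequenceMatcher(a=a_words, b=b_words, autojunk=False)
    changes = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            changes.append((tag, a_words[i1:i2], b_words[j1:j2]))
    return changes


# Hypothetical outputs from two models given the same prompt.
response_a = "The model generates a plausible answer"
response_b = "The model produces a plausible reply"
changes = word_diff(response_a, response_b)

for tag, old, new in changes:
    print(f"{tag}: {' '.join(old)!r} -> {' '.join(new)!r}")
```

A diff of this kind only locates where two responses diverge; interpreting why they diverge is the reading the workbench is meant to support.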