LH-Bench: Skill-Grounded Evaluation of Long-Horizon Agents on Subjective Enterprise Tasks

Abhishek Chandwani, Ishan Gupta

Published Jun 1, 2026 | Featured #8 | In the daily list Jun 2, 2026

Daily score: 63.5

Editorial review: 7.0

Relevance: 0.495

Freshness: 0.722

Why It Matters

What makes this one worth your time

This work is relevant for AI researchers and engineers interested in developing and evaluating AI systems that operate in complex, subjective, and context-dependent enterprise environments.

LH-Bench offers a novel framework for evaluating long-horizon agents on subjective tasks with expert-grounded rubrics.

Summary

The paper introduces LH-Bench, a new evaluation framework for assessing long-horizon agents on subjective enterprise tasks using a three-pillar design: expert-grounded rubrics, curated ground-truth artifacts, and pairwise human preference evaluation. It demonstrates that expert-grounded rubrics provide more reliable evaluation signals than those authored by LLMs and validates the approach with datasets in two environments: Figma-to-code and programmatic content.

Key contributions

Introduction of LH-Bench, a framework for evaluating long-horizon agents on subjective tasks.
Demonstration that expert-grounded rubrics yield more reliable evaluation signals than LLM-authored rubrics (kappa = 0.60 vs. 0.46).
Release of public datasets and results on Figma-to-code and programmatic content environments.

Notable insights

Expert-grounded rubrics provide more reliable evaluation signals than LLM-authored rubrics.
Pairwise human preference judgments confirm the same top-tier separation as the rubric scores (p < 0.05), providing convergent validation on subjective tasks.
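
To make the pairwise-preference idea concrete, here is a minimal sketch of how a head-to-head comparison between two agents could be checked for significance. The abstract does not say which statistical test the authors used; the exact two-sided sign test below, the function name `sign_test_p_value`, and the win counts are all assumptions chosen for illustration, and ties are assumed to have been dropped beforehand.

```python
from math import comb


def sign_test_p_value(wins_a: int, wins_b: int) -> float:
    """Exact two-sided sign test for pairwise preferences.

    Null hypothesis: raters are equally likely to prefer either agent.
    Ties are assumed to be excluded before counting (an assumption of
    this sketch, not something stated in the abstract).
    """
    n = wins_a + wins_b
    if n == 0:
        raise ValueError("need at least one non-tied comparison")
    k = max(wins_a, wins_b)
    # Probability of seeing k or more wins out of n under a fair coin.
    upper_tail = sum(comb(n, i) for i in range(k, n + 1)) / 2**n
    # Double the tail for a two-sided test, capped at 1.
    return min(1.0, 2 * upper_tail)


if __name__ == "__main__":
    # Hypothetical counts: agent A preferred in 15 of 20 comparisons.
    p = sign_test_p_value(15, 5)
    print(f"two-sided p-value: {p:.4f}")
```

With only a handful of comparisons a lopsided split can still fail to reach significance, which is why a preference study of this kind needs enough non-tied judgments per pair of agents.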

Possible limitations

Not stated in the abstract

Abstract

arXiv:2603.22744v2

Large language models excel on objectively verifiable tasks such as math and programming, where evaluation reduces to unit tests or a single correct answer. In contrast, real-world enterprise work is often subjective and context-dependent: success hinges on organizational goals, user intent, and the quality of intermediate artifacts produced across long, multi-tool workflows. We introduce LH-Bench, a three-pillar evaluation design that moves beyond binary correctness to score autonomous, long-horizon execution on subjective enterprise tasks. The pillars are: (i) expert-grounded rubrics that give LLM judges the domain context needed to score subjective work, (ii) curated ground-truth artifacts that enable stepwise reward signals (e.g., chapter-level annotation for content tasks), and (iii) pairwise human preference evaluation for convergent validation. We show that domain-authored rubrics provide substantially more reliable evaluation signals than LLM-authored rubrics (kappa = 0.60 vs. 0.46), and that human preference judgments confirm the same top-tier separation (p < 0.05), evidence that expert-grounded evaluation can scale without sacrificing reliability. We release public datasets and report results on two environments: Figma-to-code (33 real .fig tasks against the Figma API via MCP) and Programmatic content (41 courses comprising 183 individually-evaluated chapters on a course platform serving 30+ daily users).
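
The kappa figures quoted above measure chance-corrected agreement between raters. The abstract does not specify which kappa variant was used, so the sketch below assumes Cohen's kappa between one LLM judge and one human rater over categorical labels; the function name `cohens_kappa` and the sample labels are hypothetical and are not taken from the paper's data.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(labels_a: Sequence[Hashable], labels_b: Sequence[Hashable]) -> float:
    """Cohen's kappa for two raters labelling the same items.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement
    rate and p_e is the agreement expected by chance from each rater's
    own label frequencies.
    """
    if len(labels_a) != len(labels_b) or len(labels_a) == 0:
        raise ValueError("both raters must label the same non-empty item set")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    categories = set(freq_a) | set(freq_b)
    expected = sum(freq_a[c] * freq_b[c] for c in categories) / (n * n)
    if expected == 1.0:
        # Both raters used one identical label throughout; agreement is trivial.
        return 1.0
    return (observed - expected) / (1.0 - expected)


if __name__ == "__main__":
    # Hypothetical pass/fail verdicts on ten rubric items (1 = pass, 0 = fail).
    llm_judge = [1, 1, 0, 0, 1, 0, 1, 0, 1, 1]
    human = [1, 1, 0, 0, 1, 0, 0, 1, 1, 1]
    print(f"kappa = {cohens_kappa(llm_judge, human):.2f}")
```

Raw percent agreement would overstate reliability when one label dominates, which is the usual reason for reporting a chance-corrected statistic such as kappa in rubric comparisons like this one.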