Structure-BiEval: A Self-Supervised, Dual-Track Framework for Decoupling Structure and Content in LLM Evaluation for Web Information Systems

Boxiang Zhao, Qince Li, Zhonghao Wang, Zelin Cao, Yi Wang, Peng Cheng, Bo Lin

Published May 18, 2026

Why It Matters

Understanding and improving the structural fidelity of LLMs is crucial for their effective deployment in web-based systems, where accurate data formatting is essential for API interactions and data exchange.

Structure-BiEval provides a novel framework for evaluating LLMs' structural fidelity in web data systems.

Summary

The paper introduces Structure-BiEval, a self-supervised framework designed to evaluate the structural fidelity of large language models (LLMs) in web information systems by decoupling structure from content using deterministic intermediate representations.

Key contributions

Proposes a self-supervised framework for evaluating LLMs in web data systems.
Decouples structure from content through deterministic intermediate representations, scored with two dedicated metrics.
Empirically benchmarks 15 LLMs across hierarchical and tabular data topologies.

Notable insights

The framework pairs two metrics, Content Semantic Accuracy and Normalized Tree Edit Distance, so that content and structure are scored separately.
Mid-sized models can outperform larger ones in specific web data formatting tasks, challenging assumptions about model size and performance.
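
To make the structure track concrete, here is a minimal Python sketch of a normalized tree edit distance over JSON payloads. It is an illustration under stated assumptions, not the paper's implementation: the `skeleton` function is a hypothetical stand-in for the deterministic intermediate representation, the edit distance uses unit costs on ordered trees, and dividing by the combined node count is an assumed normalization (the abstract does not give the formula).

```python
import json
from functools import lru_cache


def skeleton(value):
    """Map parsed JSON to a (label, children) tuple tree with leaf content erased.

    Hypothetical intermediate representation, assumed for illustration only.
    """
    if isinstance(value, dict):
        # Sort keys so that key order alone never counts as a structural difference.
        return ("object", tuple(("key:" + k, (skeleton(v),))
                                for k, v in sorted(value.items())))
    if isinstance(value, list):
        return ("array", tuple(skeleton(v) for v in value))
    return ("value", ())  # every scalar collapses to the same placeholder


def size(forest):
    """Count the nodes in a forest (a tuple of trees)."""
    return sum(1 + size(children) for _, children in forest)


@lru_cache(maxsize=None)
def forest_distance(f, g):
    """Unit-cost edit distance between two ordered forests (simple recursion)."""
    if not f:
        return size(g)  # insert everything that remains in g
    if not g:
        return size(f)  # delete everything that remains in f
    (f_label, f_kids), (g_label, g_kids) = f[-1], g[-1]
    delete = forest_distance(f[:-1] + f_kids, g) + 1
    insert = forest_distance(f, g[:-1] + g_kids) + 1
    match = (forest_distance(f_kids, g_kids)
             + forest_distance(f[:-1], g[:-1])
             + (f_label != g_label))  # relabel costs 1 when labels differ
    return min(delete, insert, match)


def nted(reference, candidate):
    """Tree edit distance divided by the combined node count (assumed normalization)."""
    ref_tree, cand_tree = skeleton(reference), skeleton(candidate)
    total = size((ref_tree,)) + size((cand_tree,))
    return forest_distance((ref_tree,), (cand_tree,)) / total


reference = json.loads('{"user": {"name": "Ada", "tags": ["x", "y"]}}')
same_shape = json.loads('{"user": {"name": "Bob", "tags": ["p", "q"]}}')
flattened = json.loads('{"user": "Ada", "tags": ["x", "y"]}')

print(nted(reference, same_shape))  # 0.0: content differs, structure is identical
print(nted(reference, flattened))   # greater than 0: the nesting was lost
```

Because every scalar collapses to the same placeholder, the score responds only to changes in shape, which is the sense in which structure is measured independently of content. The memoized recursion is adequate for small payloads; a production version would likely use a dedicated tree edit distance algorithm.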

Possible limitations

Not stated in the abstract

Abstract

arXiv:2601.19923v2

As Large Language Models (LLMs) evolve into the core of Web-based autonomous agents and complex Web Information Systems, their ability to faithfully translate natural language into rigorous structured formats has become paramount, as this capability is critical for Web API invocation and data exchange. However, evaluating this structural fidelity in Web-native payloads remains a challenge: traditional text metrics fail to capture topological consistency in semi-structured Web data, while manual evaluation is prohibitively costly. To address this, we propose Structure-BiEval, a novel self-supervised framework for quantitative, annotation-free assessment tailored for Web data engineering. By leveraging deterministic Intermediate Representations, our framework effectively decouples structure from content, utilizing Content Semantic Accuracy and Normalized Tree Edit Distance as precise metrics. We empirically benchmark 15 state-of-the-art LLMs across dual Web structural topologies, namely Hierarchical Data (Web backend payloads) and Tabular Data (Web frontend presentation). The results reveal substantial variability in structural performance, including cases where mid-sized models unexpectedly outperform larger counterparts in Web data formatting. Furthermore, our findings show that deep recursive nesting poses a consistent challenge for Web agents across varying parameter scales.
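
The content side of the decoupling can be sketched in the same spirit. The Python snippet below is a crude stand-in, not the paper's Content Semantic Accuracy: it assumes content can be approximated by the multiset of leaf values in a payload and compares them by normalized string equality, whereas the actual metric presumably uses a semantic comparison that the abstract does not describe. The names `leaf_values` and `content_recall` are hypothetical.

```python
from collections import Counter


def leaf_values(value):
    """Yield every scalar in a parsed JSON value as a normalized string."""
    if isinstance(value, dict):
        for child in value.values():
            yield from leaf_values(child)
    elif isinstance(value, list):
        for child in value:
            yield from leaf_values(child)
    else:
        # Assumed normalization: ignore case and surrounding whitespace.
        yield str(value).strip().casefold()


def content_recall(reference, candidate):
    """Fraction of reference leaf values that also appear in the candidate."""
    ref = Counter(leaf_values(reference))
    cand = Counter(leaf_values(candidate))
    if not ref:
        return 1.0  # nothing to recover
    matched = sum((ref & cand).values())  # multiset intersection
    return matched / sum(ref.values())


reference = {"user": {"name": "Ada", "tags": ["x", "y"]}}
flattened = {"user": "ada ", "tags": ["x", "y"]}
partial = {"user": {"name": "Bob", "tags": ["x"]}}

print(content_recall(reference, flattened))  # 1.0: all values kept, nesting ignored
print(content_recall(reference, partial))    # one of three reference values survives
```

Since keys and nesting are discarded before comparison, an output with the wrong shape can still score 1.0 here, which is the point of scoring content on a separate track from structure.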