Mind the Gap: Can Frontier LLMs Pass a Standardized Office Proficiency Exam?

Tengchao Lv, Dongdong Zhang, Jiayu Ding, Yilin Jia, Yuzhong Zhao, Yupan Huang, Wenshan Wu, Xiangyang Zhou, Shaohan Huang, Nan Yang, Li Dong, Lei Cui, Furu Wei

Published Jun 10, 2026

Featured #10

In the daily list Jun 11, 2026

Daily score: 57.4

Editorial review: 6.8

Relevance: 0.454

Freshness: 0.722

Why It Matters

What makes this one worth your time

Understanding the limitations of LLMs in practical office tasks is crucial for researchers and engineers aiming to improve automation capabilities and develop more effective AI systems.

The paper benchmarks LLMs on office automation tasks, highlighting their current limitations.

Summary

The paper evaluates the capability of frontier Large Language Models (LLMs) to perform office automation tasks by benchmarking them against China's National Computer Rank Examination, revealing significant limitations in their current proficiency.

Key contributions

Introduction of a benchmark based on the NCRE for evaluating LLMs in office automation.
Demonstration of the current limitations of LLMs in achieving high proficiency in practical office tasks.

Notable insights

The use of a standardized exam like the NCRE provides a structured and quantifiable way to assess LLM performance in office automation.
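An agentic system with execution feedback, iterative repair, and broader Office automation access reaches a 68.8% Score Rate, well above the 36.6% maximum for single-turn models but still below the 95.5% community-reference score.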

Possible limitations

Not stated in the abstract

Abstract

arXiv:2606.10956v1

The deployment of Large Language Model (LLM) agents for computer automation is accelerating, yet their ability to navigate complex, professional-grade productivity software is largely untested. We argue that Office automation is an ideal environment for benchmarking document-automation capability, as it requires long-horizon planning and reasoning, precise parameter configuration, and multi-application integration. To quantify this capability, we introduce an evaluation based on China's National Computer Rank Examination (NCRE), featuring 200 comprehensive practical-operation tasks across Word, Excel, and PowerPoint. Each task is scored on a 100-point rubric scale using 7,118 machine-gradable criteria, and Score Rate (SR) denotes the mean percentage of rubric points earned across these tasks. We benchmark 7 frontier LLMs and observe stark limitations: single-turn models score a maximum of 36.6%. A stronger agentic system with execution feedback, iterative repair, and broader Office automation access reaches 68.8%, but remains below the 95.5% community-reference score used as a scoring sanity check. Ultimately, our experiments demonstrate that despite recent advancements in code generation, achieving reliable fine-grained Office document automation remains a significant challenge for current code-generating LLM and agent systems.
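The Score Rate defined in the abstract can be illustrated with a small worked example. This is a minimal sketch, not the authors' grading code: the `Criterion` type, the per-criterion point weights, and the binary pass/fail outcome are all assumptions made for illustration, since the abstract states only that each task uses a 100-point rubric and that SR is the mean percentage of rubric points earned across tasks.

```python
from dataclasses import dataclass


@dataclass
class Criterion:
    """One machine-gradable rubric item (hypothetical structure)."""
    points: float  # rubric points this criterion is worth (assumed weighting)
    passed: bool   # whether the grader judged the criterion satisfied


def task_score(criteria):
    """Percentage of available rubric points earned on a single task."""
    total = sum(c.points for c in criteria)
    if total == 0:
        raise ValueError("task has no rubric points")
    earned = sum(c.points for c in criteria if c.passed)
    return 100.0 * earned / total


def score_rate(tasks):
    """Mean of the per-task percentages, i.e. SR as described in the abstract."""
    if not tasks:
        raise ValueError("no tasks to score")
    return sum(task_score(t) for t in tasks) / len(tasks)


# Two made-up tasks, each with a rubric that sums to 100 points.
tasks = [
    [Criterion(60, True), Criterion(40, False)],                       # 60%
    [Criterion(50, True), Criterion(25, True), Criterion(25, False)],  # 75%
]
print(score_rate(tasks))  # 67.5
```

Averaging per-task percentages gives every task equal weight no matter how many criteria it contains, which matches the abstract's wording of a mean taken across tasks.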