Beyond Problem Solving: UOJ-Bench for Evaluating Code Generation, Hacking, and Repair in Competitive Programming

Tingqiang Xu, Hangrui Zhou, Tianle Cai, Alex Gu, Kaifeng Lyu

Published Jun 12, 2026. Featured #3. In the daily list Jun 13, 2026.


Daily score: 72.3

Editorial review: 7.5

Relevance: 0.454

Freshness: 0.722

Why It Matters

What makes this one worth your time

Understanding LLMs' capabilities in educational contexts can enhance their integration into programming education and improve human learning outcomes.

UOJ-Bench benchmarks LLMs on code generation, hacking, and repair in competitive programming.

Summary

The paper introduces UOJ-Bench, a benchmark for evaluating the problem-solving, error identification, and repair capabilities of Large Language Models (LLMs) in competitive programming, revealing significant limitations in current models' performance.

Key contributions

Introduction of UOJ-Bench as a novel benchmark for evaluating LLMs in competitive programming.
Identification of performance gaps in LLMs regarding error detection in human-written code.
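Evidence that LLMs can complement traditional judging systems: under test-time scaling, the best models uncover errors in over 5% of full-score submissions across roughly 30 problems.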

Notable insights

Test-time scaling raises error-identification success from below 50% under one-shot evaluation to above 90%, but the inference cost limits its practicality for large-scale deployment.
The benchmark leverages real-world submissions, providing a practical evaluation framework for LLMs.

Possible limitations

The abstract does not address the scalability of UOJ-Bench or the generalizability of results across different programming languages or problem types.

Abstract

arXiv:2606.12864v1

Despite strong performance in competitive programming, the role of Large Language Models (LLMs) in supporting human learning in the same setting remains largely unexplored. In this work, we introduce UOJ-Bench, a benchmark designed to evaluate not only the problem-solving ability of LLMs, but also their ability to identify errors in human-written code -- a crucial educational activity traditionally supported by running test cases over online judge systems. UOJ-Bench consists of three distinct tasks: code generation, code hacking, and code repair, all constructed from real-world code submissions on the Universal Online Judge (UOJ) and evaluated through UOJ's native judging infrastructure. Our results show that under one-shot evaluation, even the strongest models fail to identify errors in more than 50% of a set of submissions that have been found to be incorrect by UOJ users. While test-time scaling improves success rates to above 90%, the substantial computational costs incurred from model inference limit its practicality for large-scale deployment. Despite these limitations, we find that the best-performing models under test-time scaling can uncover errors in over 5% of full-score submissions across roughly 30 problems, suggesting that frontier LLMs can already provide complementary signals beyond standard judging systems.
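To make the "code hacking" task concrete, here is a minimal sketch of what a successful hack could look like, assuming a hack counts as successful when a proposed input makes a submission's output differ from a trusted reference. The problem (maximum subarray sum), the buggy submission, and every function name below are hypothetical illustrations; the abstract does not describe UOJ-Bench's actual protocol, and real evaluation runs through UOJ's judging infrastructure. Trying several candidate inputs in turn stands in, loosely, for the difference between one-shot evaluation and test-time scaling.

```python
from typing import Callable, Iterable, Optional, Tuple

Solver = Callable[[list], int]


def reference_max_subarray(xs: list) -> int:
    """Trusted solution: maximum sum over non-empty contiguous subarrays."""
    best = cur = xs[0]
    for x in xs[1:]:
        cur = max(x, cur + x)
        best = max(best, cur)
    return best


def buggy_max_subarray(xs: list) -> int:
    """Hypothetical incorrect submission: it wrongly allows an empty subarray,
    so it returns 0 when every element is negative."""
    best = cur = 0
    for x in xs:
        cur = max(0, cur + x)
        best = max(best, cur)
    return best


def is_successful_hack(candidate: list, submission: Solver, reference: Solver) -> bool:
    """Assumed success criterion: the submission disagrees with the reference."""
    return submission(candidate) != reference(candidate)


def first_hack(
    candidates: Iterable[list], submission: Solver, reference: Solver
) -> Optional[Tuple[int, list]]:
    """Return (attempt number, input) for the first successful hack, or None."""
    for attempt, candidate in enumerate(candidates, start=1):
        if is_successful_hack(candidate, submission, reference):
            return attempt, candidate
    return None


# Hypothetical model-proposed inputs, in the order they were generated.
candidates = [[1, 2, 3], [2, -1, 2], [-3, -1, -2]]

one_shot = first_hack(candidates[:1], buggy_max_subarray, reference_max_subarray)
scaled = first_hack(candidates, buggy_max_subarray, reference_max_subarray)
print(one_shot)  # None: the first attempt does not expose the bug
print(scaled)    # (3, [-3, -1, -2]): the all-negative input exposes it
```

In this toy setting the single attempt misses the bug while the third attempt finds it, which mirrors the trade-off the abstract reports: more attempts raise the success rate, and each extra attempt costs another round of model inference.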