ProgramBench: Can Language Models Rebuild Programs From Scratch?

John Yang, Kilian Lieret, Jeffrey Ma, Parth Thakkar, Dmitrii Pedchenko, Sten Sootla, Emily McMilin, Pengcheng Yin, Rui Hou, Gabriel Synnaeve, Diyi Yang, Ofir Press

Published May 7, 2026


Editorial review: 7.2

Relevance: 0.452

Freshness: 0.000

Why It Matters

What makes this one worth your time

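This research addresses a gap in how language models are evaluated for software development. Existing benchmarks measure focused tasks such as fixing a single bug or developing a single, specified feature, whereas ProgramBench asks agents to build complete programs.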

ProgramBench assesses language models' capability to architect and implement software from scratch.

Summary

The paper introduces ProgramBench, a benchmark that evaluates whether language models can develop software projects from scratch. Given only a reference program and its documentation, an agent must make the high-level architectural decisions and implement a complete codebase that matches the reference executable's behavior.

Key contributions

Introduction of the ProgramBench benchmark for holistic software development evaluation.
Evaluation of nine language models against 200 diverse software tasks.
Identification of common pitfalls in model-generated code, such as monolithic structures.

Notable insights

The use of agent-driven fuzzing for generating end-to-end behavioral tests is a novel approach to evaluate software development capabilities.
The findings indicate a tendency for models to produce monolithic implementations, highlighting a divergence from human coding practices.
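
To make the idea of end-to-end behavioral testing concrete, here is a minimal sketch of how a candidate could be scored against a reference executable without looking at its source layout. It is not the paper's harness: the names `Observation`, `executable`, and `pass_rate` are hypothetical, the fuzzing step that would generate the test cases is omitted, and comparing only the exit code and standard output is an assumption about what counts as observable behavior.

```python
import subprocess
from dataclasses import dataclass
from typing import Callable, Iterable, Sequence, Tuple


@dataclass(frozen=True)
class Observation:
    """What one run exposes to the test: exit code and standard output."""
    exit_code: int
    stdout: str


# A "program" is anything that maps (arguments, stdin text) to an Observation.
Program = Callable[[Sequence[str], str], Observation]
Case = Tuple[Sequence[str], str]


def executable(path: str, timeout: float = 10.0) -> Program:
    """Wrap a binary on disk so it can be compared like any other Program."""
    def run(args: Sequence[str], stdin_text: str) -> Observation:
        proc = subprocess.run(
            [path, *args],
            input=stdin_text,
            capture_output=True,
            text=True,
            timeout=timeout,
        )
        return Observation(proc.returncode, proc.stdout)
    return run


def pass_rate(reference: Program, candidate: Program, cases: Iterable[Case]) -> float:
    """Fraction of cases on which the candidate's behavior equals the reference's."""
    case_list = list(cases)
    if not case_list:
        raise ValueError("at least one test case is required")
    passed = sum(
        1
        for args, stdin_text in case_list
        if reference(args, stdin_text) == candidate(args, stdin_text)
    )
    return passed / len(case_list)
```

Under these assumptions, a call such as `pass_rate(executable("./reference"), executable("./candidate"), cases)` would score a rebuilt program purely on its outputs, which is why this style of test does not prescribe any implementation structure. The two paths are placeholders.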

Possible limitations

The abstract does not specify the criteria for task difficulty or the diversity of the software tasks beyond the mentioned examples.
Potential limitations in the generalizability of results to other programming languages or paradigms are not addressed.

Abstract

arXiv:2605.03546v1

Turning ideas into full software projects from scratch has become a popular use case for language models. Agents are being deployed to seed, maintain, and grow codebases over extended periods with minimal human oversight. Such settings require models to make high-level software architecture decisions. However, existing benchmarks measure focused, limited tasks such as fixing a single bug or developing a single, specified feature. We therefore introduce ProgramBench to measure the ability of software engineering agents to develop software holistically. In ProgramBench, given only a program and its documentation, agents must architect and implement a codebase that matches the reference executable's behavior. End-to-end behavioral tests are generated via agent-driven fuzzing, enabling evaluation without prescribing implementation structure. Our 200 tasks range from compact CLI tools to widely used software such as FFmpeg, SQLite, and the PHP interpreter. We evaluate 9 LMs and find that none fully resolve any task, with the best model passing 95% of tests on only 3% of tasks. Models favor monolithic, single-file implementations that diverge sharply from human-written code.
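
The headline result combines two levels of aggregation: a pass rate per task, and the share of tasks that reach a given pass rate. The sketch below shows one plausible reading of that metric. The function names are hypothetical, the "at least" threshold is an assumption about how the 95% figure is applied, and the task names and pass rates in the usage note are invented for illustration, not taken from the paper.

```python
from typing import List, Mapping


def share_of_tasks_at_or_above(pass_rates: Mapping[str, float], threshold: float) -> float:
    """Fraction of tasks whose per-task test pass rate is at least `threshold`."""
    if not pass_rates:
        raise ValueError("at least one task is required")
    hits = sum(1 for rate in pass_rates.values() if rate >= threshold)
    return hits / len(pass_rates)


def fully_resolved(pass_rates: Mapping[str, float]) -> List[str]:
    """Tasks on which every test passed (pass rate exactly 1.0)."""
    return [task for task, rate in pass_rates.items() if rate == 1.0]
```

For example, with the made-up input `{"tool_a": 0.97, "tool_b": 0.40, "tool_c": 0.10, "tool_d": 0.00}`, `share_of_tasks_at_or_above(..., 0.95)` gives 0.25 and `fully_resolved(...)` gives an empty list. That is the same shape as the reported finding: some tasks come close, but none is fully resolved.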