AgentForge: Execution-Grounded Multi-Agent LLM Framework for Autonomous Software Engineering

Rajesh Kumar, Waqar Ali, Junaid Ahmed, Najma Imtiaz Ali, Shaban Usman

Published Apr 17, 2026. Featured #10 in the daily list, Apr 18, 2026.


Daily score: 64.4

Editorial review: 7.5

Relevance: 0.467

Freshness: 0.722

Why It Matters

What makes this one worth your time

This framework addresses the critical gap in verifying the correctness of code generated by large language models, making it highly relevant for developers and researchers focused on reliable software automation.

AGENTFORGE requires every code change to pass sandboxed execution before it propagates through the multi-agent pipeline.

Summary

The paper presents AGENTFORGE, a multi-agent framework for software engineering that integrates execution-grounded verification, ensuring code changes are validated through sandboxed execution before being applied.
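To make the "validate before applying" idea concrete, here is a minimal sketch of an execution gate. It is an illustration under stated assumptions, not the AGENTFORGE implementation: a temporary directory and a child Python process stand in for the Docker sandbox, the repository is modeled as a plain dict, and the names `survives_execution` and `apply_if_verified` are hypothetical.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def survives_execution(candidate_source: str, test_source: str,
                       timeout_s: float = 10.0) -> bool:
    """Return True only if the test script exits cleanly against the candidate.

    Assumption: a temp directory plus a child process stands in for the
    paper's Docker sandbox. This is NOT real isolation.
    """
    with tempfile.TemporaryDirectory() as workdir:
        Path(workdir, "candidate.py").write_text(candidate_source)
        Path(workdir, "test_candidate.py").write_text(test_source)
        try:
            result = subprocess.run(
                [sys.executable, "test_candidate.py"],
                cwd=workdir,
                capture_output=True,
                timeout=timeout_s,
            )
        except subprocess.TimeoutExpired:
            # A hanging candidate counts as a failed verification.
            return False
        return result.returncode == 0


def apply_if_verified(repo: dict, path: str,
                      candidate_source: str, test_source: str) -> bool:
    """Write the change into the (hypothetical) in-memory repo only if it passes."""
    if survives_execution(candidate_source, test_source):
        repo[path] = candidate_source
        return True
    return False


# Usage: a buggy change is rejected, a correct one is applied.
tests = "from candidate import add\nassert add(2, 3) == 5\n"
buggy = "def add(a, b):\n    return a - b\n"
fixed = "def add(a, b):\n    return a + b\n"

repo: dict = {}
rejected = apply_if_verified(repo, "calc.py", buggy, tests)
accepted = apply_if_verified(repo, "calc.py", fixed, tests)
```

The point of the gate is that the repository is never touched by a change that has not been executed, which is the property the summary describes.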

Key contributions

Introduction of execution-grounded verification as a core principle in software engineering with LLMs.
Development of a multi-agent framework (AGENTFORGE) that coordinates various specialized agents for improved code generation and verification.
Demonstration of 40.0% resolution on the SWE-BENCH Lite benchmark, 26 to 28 points above single-agent baselines.

Notable insights

Execution feedback is proposed as a stronger supervision signal than traditional next-token likelihood, which could shift how LLMs are trained for coding tasks.
The use of role decomposition among agents (Planner, Coder, Tester, Debugger, Critic) suggests a systematic approach to tackling complex software engineering tasks.

Possible limitations

Not stated in the abstract.

Abstract

arXiv:2604.13120v1. Large language models generate plausible code but cannot verify correctness. Existing multi-agent systems simulate execution or leave verification optional. We introduce execution-grounded verification as a first-class principle: every code change must survive sandboxed execution before propagation. We instantiate this principle in AGENTFORGE, a multi-agent framework where Planner, Coder, Tester, Debugger, and Critic agents coordinate through shared memory and a mandatory Docker sandbox. We formalize software engineering with LLMs as an iterative decision process over repository states, where execution feedback provides a stronger supervision signal than next-token likelihood. AGENTFORGE achieves 40.0% resolution on SWE-BENCH Lite, outperforming single-agent baselines by 26 to 28 points. Ablations confirm that execution feedback and role decomposition each independently drive performance. The framework is open-source at https://github.com/raja21068/AutoCodeAI.
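
The "iterative decision process over repository states" can be pictured as a loop in which each step proposes a change, executes it, and records the feedback for the next step. The sketch below is a simplified reading of the abstract, not the authors' code. All names (`RepoState`, `resolve`, the stub functions) are hypothetical, the Planner, Debugger, and Critic roles are folded into a single `propose` callable for brevity, and in-process `exec` stands in for the Docker sandbox.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class RepoState:
    files: dict[str, str]
    # Shared memory: execution feedback visible to later steps.
    history: list[str] = field(default_factory=list)


# Hypothetical signatures, not taken from the paper.
Proposer = Callable[[RepoState, str], dict[str, str]]    # (state, task) -> changed files
Executor = Callable[[dict[str, str]], tuple[bool, str]]  # files -> (passed, log)


def resolve(state: RepoState, task: str, propose: Proposer,
            execute: Executor, max_steps: int = 3) -> tuple[RepoState, bool]:
    """Iterate propose -> execute; only a passing change becomes the new state."""
    for step in range(max_steps):
        candidate = {**state.files, **propose(state, task)}
        passed, log = execute(candidate)
        state.history.append(f"step {step}: {'pass' if passed else 'fail'}: {log}")
        if passed:
            return RepoState(candidate, state.history), True
    return state, False  # no change propagated


def stub_propose(state: RepoState, task: str) -> dict[str, str]:
    """Stand-in for the agents: wrong on the first try, corrected after feedback."""
    op = "-" if not state.history else "+"
    return {"calc.py": f"def add(a, b):\n    return a {op} b\n"}


def stub_execute(files: dict[str, str]) -> tuple[bool, str]:
    """Stand-in for sandboxed test execution (in-process exec, not isolated)."""
    namespace: dict = {}
    try:
        exec(files["calc.py"], namespace)
        assert namespace["add"](2, 3) == 5
    except Exception as exc:
        return False, type(exc).__name__
    return True, "ok"


initial = RepoState(files={})
final, resolved = resolve(initial, "fix add", stub_propose, stub_execute)
```

Here the failed first attempt never reaches the repository state; it only leaves a log entry that the next proposal can condition on, which is one way to read execution feedback acting as the supervision signal.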