AgentForge: Execution-Grounded Multi-Agent LLM Framework for Autonomous Software Engineering
Rajesh Kumar, Waqar Ali, Junaid Ahmed, Najma Imtiaz Ali, Shaban Usman
Why It Matters
What makes this one worth your time
This framework addresses the critical gap in verifying the correctness of code generated by large language models, making it highly relevant for developers and researchers focused on reliable software automation.
AGENTFORGE makes execution-grounded verification mandatory, rather than optional, within a multi-agent framework.
Summary
The paper presents AGENTFORGE, a multi-agent framework for software engineering that integrates execution-grounded verification, ensuring code changes are validated through sandboxed execution before being applied.
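The gating idea in the summary can be illustrated with a minimal sketch. It is not AGENTFORGE's implementation: the function names `run_check` and `apply_if_verified` are hypothetical, and a scratch copy plus a local subprocess stands in for the mandatory Docker sandbox the paper describes. A candidate change is written to the scratch copy, a check command is executed there, and the real repository is modified only if that command exits successfully.

```python
import shutil
import subprocess
import sys
import tempfile
from pathlib import Path


def run_check(workdir: Path, check_cmd: list[str], timeout: int = 60) -> bool:
    """Run the check command inside workdir; True only on exit status 0."""
    try:
        result = subprocess.run(
            check_cmd, cwd=workdir, capture_output=True, timeout=timeout
        )
    except subprocess.TimeoutExpired:
        return False
    return result.returncode == 0


def write_changes(root: Path, changes: dict[str, str]) -> None:
    """Write each (relative path -> new content) pair under root."""
    for rel_path, content in changes.items():
        target = root / rel_path
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text(content)


def apply_if_verified(repo: Path, changes: dict[str, str], check_cmd: list[str]) -> bool:
    """Propagate changes to repo only if the check passes on a scratch copy.

    Assumption: the scratch directory is a stand-in for a real sandbox
    (the paper uses Docker); it isolates files, not the process itself.
    """
    with tempfile.TemporaryDirectory() as tmp:
        scratch = Path(tmp) / "repo"
        shutil.copytree(repo, scratch)
        write_changes(scratch, changes)
        if not run_check(scratch, check_cmd):
            return False  # rejected: the real repository is untouched
    write_changes(repo, changes)  # verified: propagate
    return True
```

Calling `apply_if_verified(repo, {"calc.py": new_source}, [sys.executable, "check.py"])` returns `False` and leaves `repo` unchanged when `check.py` fails against the patched copy, and returns `True` after writing the change when it passes.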
Key contributions
- Introduction of execution-grounded verification as a core principle in software engineering with LLMs.
- Development of a multi-agent framework (AGENTFORGE) that coordinates various specialized agents for improved code generation and verification.
- Demonstration of 40.0% resolution on the SWE-BENCH Lite benchmark, 26 to 28 points above single-agent baselines.
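The abstract reports 40.0% resolution and a gain of 26 to 28 points over single-agent baselines, which implies a baseline range that the abstract does not state directly. The short calculation below derives it. The benchmark size of 300 issues is an assumption taken from the public SWE-bench Lite release, not from the abstract, so the absolute issue count should be read as approximate.

```python
AGENTFORGE_RESOLVED_PCT = 40.0    # reported in the abstract
GAIN_RANGE_POINTS = (26.0, 28.0)  # reported in the abstract
SWE_BENCH_LITE_SIZE = 300         # assumption: not stated in the abstract

# A larger gain implies a lower baseline, so the bounds swap.
baseline_low = AGENTFORGE_RESOLVED_PCT - GAIN_RANGE_POINTS[1]
baseline_high = AGENTFORGE_RESOLVED_PCT - GAIN_RANGE_POINTS[0]

# Absolute number of resolved issues under the assumed benchmark size.
resolved_issues = round(AGENTFORGE_RESOLVED_PCT / 100 * SWE_BENCH_LITE_SIZE)

print(f"Implied baseline range: {baseline_low:.1f}% to {baseline_high:.1f}%")
print(f"Implied resolved issues: {resolved_issues} of {SWE_BENCH_LITE_SIZE}")
```

Under these figures the single-agent baselines would sit between 12.0% and 14.0% resolution.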
Notable insights
- Execution feedback is proposed as a stronger supervision signal than traditional next-token likelihood, which could shift how LLMs are trained for coding tasks.
- The use of role decomposition among agents (Planner, Coder, Tester, Debugger, Critic) suggests a systematic approach to tackling complex software engineering tasks.
Possible limitations
- Not stated in the abstract.
Abstract
arXiv:2604.13120v1
Large language models generate plausible code but cannot verify correctness. Existing multi-agent systems simulate execution or leave verification optional. We introduce execution-grounded verification as a first-class principle: every code change must survive sandboxed execution before propagation. We instantiate this principle in AGENTFORGE, a multi-agent framework where Planner, Coder, Tester, Debugger, and Critic agents coordinate through shared memory and a mandatory Docker sandbox. We formalize software engineering with LLMs as an iterative decision process over repository states, where execution feedback provides a stronger supervision signal than next-token likelihood. AGENTFORGE achieves 40.0% resolution on SWE-BENCH Lite, outperforming single-agent baselines by 26 to 28 points. Ablations confirm that execution feedback and role decomposition each independently drive performance. The framework is open-source at https://github.com/raja21068/AutoCodeAI.