Do Agents Dream of Root Shells? Partial-Credit Evaluation of LLM Agents in Capture The Flag Challenges

Ali Al-Kaswan, Maksim Plotnikov, Maxim Hájek, Roland Vízner, Arie van Deursen, Maliheh Izadi

Published Apr 22, 2026. Featured #5 in the daily list, Apr 23, 2026.

Daily score: 72.4

Editorial review: 7.5

Relevance: 0.470

Freshness: 0.722

Why It Matters

What makes this one worth your time

Understanding the limitations of LLM agents in cybersecurity tasks is crucial for developing more effective autonomous systems in this domain.

DeepRed benchmarks LLM agents on Capture The Flag challenges and scores them with checkpoint-based partial credit rather than a binary solved/unsolved outcome.

Summary

The paper introduces DeepRed, an open-source benchmark for evaluating Large Language Model (LLM) agents on Capture The Flag (CTF) challenges, employing a partial-credit scoring method to assess performance beyond simple solved/unsolved metrics.

Key contributions

Development of the DeepRed benchmark for LLM agents in CTF challenges.
Implementation of a partial-credit evaluation method based on challenge-specific checkpoints.
Benchmarking of ten commercially accessible LLMs across diverse CTF challenge categories.

Notable insights

The introduction of a partial-credit scoring method allows for a more nuanced evaluation of agent performance, addressing the binary nature of traditional assessments.
The automated summarise-then-judge pipeline for checkpoint completion offers a systematic approach to analyzing execution traces.
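
To make the partial-credit idea concrete, here is a minimal sketch of how an average checkpoint completion figure could be computed. It assumes every checkpoint carries equal weight and that the final figure is a plain mean over challenges; the summary above does not confirm either choice, and the challenge names and labels are hypothetical.

```python
from statistics import mean


def checkpoint_completion(labels: list[bool]) -> float:
    """Fraction of one challenge's checkpoints marked as completed.

    Assumption: all checkpoints are weighted equally.
    """
    if not labels:
        raise ValueError("a challenge needs at least one checkpoint")
    return sum(labels) / len(labels)


def average_completion(runs: dict[str, list[bool]]) -> float:
    """Mean per-challenge completion across all challenges (assumed aggregation)."""
    return mean(checkpoint_completion(labels) for labels in runs.values())


# Hypothetical labels for two challenges; not data from the paper.
example_runs = {
    "challenge-a": [True, True, False, False],   # 2 of 4 checkpoints
    "challenge-b": [True, False, False, False],  # 1 of 4 checkpoints
}

print(f"{average_completion(example_runs):.3f}")  # prints 0.375
```

Under binary scoring both hypothetical runs would count as unsolved; the checkpoint fractions still separate an agent that got halfway from one that stalled after the first step.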

Possible limitations

The abstract does not address potential scalability issues of the benchmark or the generalizability of results across different types of CTF challenges.

Abstract

arXiv:2604.19354v1

Large Language Model (LLM) agents are increasingly proposed for autonomous cybersecurity tasks, but their capabilities in realistic offensive settings remain poorly understood. We present DeepRed, an open-source benchmark for evaluating LLM-based agents on realistic Capture The Flag (CTF) challenges in isolated virtualized environments. DeepRed places an agent in a Kali attacker environment with terminal tools and optional web search, connected over a private network to a target challenge, and records full execution traces for analysis. To move beyond binary solved/unsolved outcomes, we introduce a partial-credit scoring method based on challenge-specific checkpoints derived from public writeups, together with an automated summarise-then-judge labelling pipeline for assigning checkpoint completion from logs. Using DeepRed, we benchmark ten commercially accessible LLMs on ten VM-based CTF challenges spanning different challenge categories. The results indicate that current agents remain limited: the best model achieves only 35% average checkpoint completion, performing strongest on common challenge types and weakest on tasks requiring non-standard discovery and longer-horizon adaptation.
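
The two-stage shape of a summarise-then-judge labeller can be sketched as below. This is an illustration only, not DeepRed's implementation: the function names, the type aliases, and the toy stand-ins for the summariser and judge are all hypothetical. In a real pipeline both stages would presumably be model calls, which the sketch replaces with simple string operations so that it runs on its own.

```python
from typing import Callable

# Hypothetical interfaces: a summariser condenses a raw trace,
# a judge decides whether a summary shows a checkpoint was reached.
Summariser = Callable[[str], str]
Judge = Callable[[str, str], bool]


def label_checkpoints(
    trace: str,
    checkpoints: list[str],
    summarise: Summariser,
    judge: Judge,
) -> dict[str, bool]:
    """Summarise the trace once, then judge each checkpoint against the summary."""
    summary = summarise(trace)
    return {checkpoint: judge(summary, checkpoint) for checkpoint in checkpoints}


def toy_summarise(trace: str) -> str:
    """Stand-in summariser: keep only the command lines (prefixed with '$ ')."""
    return "\n".join(line for line in trace.splitlines() if line.startswith("$ "))


def toy_judge(summary: str, checkpoint: str) -> bool:
    """Stand-in judge: a checkpoint counts as reached if its keyword appears."""
    return checkpoint in summary


# Hypothetical trace and checkpoint keywords; the address is a placeholder.
example_trace = "$ nmap -sV 10.0.0.5\nStarting Nmap\n$ curl http://10.0.0.5/\n<html>"
labels = label_checkpoints(example_trace, ["nmap", "sqlmap"], toy_summarise, toy_judge)
print(labels)  # prints {'nmap': True, 'sqlmap': False}
```

Passing the two stages in as callables keeps the orchestration separate from whichever model or prompt does the summarising and judging, so either stage can be swapped without touching the labelling loop.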