Context, Reasoning, and Hierarchy: A Cost-Performance Study of Compound LLM Agent Design in an Adversarial POMDP
Igor Bogdanov, Chung-Horng Lung, Thomas Kunz, Jie Gao, Adrian Taylor, Marzia Zaman
Why It Matters
What makes this one worth your time
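Practitioners deploying compound LLM agents lack guidance on which design choices improve performance and which merely increase inference cost. This study measures both, using token-level cost accounting.
The results indicate that programmatic state abstraction and hierarchical task decomposition are generally more cost-effective than deeper per-agent reasoning in a structured adversarial POMDP, and that combining hierarchy with deliberation tools can degrade performance.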
Summary
The paper studies the design of compound LLM agents in an adversarial POMDP along three dimensions: context representation, deliberation, and hierarchical task decomposition. It evaluates twelve configurations across six models from five model families (3,475 episodes) in the CybORG CAGE-2 cyber defense environment. Programmatic state abstraction delivers the largest returns per token spent, and hierarchical decomposition without deliberation achieves the best absolute performance for most models.
Key contributions
- Controlled study of compound LLM agent design in an adversarial POMDP environment.
- Evaluation of context representation, reasoning, and task decomposition strategies.
- Identification of cost-effective design principles for structured adversarial POMDPs.
Notable insights
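- Programmatic state abstraction delivers the largest returns per token spent (RPTS), improving mean return by up to 76% over raw observations.
- Hierarchical decomposition without deliberation tools achieves the best absolute performance for most models.
- Adding deliberation tools across a hierarchy (a "deliberation cascade") degrades performance for all five model families, with up to 3.4× worse mean return and 1.8-2.7× more tokens than hierarchy alone.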
Possible limitations
- Not stated in the abstract
Abstract
arXiv:2605.16205v1
Deploying compound LLM agents in adversarial, partially observable sequential environments requires navigating several design dimensions: (1) what the agent sees, (2) how it reasons, and (3) how tasks are decomposed across components. Yet practitioners lack guidance on which design choices improve performance versus merely increase inference costs. We present a controlled study of compound LLM agent design in CybORG CAGE-2, a cyber defense environment modeled as a Partially Observable Markov Decision Process (POMDP). Reward is non-positive, so all configurations operate in a failure-mitigation mode. Our evaluation spans five model families, six models, and twelve configurations (3,475 episodes) with token-level cost accounting. We vary context representation (raw observations vs. a deterministic state-tracking layer with compressed history), deliberation (self-questioning, self-critique, and self-improvement tools, with optional chain-of-thought prompting), and hierarchical decomposition (monolithic ReAct vs. delegation to specialized sub-agents). We find that: (1) Programmatic state abstraction delivers the largest returns per token spent (RPTS), improving mean return by up to 76% over raw observations. (2) Distributing deliberation tools across a hierarchy degrades performance relative to hierarchy alone for all five model families, reaching up to 3.4× worse mean return while using 1.8-2.7× more tokens. We call this destructive pattern a deliberation cascade. (3) Hierarchical decomposition without deliberation achieves the best absolute performance for most models, and context engineering is generally more cost-effective than deliberation. These findings suggest a design principle for structured adversarial POMDPs: invest in programmatic infrastructure and clean task decomposition rather than deeper per-agent reasoning, as these strategies can interfere when combined.
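The abstract contrasts raw observations with "a deterministic state-tracking layer with compressed history" but does not describe how that layer is built. The following is a minimal Python sketch of the general idea only, not the authors' implementation. The class names, the status labels ("clean", "suspicious", "compromised"), and the `{host: status}` observation shape are all hypothetical and do not reflect the actual CybORG CAGE-2 observation format.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class HostState:
    """Tracked status for one host (hypothetical fields, not CybORG's schema)."""
    status: str = "clean"   # assumed labels: "clean", "suspicious", "compromised"
    last_changed: int = 0   # step at which the status last changed


@dataclass
class StateTracker:
    """Deterministic state tracker that keeps a bounded history of changes."""
    history_len: int = 5
    hosts: dict = field(default_factory=dict)
    history: deque = field(default_factory=deque)
    step: int = 0

    def update(self, observation: dict) -> None:
        """Fold one raw observation ({host: status}) into the tracked state."""
        self.step += 1
        for host, status in sorted(observation.items()):
            state = self.hosts.setdefault(host, HostState())
            if state.status != status:
                # Compression: only status changes are recorded, not every observation.
                self.history.append(
                    f"t={self.step}: {host} {state.status}->{status}"
                )
                state.status = status
                state.last_changed = self.step
        # Keep only the most recent changes so the prompt stays short.
        while len(self.history) > self.history_len:
            self.history.popleft()

    def render(self) -> str:
        """Produce the compact text that would be placed in the agent's context."""
        lines = [f"step {self.step}"]
        for host in sorted(self.hosts):
            s = self.hosts[host]
            lines.append(f"{host}: {s.status} (since t={s.last_changed})")
        lines.append("recent changes: " + ("; ".join(self.history) or "none"))
        return "\n".join(lines)
```

In a sketch like this, the agent would receive `tracker.render()` at each step instead of the raw observation, so the prompt size depends on the number of hosts and `history_len` rather than on episode length. Because the update is plain code rather than model inference, it adds no tokens of its own.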