Causal Agent Replay: Counterfactual Attribution for LLM-Agent Failures
Jaineet Shah
Why It Matters
What makes this one worth your time
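When an LLM agent fails, observability tooling shows what happened and evaluation shows whether the run passed, but neither says which step caused the failure. This work targets that gap, which makes it relevant to AI engineers and researchers working on agent reliability and safety.
Causal Agent Replay (CAR) attributes a failure to specific steps by intervening on a step, re-executing the run forward, and measuring how the outcome distribution shifts.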
Summary
The paper introduces Causal Agent Replay (CAR), a method for identifying which steps cause failures in large language model (LLM) agents. CAR models an agent run as a structural causal model, applies a do-operation to a step, re-executes the trajectory forward under the same stochastic policy, and measures the shift in the outcome distribution. It pairs a single-step contrastive estimator with a budget-bounded Monte-Carlo Shapley estimator that splits credit across interacting steps, and it is validated against synthetic structural causal models with planted ground truth.
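The intervene-and-replay idea in the summary can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the three-step toy agent, the step names, the planted failure probabilities, and the function names are hypothetical, and the effect is a plain difference of failure rates with a normal-approximation interval. It is not the paper's contrastive estimator and does not implement its point-of-commitment rule.

```python
import math
import random

STEPS = ["retrieve", "plan", "execute"]  # hypothetical step names


def run_agent(rng, do=None):
    """Hypothetical toy agent run; returns True if the run ends in failure."""
    do = do or {}
    state = {}
    for step in STEPS:
        # Under the stochastic policy each step goes wrong half the time,
        # unless a do-operation pins its value.
        sampled = rng.random() < 0.5
        state[step] = do.get(step, sampled)
    # Planted ground truth (assumed): only a bad "plan" step drives failure.
    p_fail = 0.9 if state["plan"] else 0.05
    return rng.random() < p_fail


def failure_rate(do, n, seed):
    """Fraction of n replayed runs that fail under the given intervention."""
    rng = random.Random(seed)
    return sum(run_agent(rng, do) for _ in range(n)) / n


def contrastive_effect(step, n=4000, seed=0):
    """Baseline failure rate minus failure rate under do(step := correct)."""
    p_base = failure_rate(None, n, seed)
    p_do = failure_rate({step: False}, n, seed + 1)
    effect = p_base - p_do
    # Normal-approximation 95% interval for a difference of two proportions.
    se = math.sqrt(p_base * (1 - p_base) / n + p_do * (1 - p_do) / n)
    return effect, (effect - 1.96 * se, effect + 1.96 * se)


if __name__ == "__main__":
    for step in STEPS:
        effect, (lo, hi) = contrastive_effect(step)
        print(f"{step:9s} effect={effect:+.3f}  95% CI=({lo:+.3f}, {hi:+.3f})")
```

In this toy model the analytic effect of repairing "plan" is 0.475 - 0.05 = 0.425, while the other two steps have no effect, so their intervals should straddle zero.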
Key contributions
- Introduction of Causal Agent Replay (CAR) for causal attribution in LLM-agent failures.
- Development of a contrastive estimator and a Monte-Carlo Shapley estimator for step-level failure analysis.
- Validation of the method against synthetic structural causal models with planted ground truth.
Notable insights
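- Attribution is done by intervention rather than correlation: the step that executes a harmful action is usually not the step that decided on it, and LLM-judge attribution reaches only about 14% step-level accuracy on the Who&When benchmark.
- The two estimators cover different cases: the contrastive estimator recovers a single pivotal step, while the budget-bounded Monte-Carlo Shapley estimator splits credit across interacting steps. Every effect is reported with confidence intervals.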
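A permutation-sampling Shapley estimator of the kind mentioned above can be sketched as follows. This is a generic Monte-Carlo Shapley sketch under stated assumptions, not CAR's implementation: the step names, the `replay_fails` oracle, the planted two-step interaction (failure with probability 0.91 only when both `step_a` and `step_b` are left unrepaired), and the `budget` and `rollouts` parameters are all hypothetical.

```python
import random

STEPS = ["step_a", "step_b", "step_c"]  # hypothetical step names


def replay_fails(rng, repaired):
    """Hypothetical replay of one run with the steps in `repaired` fixed by intervention."""
    both_faulty = "step_a" not in repaired and "step_b" not in repaired
    return both_faulty and rng.random() < 0.91


def failure_rate(rng, repaired, cache, rollouts=2000):
    """Estimated failure rate for a set of repaired steps, cached per set."""
    key = frozenset(repaired)
    if key not in cache:
        cache[key] = sum(replay_fails(rng, key) for _ in range(rollouts)) / rollouts
    return cache[key]


def shapley(budget=300, seed=0):
    """Average each step's marginal drop in failure rate over sampled orderings."""
    rng = random.Random(seed)
    cache = {}
    phi = {step: 0.0 for step in STEPS}
    for _ in range(budget):  # the budget bounds the number of sampled permutations
        order = rng.sample(STEPS, len(STEPS))  # uniformly random permutation
        repaired = set()
        prev = failure_rate(rng, repaired, cache)
        for step in order:
            repaired.add(step)
            cur = failure_rate(rng, repaired, cache)
            phi[step] += (prev - cur) / budget  # marginal drop credited to this step
            prev = cur
    return phi


if __name__ == "__main__":
    for step, value in shapley().items():
        print(f"{step}: {value:.3f}")
```

Under these assumptions the analytic Shapley values are 0.455 for each of the two interacting steps and 0 for the third. Caching the estimated failure rate per repaired set keeps the number of replayed configurations at no more than 2^3 = 8, however many permutations the budget allows.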
Possible limitations
- Not stated in the abstract
Abstract
arXiv:2606.08275v1
When an LLM agent fails -- issues a refund it should not have, calls the wrong tool, leaks data -- existing tooling answers what happened (observability) or whether it passed (evaluation), but not which step caused the failure. The obvious heuristics are wrong: the step that executes the harmful action is usually not the step that decided on it, and LLM-judge attribution is correlational and unreliable (state-of-the-art step-level accuracy on the Who&When benchmark is about 14%). We present Causal Agent Replay (CAR), which answers the question by intervention: it models an agent run as a structural causal model, applies a do-operation to a step, and re-executes the trajectory forward under the same stochastic policy, measuring the shift in the outcome distribution. We define an intervention algebra over agent steps, a single-step contrastive estimator whose point-of-commitment rule resolves a confound specific to stochastic run-forward, and a budget-bounded Monte-Carlo Shapley estimator that splits credit across interacting steps. Every effect is reported with confidence intervals. We validate against synthetic structural causal models with planted ground truth: the contrastive estimator recovers the pivotal step, and Shapley recovers a two-step interaction (0.44, 0.45, ~0; efficiency sum 0.909 versus the analytic 0.91). CAR is open source and runs on hosted or free local models.
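The efficiency check quoted in the abstract rests on a standard property of Shapley values: the per-step attributions sum to the total effect being split. The notation below is an assumption for illustration and is not taken from the paper: N is the set of steps, v(S) is the outcome shift attributable to the steps in S, and phi_i is the credit given to step i.

```latex
\sum_{i \in N} \phi_i \;=\; v(N) - v(\varnothing),
\qquad
\lvert 0.909 - 0.91 \rvert = 0.001
```

Read this way, the reported estimates sum to within 0.001 of the analytic total. The three individual values in the abstract appear to be rounded, so they do not add up to exactly 0.909 as printed.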