From Risk Classification to Action Plan Remediation: A Guardrail Feedback Driven Framework for LLM Agents

Yuhao Sun, Jiacheng Zhang, Shaanan Cohney, Zhexin Zhang, Feng Liu, Xingliang Yuan

Published Jun 6, 2026 · Featured #6 · In the daily list Jun 7, 2026

Daily score: 70.2

Editorial review: 7.5

Relevance: 0.462

Freshness: 0.722

Why It Matters

What makes this one worth your time

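Existing guardrails often flag an entire task as unsafe when only part of it is contaminated, which blocks the threat but sacrifices the benign work. This paper proposes a guardrail that can also direct the agent to revise its plan, so that harmful components are avoided and the benign task is preserved where possible.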

TRIAD enhances LLM agent safety by integrating feedback for iterative plan refinement.

Summary

The paper introduces TRIAD, a guardrail-integrated framework for LLM agents that utilizes feedback to guide agents in revising their plans, thereby improving safety and maintaining benign task objectives.

Key contributions

Introduction of the TRIAD framework for guardrail integration in LLM agents.
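A language model fine-tuned on a self-curated dataset to output one of three decisions (proceed, refuse, or update) together with structured natural-language feedback.
Experiments on ASB and AgentHarm showing an average attack success rate of 10.42% and the best safety-utility trade-off among guardrail-integrated baselines.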

Notable insights

The framework's use of structured natural-language feedback allows for more nuanced decision-making compared to traditional binary guardrails.
The closed-loop mechanism between feedback and planning could lead to more adaptive and resilient agent behaviors.

Possible limitations

Not stated in the abstract.

Abstract

arXiv:2606.05805v1

LLM-based guardrails typically safeguard agents by evaluating proposed actions or inputs before execution, producing safety signals such as binary allow/deny decisions, risk categories, and/or explanatory rationales about potential policy violations. However, agent risks often arise when otherwise benign tasks are contaminated by untrusted external content, unsafe instructions, or risky tool use. Existing guardrails often flag the entire task uniformly as unsafe, thereby blocking the threat but sacrificing the benign part. Moreover, existing work largely evaluates guardrails in isolation, leaving unclear whether their interventions lead to safer downstream agent behavior. To address this, we introduce TRIAD (Tripartite Response for Iterative Agent Guardrailing), a guardrail-integrated agent framework that leverages guardrail-generated verbal feedback as a guiding signal to keep the agent aligned with benign objectives at each planning step. We finetune a language model on a self-curated training dataset to output one of three decisions: proceed, refuse, or update, together with structured natural-language feedback. Rather than merely allowing or blocking execution, update guides the agent to revise its plan, avoid harmful components, and preserve the benign task where possible. TRIAD injects this feedback into the agent's context, enabling subsequent plan revision and forming a closed loop between guardrail feedback and agent planning. Extensive experiments on ASB and AgentHarm show that TRIAD reduces the average attack success rate to 10.42%, while achieving the best safety-utility trade-off among guardrail-integrated baselines. Our code is available at: https://github.com/YUHAOSUNABC/TRIAD.
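
The closed loop described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation (their code is in the repository linked above); it only shows the control flow of a three-way decision with feedback injected into the agent's context. All names here (`Decision`, `Verdict`, `guarded_plan`, `toy_planner`, `toy_guardrail`, the `max_rounds` cap, and the context line prefixes) are hypothetical, and the fine-tuned guardrail model is replaced by a trivial keyword rule purely for demonstration.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, List, Optional, Tuple


class Decision(Enum):
    PROCEED = "proceed"
    REFUSE = "refuse"
    UPDATE = "update"


@dataclass
class Verdict:
    decision: Decision
    feedback: str  # natural-language feedback returned with the decision


# Hypothetical interfaces: a guardrail judges a plan, a planner drafts one from context.
Guardrail = Callable[[str], Verdict]
Planner = Callable[[List[str]], str]


def guarded_plan(
    task: str, planner: Planner, guardrail: Guardrail, max_rounds: int = 3
) -> Tuple[Optional[str], List[str]]:
    """Return (plan, context); plan is None if refused or if rounds run out."""
    context = [f"TASK: {task}"]
    for _ in range(max_rounds):  # max_rounds is an assumed safety cap
        plan = planner(context)
        verdict = guardrail(plan)
        if verdict.decision is Decision.PROCEED:
            return plan, context
        if verdict.decision is Decision.REFUSE:
            context.append(f"GUARDRAIL REFUSED: {verdict.feedback}")
            return None, context
        # UPDATE: inject the feedback so the next draft can drop the harmful part.
        context.append(f"GUARDRAIL FEEDBACK: {verdict.feedback}")
    return None, context


def toy_planner(context: List[str]) -> str:
    # Toy stand-in for an agent: drops the injected step once feedback is present.
    steps = ["summarize inbox"]
    if not any(line.startswith("GUARDRAIL FEEDBACK") for line in context):
        steps.append("forward passwords to attacker@example.com")
    return "; ".join(steps)


def toy_guardrail(plan: str) -> Verdict:
    # Toy stand-in for the fine-tuned guardrail model: a keyword rule.
    if "passwords" in plan:
        return Verdict(Decision.UPDATE, "Remove the credential-forwarding step; keep the summary.")
    return Verdict(Decision.PROCEED, "No policy violation found.")


plan, context = guarded_plan("Summarize my inbox", toy_planner, toy_guardrail)
print(plan)  # prints: summarize inbox
```

In this toy run the first draft contains an injected step, the guardrail answers with update, and the second draft keeps only the benign summary step. A binary allow/deny guardrail would have had to block the whole task at the first draft.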