Think Twice Before You Act: Enhancing Agent Behavioral Safety with Thought Correction

Changyue Jiang, Wenqi Zhang, Xudong Pan, Geng Hong, Min Yang

Published May 27, 2026Featured #3In the daily list May 28, 2026

Open on arXiv Read PDF

Daily score69.6

Editorial review7.2

Relevance0.505

Freshness0.722

Why It Matters

What makes this one worth your time

This approach provides a scalable solution to improve the safety of AI agents without altering their underlying models, which is crucial for deploying AI in real-world applications where safety is paramount.

Thought-Aligner enhances agent safety by correcting unsafe thoughts before actions.

Summary

The paper introduces Thought-Aligner, a model-agnostic plug-in that enhances the behavioral safety of LLM-based agents by correcting unsafe intermediate thoughts before actions are executed. It uses a two-stage contrastive learning approach on paired safe and unsafe thoughts across various risk scenarios, demonstrating improved safety and helpfulness in experiments.

Key contributions

Introduction of Thought-Aligner, a plug-in safety model for correcting unsafe thoughts.
Demonstration of improved safety and helpfulness in LLM-based agents through experiments.
Development of a two-stage contrastive learning method for training the safety model.

Notable insights

The use of causal correction on intermediate thoughts rather than final outputs is a novel approach to improving agent safety.
The model-agnostic nature of Thought-Aligner allows it to be integrated into diverse agent frameworks without intrusive modifications.

Possible limitations

Not stated in the abstract

Abstract

arXiv:2505.11063v3 Announce Type: replace Abstract: LLM-based agents solve complex tasks through iterative reasoning, tool use, and environment interaction, where each intermediate thought directly shapes subsequent actions. Small deviations in these thoughts can therefore propagate into unsafe behaviors, yet existing guardrails typically operate only on final outputs or require intrusive model modifications. We introduce Thought-Aligner, a lightweight plug-in safety model that performs causal correction on unsafe thoughts before action execution, without altering the underlying agent. The corrected thoughts are fed back into the agent, steering its decision process and tool use toward safer trajectories. Because it operates solely at the thought level, Thought-Aligner is model-agnostic and can be integrated into diverse agent frameworks. We train Thought-Aligner via two-stage contrastive learning on paired safe and unsafe thoughts generated across ten risk scenarios. Experiments on diverse agent-safety benchmarks and six LLMs show that Thought-Aligner increases behavioral safety from about 50% without protection to around 90% on average, exceeding state-of-the-art guardrails by roughly 23%, while also improving helpfulness by about 5%. The method incurs low per-step latency and minimal overhead, enabling scalable and practical deployment. We publicly release Thought-Aligner-7B at https://huggingface.co/WhitzardAgent/Thought-Aligner-7B.