SafeHarbor: Defining Precise Decision Boundaries via Hierarchical Memory-Augmented Guardrail for LLM Agent Safety

Zhe Liu, Zonghao Ying, Wenxin Zhang, Quanchen Zou, Deyue Zhang, Dongdong Yang, Xiangzheng Zhang, Hao Peng

Published Jul 24, 2026 | Featured #3 | In the daily list Jul 25, 2026

Daily score: 68.1
Editorial review: 7.2
Relevance: 0.476
Freshness: 0.722

Why It Matters

What makes this one worth your time

As LLMs evolve into autonomous agents, ensuring their safe operation in real-world environments is crucial to prevent misuse and maintain user trust.

SafeHarbor enhances LLM agent safety by dynamically defining decision boundaries with a hierarchical memory system.

Summary

The paper introduces SafeHarbor, a framework that improves the safety of LLM agents by defining precise decision boundaries through a hierarchical memory-augmented guardrail. It dynamically injects context-aware defense rules and optimizes its memory structure with an information entropy-based self-evolution mechanism. The authors report a peak benign utility of 63.6% on GPT-4o while keeping the refusal rate on harmful requests above 93%.

Key contributions

Proposes SafeHarbor, a framework for defining precise decision boundaries for LLM agents.
Introduces a hierarchical memory system for dynamic rule injection without the need for retraining.
Develops an information entropy-based self-evolution mechanism for optimizing memory structure.
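
To make the idea of training-free rule injection more concrete, here is a minimal sketch of how rules stored in a small topic tree might be retrieved and appended to an agent's system prompt. It is an illustration only, not the authors' implementation: the `MemoryNode` class, the keyword-overlap matching, and the example rules are all assumptions made for this sketch.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class MemoryNode:
    """One node of a hypothetical rule hierarchy (all names are illustrative)."""
    topic: str
    keywords: set[str]
    rules: list[str] = field(default_factory=list)
    children: list[MemoryNode] = field(default_factory=list)


def overlap(node: MemoryNode, request: str) -> int:
    """Count how many of the node's keywords occur in the request."""
    words = set(request.lower().split())
    return len(node.keywords & words)


def retrieve_rules(root: MemoryNode, request: str) -> list[str]:
    """Walk from the root towards the best-matching child, collecting rules."""
    collected = list(root.rules)
    node = root
    while node.children:
        best = max(node.children, key=lambda child: overlap(child, request))
        if overlap(best, request) == 0:
            break  # no child is relevant; stop at the current level
        collected.extend(best.rules)
        node = best
    return collected


def inject(system_prompt: str, rules: list[str]) -> str:
    """Append retrieved rules to the prompt; the model weights are untouched."""
    if not rules:
        return system_prompt
    bullet_list = "\n".join(f"- {rule}" for rule in rules)
    return f"{system_prompt}\n\nSafety rules for this request:\n{bullet_list}"


# Toy hierarchy with made-up rules, purely for demonstration.
root = MemoryNode(
    topic="root",
    keywords=set(),
    rules=["Refuse actions that cause real-world harm."],
    children=[
        MemoryNode("email", {"email", "send", "inbox"},
                   ["Do not send bulk unsolicited email."]),
        MemoryNode("files", {"file", "delete", "disk"},
                   ["Confirm before deleting user files."]),
    ],
)

rules = retrieve_rules(root, "please delete the old log file")
prompt = inject("You are a helpful agent.", rules)
```

Because the rules live in an external structure and reach the model only through the prompt, updating them requires no retraining, which is the property the contribution list highlights. A real system would likely use embeddings rather than keyword overlap for matching.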

Notable insights

The use of a local hierarchical memory system for dynamic rule injection is a novel approach to enhancing LLM safety.
An information entropy-based self-evolution mechanism continuously optimizes the memory structure, potentially improving adaptability and efficiency.
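
One plausible reading of an entropy-driven split-and-merge policy is sketched below. It assumes each memory node holds past cases labelled benign or harmful and uses Shannon entropy, H = sum over labels of p * log2(1/p), as a measure of how mixed a node is. The thresholds, the minimum node size, and the function names are assumptions chosen for illustration; the abstract does not specify the actual criteria.

```python
import math
from collections import Counter


def shannon_entropy(labels: list[str]) -> float:
    """Shannon entropy (in bits) of the label distribution in a node."""
    total = len(labels)
    if total == 0:
        return 0.0
    counts = Counter(labels)
    return sum((c / total) * math.log2(total / c) for c in counts.values())


def should_split(labels: list[str], threshold: float = 0.9,
                 min_size: int = 4) -> bool:
    """Split a node whose cases are too mixed (hypothetical threshold)."""
    return len(labels) >= min_size and shannon_entropy(labels) > threshold


def should_merge(left: list[str], right: list[str],
                 threshold: float = 0.3) -> bool:
    """Merge siblings if the combined node would stay nearly pure."""
    return shannon_entropy(left + right) <= threshold


# Toy label sets, not data from the paper.
mixed = ["benign", "harmful", "benign", "harmful"]
pure = ["harmful", "harmful", "harmful", "harmful"]

split_mixed = should_split(mixed)          # evenly mixed node: entropy is 1 bit
split_pure = should_split(pure)            # pure node: entropy is 0 bits
merge_pure = should_merge(pure, ["harmful", "harmful"])
merge_mixed = should_merge(["benign"] * 3, ["harmful"] * 3)
```

Under this reading, a high-entropy node signals an imprecise decision boundary and is refined by splitting, while low-entropy siblings carry redundant rules and can be merged to keep the memory compact.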

Possible limitations

Not stated in the abstract

Abstract

arXiv:2605.05704v3

Recent advances in foundation models have transformed LLMs from passive conversational systems into autonomous agents capable of reasoning and tool execution. While these capabilities unlock substantial practical value, they also introduce new security risks, as adversaries can manipulate agents into performing harmful actions in real-world environments. Existing defense strategies mitigate such threats but frequently struggle to balance safety and utility, resulting in over-refusal of benign user requests. To mitigate this trade-off, we propose SafeHarbor, a novel framework designed to establish precise decision boundaries for LLM agents. Unlike static guidelines, SafeHarbor extracts context-aware defense rules through enhanced adversarial generation. We design a local hierarchical memory system for dynamic rule injection, offering a training-free, efficient, and plug-and-play solution. Furthermore, we introduce an information entropy-based self-evolution mechanism that continuously optimizes the memory structure through dynamic node splitting and merging. Extensive experiments demonstrate that SafeHarbor achieves state-of-the-art performance on both ambiguous benign tasks and explicit malicious attacks, notably attaining a peak benign utility of 63.6% on GPT-4o while maintaining a robust refusal rate exceeding 93% against harmful requests. The source code is publicly available at https://github.com/ljj-cyber/SafeHarbor.
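
The safety-utility trade-off in the abstract is reported through two numbers, benign utility and refusal rate. The sketch below shows one way such metrics could be computed from per-episode outcomes. The definitions here are assumptions (benign utility as the share of benign tasks completed without refusal, refusal rate as the share of harmful requests refused) and may differ from the benchmark definitions used in the paper; the `Episode` records are toy values, not experimental data.

```python
from dataclasses import dataclass


@dataclass
class Episode:
    harmful: bool    # ground-truth label of the request
    refused: bool    # whether the agent refused
    completed: bool  # whether the agent finished the task


def benign_utility(episodes: list[Episode]) -> float:
    """Share of benign requests completed without a refusal (assumed definition)."""
    benign = [e for e in episodes if not e.harmful]
    if not benign:
        return 0.0
    return sum(e.completed and not e.refused for e in benign) / len(benign)


def refusal_rate(episodes: list[Episode]) -> float:
    """Share of harmful requests that were refused (assumed definition)."""
    harmful = [e for e in episodes if e.harmful]
    if not harmful:
        return 0.0
    return sum(e.refused for e in harmful) / len(harmful)


# Toy log: four benign and four harmful requests.
log = [
    Episode(harmful=False, refused=False, completed=True),
    Episode(harmful=False, refused=False, completed=True),
    Episode(harmful=False, refused=False, completed=True),
    Episode(harmful=False, refused=True, completed=False),  # over-refusal
    Episode(harmful=True, refused=True, completed=False),
    Episode(harmful=True, refused=True, completed=False),
    Episode(harmful=True, refused=True, completed=False),
    Episode(harmful=True, refused=False, completed=True),   # missed attack
]

utility = benign_utility(log)
refusals = refusal_rate(log)
```

A guardrail that refuses everything would maximize the second metric while driving the first to zero, which is why the two are reported together.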