SafeHarbor: Hierarchical Memory-Augmented Guardrail for LLM Agent Safety

Zhe Liu, Zonghao Ying, Wenxin Zhang, Quanchen Zou, Deyue Zhang, Dongdong Yang, Xiangzheng Zhang, Hao Peng

Published: May 25, 2026
Featured: #10
In the daily list: May 26, 2026
Daily score: 62.6
Editorial review: 7.2
Relevance: 0.474
Freshness: 0.722

Why It Matters

What makes this one worth your time

As LLMs transition into autonomous agents, ensuring their safe operation in real-world environments is crucial to prevent misuse and maintain user trust.

SafeHarbor enhances LLM agent safety with dynamic, context-aware decision rules.

Summary

The paper introduces SafeHarbor, a framework for enhancing the safety of LLM agents by establishing context-aware decision boundaries using a hierarchical memory system and an entropy-based self-evolution mechanism.

Key contributions

  • Proposes SafeHarbor, a framework for context-aware safety in LLM agents.
  • Introduces a hierarchical memory system for dynamic rule management.
  • Develops an entropy-based mechanism for self-evolution of memory structures.

Notable insights

  • The use of a local hierarchical memory system for dynamic rule injection is a novel approach to balancing safety and utility.
  • An information entropy-based self-evolution mechanism allows the system to continuously optimize its memory structure.
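To make the idea of hierarchical memory with dynamic rule injection concrete, here is a minimal sketch. It is not the authors' implementation: the `MemoryNode` class, the keyword-overlap matching, and the `retrieve_rules` and `inject_rules` helpers are all hypothetical names chosen for illustration. The sketch assumes rules are stored on tree nodes, and that rules from every node whose keywords match the request are collected and appended to the agent's prompt.

```python
from dataclasses import dataclass, field


@dataclass
class MemoryNode:
    """Hypothetical memory node: a topic with keywords, rules, and sub-topics."""
    name: str
    keywords: set
    rules: list = field(default_factory=list)
    children: list = field(default_factory=list)


def retrieve_rules(node, request_tokens):
    """Collect rules along every matching path, parent rules first.

    A node with an empty keyword set (such as the root) always matches.
    Keyword overlap is an assumed stand-in for whatever matching the
    real system uses.
    """
    if node.keywords and not (node.keywords & request_tokens):
        return []
    collected = list(node.rules)
    for child in node.children:
        collected.extend(retrieve_rules(child, request_tokens))
    return collected


def inject_rules(system_prompt, rules):
    """Append the retrieved rules to the prompt; no model training involved."""
    if not rules:
        return system_prompt
    bullets = "\n".join(f"- {rule}" for rule in rules)
    return f"{system_prompt}\n\nSafety rules for this context:\n{bullets}"
```

Because the rules are only added to the prompt at request time, a design like this stays training-free and can be attached to an existing agent, which is the plug-and-play property the paper claims.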

Possible limitations

  • Not stated in the abstract

Abstract

arXiv:2605.05704v2

Recent advances in foundation models have transformed LLMs from passive conversational systems into autonomous agents capable of reasoning and tool execution. While these capabilities unlock substantial practical value, they also introduce new security risks, as adversaries can manipulate agents into performing harmful actions in real-world environments. Existing defense strategies mitigate such threats but frequently struggle to balance safety and utility, resulting in over-refusal of benign user requests. To mitigate this trade-off, we propose SafeHarbor, a novel framework designed to establish precise decision boundaries for LLM agents. Unlike static guidelines, SafeHarbor extracts context-aware defense rules through enhanced adversarial generation. We design a local hierarchical memory system for dynamic rule injection, offering a training-free, efficient, and plug-and-play solution. Furthermore, we introduce an information entropy-based self-evolution mechanism that continuously optimizes the memory structure through dynamic node splitting and merging. Extensive experiments demonstrate that SafeHarbor achieves state-of-the-art performance on both ambiguous benign tasks and explicit malicious attacks, notably attaining a peak benign utility of 63.6% on GPT-4o while maintaining a robust refusal rate exceeding 93% against harmful requests. The source code is publicly available at https://github.com/ljj-cyber/SafeHarbor.
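The abstract does not say how entropy drives node splitting and merging, so the following is only a sketch under stated assumptions. It assumes each memory node stores the safe/unsafe labels of the cases routed to it, that a node with mixed labels (high Shannon entropy) is a candidate for splitting, and that two nodes whose combined labels are nearly uniform (low entropy) are candidates for merging. The function names and both thresholds are hypothetical.

```python
import math
from collections import Counter


def label_entropy(labels):
    """Shannon entropy, in bits, of the labels stored in one memory node."""
    total = len(labels)
    if total == 0:
        return 0.0
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def should_split(labels, threshold=0.9):
    """Assumed rule: split a node whose stored cases disagree strongly."""
    return label_entropy(labels) > threshold


def should_merge(labels_a, labels_b, threshold=0.3):
    """Assumed rule: merge two nodes if their pooled labels stay nearly pure."""
    return label_entropy(list(labels_a) + list(labels_b)) < threshold
```

For example, a node holding two "safe" and two "unsafe" cases has an entropy of exactly 1 bit, so `should_split` returns `True` at the default threshold, while two nodes holding only "unsafe" cases pool to 0 bits and `should_merge` returns `True`.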