Owner-Harm: A Missing Threat Model for AI Agent Safety

Dongcheng Zhang, Yiqing Jiang

Published Apr 22, 2026

Editorial review7.5

Relevance0.458

Freshness0.000

Why It Matters

What makes this one worth your time

Understanding and mitigating owner-harm is crucial for the safe deployment of AI agents, especially as their use becomes more prevalent in sensitive commercial environments.

Owner-Harm fills a critical gap in AI agent safety by addressing threats to deployers.

Summary

The paper introduces the Owner-Harm threat model, identifying a gap in existing AI safety benchmarks by categorizing harmful behaviors that AI agents can inflict on their deployers, and presents empirical results demonstrating the limitations of current defenses against such threats.

Key contributions

Development of the Owner-Harm threat model with eight distinct categories of harmful agent behavior.
Empirical evaluation of existing safety systems revealing significant gaps in owner-harm detection.
Introduction of the SSDG framework to relate information coverage to detection efficacy.

Notable insights

The proposed Owner-Harm model categorizes specific harmful behaviors that AI agents can exhibit towards their deployers, which have been largely overlooked in existing safety frameworks.
The introduction of the SSDG framework highlights the relationship between information coverage and detection rates, suggesting that context is vital for effective harm detection.

Possible limitations

Not stated in the abstract.

Abstract

arXiv:2604.18658v1 Announce Type: cross Abstract: Existing AI agent safety benchmarks focus on generic criminal harm (cybercrime, harassment, weapon synthesis), leaving a systematic blind spot for a distinct and commercially consequential threat category: agents harming their own deployers. Real-world incidents illustrate the gap: Slack AI credential exfiltration (Aug 2024), Microsoft 365 Copilot calendar-injection leaks (Jan 2024), and a Meta agent unauthorized forum post exposing operational data (Mar 2026). We propose Owner-Harm, a formal threat model with eight categories of agent behavior damaging the deployer. We quantify the defense gap on two benchmarks: a compositional safety system achieves 100% TPR / 0% FPR on AgentHarm (generic criminal harm) yet only 14.8% (4/27; 95% CI: 5.9%-32.5%) on AgentDojo injection tasks (prompt-injection-mediated owner harm). A controlled generic-LLM baseline shows the gap is not inherent to owner-harm (62.7% vs. 59.3%, delta 3.4 pp) but arises from environment-bound symbolic rules that fail to generalize across tool vocabularies. On a post-hoc 300-scenario owner-harm benchmark, the gate alone achieves 75.3% TPR / 3.3% FPR; adding a deterministic post-audit verifier raises overall TPR to 85.3% (+10.0 pp) and Hijacking detection from 43.3% to 93.3%, demonstrating strong layer complementarity. We introduce the Symbolic-Semantic Defense Generalization (SSDG) framework relating information coverage to detection rate. Two SSDG experiments partially validate it: context deprivation amplifies the detection gap 3.4x (R = 3.60 vs. R = 1.06); context injection reveals structured goal-action alignment, not text concatenation, is required for effective owner-harm detection.