The Blind Spot of Agent Safety: How Benign User Instructions Expose Critical Vulnerabilities in Computer-Use Agents

Xuwei Ding, Skylar Zhai, Linxin Song, Jiate Li, Taiwei Shi, Nicholas Meade, Siva Reddy, Jian Kang, Jieyu Zhao

Published Apr 21, 2026 · Featured #4 · In the daily list Apr 22, 2026


Daily score: 64.8

Editorial review: 7.5

Relevance: 0.478

Freshness: 0.722

Why It Matters

What makes this one worth your time

This research uncovers critical vulnerabilities that appear when CUAs follow entirely benign user instructions.

Understanding these vulnerabilities is crucial for developing safer AI systems, especially as CUAs become more prevalent in complex digital environments.

Summary

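The paper introduces OS-BLIND, a benchmark of 300 human-crafted tasks that evaluates computer-use agents (CUAs) in settings where the user instruction is benign and the harm arises from the task context or execution outcome. Most evaluated CUAs exceed a 90% attack success rate (ASR), and deploying Claude 4.5 Sonnet in a multi-agent system raises its ASR from 73.0% to 92.7%.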

Key contributions

Introduction of the OS-BLIND benchmark for evaluating CUAs under unintended attack conditions.
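Empirical analysis showing that most evaluated CUAs exceed a 90% attack success rate, with the safety-aligned Claude 4.5 Sonnet still reaching 73.0%.
Identification of a limitation in current safety alignment: it primarily activates within the first few steps and rarely re-engages during subsequent execution.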

Notable insights

The attack success rate of Claude 4.5 Sonnet rises from 73.0% to 92.7% when it is deployed in multi-agent systems, because decomposed subtasks obscure the harmful intent from the model.
Existing safety defenses provide limited protection when user instructions are benign.

Possible limitations

Not stated in the abstract.

Abstract

arXiv:2604.10577v2

Computer-use agents (CUAs) can now autonomously complete complex tasks in real digital environments, but when misled, they can also be used to automate harmful actions programmatically. Existing safety evaluations largely target explicit threats such as misuse and prompt injection, but overlook a subtle yet critical setting where user instructions are entirely benign and harm arises from the task context or execution outcome. We introduce OS-BLIND, a benchmark that evaluates CUAs under unintended attack conditions, comprising 300 human-crafted tasks across 12 categories, 8 applications, and 2 threat clusters: environment-embedded threats and agent-initiated harms. Our evaluation on frontier models and agentic frameworks reveals that most CUAs exceed 90% attack success rate (ASR), and even the safety-aligned Claude 4.5 Sonnet reaches 73.0% ASR. More interestingly, this vulnerability becomes even more severe, with ASR rising from 73.0% to 92.7% when Claude 4.5 Sonnet is deployed in multi-agent systems. Our analysis further shows that existing safety defenses provide limited protection when user instructions are benign. Safety alignment primarily activates within the first few steps and rarely re-engages during subsequent execution. In multi-agent systems, decomposed subtasks obscure the harmful intent from the model, causing safety-aligned models to fail. We will release our OS-BLIND to encourage the broader research community to further investigate and address these safety challenges.
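
The abstract reports results as attack success rate (ASR) but does not spell out how it is scored. As a minimal sketch, assuming ASR is simply the share of tasks in which the agent carried out the harmful action, the metric and a per-threat-cluster breakdown could be computed as below. The `TaskResult` record, its field names, and both helper functions are hypothetical and are not taken from the OS-BLIND release.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskResult:
    """Outcome of one benchmark task (hypothetical record layout)."""
    task_id: str
    threat_cluster: str      # e.g. "environment-embedded" or "agent-initiated"
    harmful_completed: bool  # True if the agent carried out the harmful action


def attack_success_rate(results):
    """Return ASR as a percentage: harmful completions / total tasks.

    Assumption: every task is weighted equally.
    """
    if not results:
        raise ValueError("no results to score")
    hits = sum(r.harmful_completed for r in results)
    return 100.0 * hits / len(results)


def asr_by_cluster(results):
    """Group results by threat cluster and score each group separately."""
    groups = defaultdict(list)
    for r in results:
        groups[r.threat_cluster].append(r)
    return {cluster: attack_success_rate(rs) for cluster, rs in groups.items()}


if __name__ == "__main__":
    # Synthetic toy data for illustration only, not results from the paper.
    demo = [
        TaskResult("t1", "environment-embedded", True),
        TaskResult("t2", "environment-embedded", True),
        TaskResult("t3", "agent-initiated", True),
        TaskResult("t4", "agent-initiated", False),
    ]
    print(attack_success_rate(demo))
    print(asr_by_cluster(demo))
```

Under the same equal-weighting assumption, and if the headline figures are computed over all 300 tasks, 73.0% would correspond to 219 harmful completions and 92.7% to 278 (278/300 ≈ 92.67%). The abstract does not confirm this denominator.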