Human-Guided Harm Recovery for Computer Use Agents

Christy Li, Sky CH-Wang, Andi Peng, Andreea Bobu

Published May 29, 2026

Editorial review: 7.5

Relevance: 0.451

Freshness: 0.000

Why It Matters

As AI agents increasingly interact with real systems, effective harm recovery mechanisms are crucial for ensuring safety and user trust.

The paper proposes an approach for recovering from harmful actions taken by computer use agents in a way that aligns with human preferences.

Summary

The paper presents a framework for harm recovery in computer use agents: steering an agent from a harmful state back to a safe one in alignment with human preferences. The framework is grounded in a formative user study, implemented as a reward model that re-ranks candidate recovery plans, and evaluated on a new benchmark.

Key contributions

Formalization of harm recovery as a problem in AI safety.
Development of a natural language rubric based on user preferences for recovery.
Creation of BackBench, a benchmark for evaluating recovery tasks in computer use agents.

Notable insights

A dataset of 1,130 pairwise judgments reveals context-dependent shifts in which recovery attributes users value, such as preferences for pragmatic, targeted strategies over comprehensive long-term approaches.
BackBench comprises 50 computer-use tasks that test an agent's ability to recover from harmful states, and human evaluation on it rates the reward model scaffold's recovery trajectories above those of base agents and rubric-based scaffolds.

Possible limitations

Not stated in the abstract.

Abstract

arXiv:2604.18847v2

As LM agents gain the ability to execute actions on real computer systems, we need ways to not only prevent harmful actions at scale but also effectively remediate harm when prevention fails. We formalize a solution to this neglected challenge in post-execution safeguards as harm recovery: the problem of optimally steering an agent from a harmful state back to a safe one in alignment with human preferences. We ground preference-aligned recovery through a formative user study that identifies valued recovery dimensions and produces a natural language rubric. Our dataset of 1,130 pairwise judgments reveals context-dependent shifts in attribute importance, such as preferences for pragmatic, targeted strategies over comprehensive long-term approaches. We operationalize these learned insights in a reward model, re-ranking multiple candidate recovery plans generated by an agent scaffold at test time. To evaluate recovery capabilities systematically, we introduce BackBench, a benchmark of 50 computer-use tasks that test an agent's ability to recover from harmful states. Human evaluation shows our reward model scaffold yields higher-quality recovery trajectories than base agents and rubric-based scaffolds. Together, these contributions lay the foundation for a new class of agent safety methods -- ones that confront harm not only by preventing it, but by navigating its aftermath with alignment and intent.
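
The test-time re-ranking step described in the abstract can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration: `RecoveryPlan`, `rerank_plans`, and `toy_reward` do not come from the paper, and the toy scoring function is only a placeholder for a learned reward model. It simply prefers shorter plans, loosely echoing the reported preference for targeted strategies. The sketch shows the general pattern only: score each candidate plan, then pick the highest-scoring one.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class RecoveryPlan:
    """Hypothetical container for one candidate recovery plan."""
    description: str
    steps: tuple[str, ...]


def rerank_plans(
    plans: Sequence[RecoveryPlan],
    reward_fn: Callable[[RecoveryPlan], float],
) -> list[tuple[float, RecoveryPlan]]:
    """Score every candidate and return (score, plan) pairs, best first."""
    if not plans:
        raise ValueError("at least one candidate plan is required")
    scored = [(reward_fn(plan), plan) for plan in plans]
    # Sort on the score only, so plans themselves never need to be comparable.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored


def toy_reward(plan: RecoveryPlan) -> float:
    """Placeholder reward (assumption): fewer steps scores higher.

    A real system would call a learned reward model here instead.
    """
    return -float(len(plan.steps))


if __name__ == "__main__":
    candidates = [
        RecoveryPlan("full audit", ("restore file", "rotate keys", "audit logs")),
        RecoveryPlan("targeted fix", ("restore file",)),
        RecoveryPlan("fix and verify", ("restore file", "verify checksum")),
    ]
    best_score, best_plan = rerank_plans(candidates, toy_reward)[0]
    print(best_plan.description, best_score)
```

Passing the scoring function as a parameter keeps the re-ranking logic independent of how the reward is computed, so the placeholder can be swapped for a trained model without changing the selection code.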