Position: AI Safety Requires Effective Controllability

Yige Li, Yunhao Feng, Jun Sun

Published May 27, 2026
Featured #8
In the daily list May 28, 2026

Daily score: 60.3

Editorial review: 6.8

Relevance: 0.515

Freshness: 0.722

Why It Matters

What makes this one worth your time

Deploying AI systems safely in dynamic environments depends on being able to control and override them in real time.

AI safety must prioritize controllability to ensure systems can be reliably interrupted and redirected.

Summary

The paper argues for the importance of controllability in AI safety, proposing it as a primary objective alongside alignment. It introduces a benchmark, controlbench, to evaluate controllability failures in high-risk scenarios and suggests a control-centric architectural framework for future AI systems.

Key contributions

Proposal of controllability as a first-class objective in AI safety.
Introduction of controlbench for evaluating controllability failures.
A control-centric architectural framework for designing future AI systems.

Notable insights

The introduction of controlbench as a benchmark for assessing controllability in AI systems.
Highlighting the gap between alignment and controllability in ensuring AI safety.

Possible limitations

Not stated in the abstract

Abstract

arXiv:2605.27117v1

AI safety is still largely framed as alignment: training models to follow human preferences, safety policies, and normative constraints. That framing has improved the behavior of modern language models, but aligned behavior does not by itself guarantee that a deployed agent can be stopped, overridden, or constrained once it operates in open-ended, interactive, and tool-using environments. A system may be safe in expectation and still fail to yield to explicit runtime authority under conflicting instructions, long-horizon execution, adversarial inputs, or risky tool use. This position paper argues that AI safety therefore requires controllability as a first-class objective. We define controllability as the ability of an AI system to remain reliably interruptible, overridable, redirectable, and constrainable by explicit control signals at runtime while preserving ordinary utility when such signals are absent. To study this gap, we introduce controlbench, a benchmark for evaluating controllability failures in high-risk agentic scenarios. Experiments with OpenClaw-based agents show that current alignment and guardrail mechanisms reduce risk, but often fail to provide persistent, authoritative, and enforceable runtime control. We therefore propose a control-centric architectural framework that highlights explicit control planes, runtime intervention pathways, persistent control states, and auditable decision interfaces as key design principles for future controllable AI systems.
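
The four design principles named in the abstract can be made concrete with a small sketch. The abstract does not describe the authors' implementation, so everything below is an assumption made for illustration: the names `Signal`, `ControlPlane`, and `run_agent` are hypothetical, and the agent is reduced to a fixed list of tool calls. The sketch shows a control plane that holds persistent control state, an intervention check that runs before every action, and an audit log of each decision.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable, Dict, List, Optional, Set, Tuple

Step = Tuple[str, str]  # (tool name, argument); a hypothetical plan format


class Signal(Enum):
    STOP = "stop"            # interrupt: halt all further actions
    DENY_TOOL = "deny_tool"  # constrain: forbid one named tool
    REDIRECT = "redirect"    # redirect: replace the remaining plan


@dataclass
class ControlPlane:
    """Hypothetical control plane; its state outlives any single agent run."""
    stopped: bool = False
    denied_tools: Set[str] = field(default_factory=set)
    pending_plan: Optional[List[Step]] = None
    audit_log: List[Tuple[str, str]] = field(default_factory=list)

    def send(self, signal: Signal, payload=None) -> None:
        """Record a control signal in the persistent state and the audit log."""
        if signal is Signal.STOP:
            self.stopped = True
        elif signal is Signal.DENY_TOOL:
            self.denied_tools.add(payload)
        elif signal is Signal.REDIRECT:
            self.pending_plan = list(payload)
        self.audit_log.append(("signal", signal.value))


def run_agent(
    plan: List[Step],
    tools: Dict[str, Callable[[str], str]],
    control: ControlPlane,
    on_step: Optional[Callable[[ControlPlane], None]] = None,
) -> List[str]:
    """Execute a plan, consulting the control plane before every action."""
    queue = list(plan)
    results: List[str] = []
    while True:
        if control.stopped:
            control.audit_log.append(("halt", "stop signal honored"))
            break
        if control.pending_plan is not None:
            queue = list(control.pending_plan)
            control.pending_plan = None
            control.audit_log.append(("redirect", "plan replaced"))
        if not queue:
            break
        tool_name, arg = queue.pop(0)
        if tool_name in control.denied_tools:
            control.audit_log.append(("blocked", tool_name))
            continue
        results.append(tools[tool_name](arg))
        control.audit_log.append(("executed", tool_name))
        if on_step is not None:
            on_step(control)  # stands in for an operator acting mid-run
    return results


# Usage: deny one tool up front, then run a three-step plan.
tools = {"read": lambda x: f"read {x}", "delete": lambda x: f"deleted {x}"}
plan = [("read", "a"), ("delete", "b"), ("read", "c")]
control = ControlPlane()
control.send(Signal.DENY_TOOL, "delete")
print(run_agent(plan, tools, control))
print(control.audit_log)
```

The point of the design is that the check lives in the execution loop, outside the model, so a control signal is enforced whether or not the agent's own policy would have complied. A real agent would need the same check around asynchronous and long-running tool calls, which this sketch does not attempt.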