When Safe Skills Collide: Measuring Compositional Risk in Agent Skill Ecosystems

Su Wang, Pin Qian, Yihang Chen, Junxian You, Xiaoyuan Wang, Xiaochong Jiang, Lifei Liu, Haoran Yu, Jingzhou Xu

Published Jun 2, 2026

Editorial review: 6.8

Relevance: 0.493

Freshness: 0.000

Why It Matters

What makes this one worth your time

Per-skill scanning cannot catch risks that arise only when individually safe skills are installed together, so compositional checks matter for the safety and reliability of agentic AI deployments.

The SkillReact framework measures compositional risks that per-skill inspection misses by construction.

Summary

The paper investigates the safety of agentic AI systems by examining whether individually safe skills can compose into unsafe installed skill sets. It introduces SkillReact, a framework for measuring compositional security, and applies it to 1,520 ClawHub skills, identifying risks that are not apparent when skills are evaluated individually.

Key contributions

Development of the SkillReact framework for compositional security measurement.
Identification of compositional risks in a large skill ecosystem using a combination of static analysis and human adjudication.
Demonstration, using an action-based exploitability harness, that host models respond differently to the same candidate risks, so model behavior determines whether a risk is realized.

Notable insights

The framework reveals that individually safe skills can form unsafe combinations, highlighting a gap in current safety evaluations.
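The realization of compositional risks depends on the host model's disposition: a composition fixes which capabilities are reachable, while the host model decides whether to use them. The authors take this to motivate install-time compositional checks and capability isolation as complements to per-skill scanning.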
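To make the idea of individually safe skills forming an unsafe combination concrete, here is a minimal sketch of a pairwise check. It is not the paper's benchmark: the capability tags, the `RISK_PATTERNS` table, the skill names, and the `flag_pairs` helper are all hypothetical, chosen only to mirror the download-then-execute chain mentioned in the abstract.

```python
from itertools import combinations

# HYPOTHETICAL risk patterns: each pattern is a set of capabilities that is
# treated as risky only when all of them are reachable together.
RISK_PATTERNS = {
    "dropper": frozenset({"download", "execute"}),
}


def flag_pairs(skills, patterns=RISK_PATTERNS):
    """Return (skill_a, skill_b, pattern) for every pair of skills that
    completes a pattern which neither skill completes on its own."""
    hits = []
    for (a, caps_a), (b, caps_b) in combinations(sorted(skills.items()), 2):
        for name, needed in patterns.items():
            alone = needed <= caps_a or needed <= caps_b
            together = needed <= (caps_a | caps_b)
            if together and not alone:
                hits.append((a, b, name))
    return hits


# HYPOTHETICAL installed skills with made-up capability tags.
installed = {
    "fetch-url": {"download"},
    "run-script": {"execute"},
    "summarize": {"read"},
}
print(flag_pairs(installed))
```

A check like this only shows which capabilities become reachable once the skills are installed together; as the abstract notes, whether the host model actually issues the tool calls is a separate question.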

Possible limitations

Not stated in the abstract

Abstract

arXiv:2606.00448v1

LLM agents increasingly rely on community-contributed skills that expand an agent's operational capability set. We study a core safety problem in agentic AI systems: whether individually safe skills can compose into unsafe installed skill sets. We present SkillReact, a compositional security measurement framework with three components: a deterministic static-composition benchmark, a two-rater LLM-assisted human-adjudication pipeline, and an action-based exploitability harness. On 1,520 ClawHub skills, 651 pass individual inspection and form 211,575 pairs; the benchmark flags 22.25% of these as structural candidates. We treat this raw rate as a recall-oriented scanner ceiling and calibrate it against human judgment: in a pattern-stratified audit, roughly one in five flagged pair-pattern hits survives as a real compositional risk (population-weighted validity 18.2%, our headline result), implying about 14K genuine risk memberships in a single registry that per-skill scanning misses by construction, since every pair is individually safe. An action-based harness then probes when these candidates become model-issued tool calls, and finds realization gated by host-model disposition: on an anchor-conditioned dropper subset, Haiku-4-5 issues the dropper-stage tool call on all 39 direct-prompt trials (36 of them the full download-then-execute chain, 3 download-only), Opus-4-7 stops at the download, and Sonnet-4-6 refuses outright. A control that holds the request fixed and varies only the installed skills finds compliance highest with no skills installed: a composition fixes which capabilities are reachable, while the host model decides whether to use them. Together these motivate install-time compositional checks and capability isolation as complements to per-skill scanning.
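The pair count in the abstract can be checked directly, and the idea behind a population-weighted validity can be sketched in a few lines. The first part below uses only figures quoted in the abstract. The `population_weighted_validity` helper and the `example_strata` values are hypothetical: the abstract does not give per-pattern counts, so the strata are invented purely to show how a stratified audit might be weighted, and the paper's actual estimator may differ.

```python
from math import comb

# Figures quoted in the abstract.
SKILLS_PASSING_INSPECTION = 651
REPORTED_PAIRS = 211_575
FLAG_RATE = 0.2225  # share of pairs flagged as structural candidates

# Unordered pairs among the individually safe skills: C(651, 2).
n_pairs = comb(SKILLS_PASSING_INSPECTION, 2)
assert n_pairs == REPORTED_PAIRS

# Approximate number of flagged pairs implied by the 22.25% rate.
flagged_pairs = round(n_pairs * FLAG_RATE)


def population_weighted_validity(strata):
    """Weight each stratum's audited validity by its share of all hits.

    strata: list of (hits_in_population, audited, confirmed) tuples.
    """
    total_hits = sum(hits for hits, _, _ in strata)
    return sum((hits / total_hits) * (confirmed / audited)
               for hits, audited, confirmed in strata)


# HYPOTHETICAL strata, invented for illustration only (not from the paper).
example_strata = [(1000, 50, 5), (3000, 50, 15)]
print(n_pairs, flagged_pairs, population_weighted_validity(example_strata))
```

Note that the abstract's 18.2% validity and its figure of about 14K risk memberships are stated over pair-pattern hits, not over flagged pairs; one pair can presumably match more than one pattern, so multiplying the flagged-pair count by 18.2% is not expected to reproduce the 14K figure.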