Safety Under Scaffolding: How Evaluation Conditions Shape Measured Safety

David Gringras

Published Jun 5, 2026. Featured #9 in the daily list, Jun 6, 2026.


Daily score: 67.6

Editorial review: 7.5

Relevance: 0.450

Freshness: 0.722

Why It Matters

What makes this one worth your time

Understanding the variability in safety measurements can help researchers and engineers make more informed decisions about deploying AI systems in real-world scenarios.

Evaluation conditions significantly influence AI model safety scores.

Summary

The paper investigates how different evaluation conditions affect the measured safety of AI models, revealing that safety scores can vary significantly based on the deployment configuration and the format of evaluation items.

Key contributions

Empirical evaluation of six frontier models across four deployment configurations and multiple safety benchmarks.
Identification of measurement artifacts that can distort safety evaluations, particularly in the context of map-reduce delegation.
Release of code, data, and prompts as part of the ScaffoldSafety initiative to facilitate further research.

Notable insights

On identical items, switching the evaluation format from multiple-choice to open-ended shifts the measured safety rate by 5-20 pp, so the item format needs to be controlled and reported in any safety assessment.
Scaffold architecture explains only 0.4% of outcome variance, while benchmark choice explains 45x more, which challenges the assumption that scaffold design is the main driver of measured safety.
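As a rough illustration of what "share of outcome variance explained" means, the sketch below computes eta squared (between-group sum of squares divided by total sum of squares) for two ways of grouping the same scores. This is an assumption made for illustration: the abstract does not say which variance decomposition the paper uses, and the function name `eta_squared` and all numbers in the toy table are hypothetical, not results from the paper.

```python
from statistics import mean


def eta_squared(groups):
    """Share of total variance explained by group membership.

    `groups` maps a group label to a list of scores. Returns
    SS_between / SS_total, or 0.0 if the scores do not vary at all.
    """
    all_values = [v for scores in groups.values() for v in scores]
    grand_mean = mean(all_values)
    ss_total = sum((v - grand_mean) ** 2 for v in all_values)
    ss_between = sum(
        len(scores) * (mean(scores) - grand_mean) ** 2
        for scores in groups.values()
    )
    return ss_between / ss_total if ss_total else 0.0


# Hypothetical safety rates for 2 benchmarks x 3 scaffolds (toy data only).
by_benchmark = {
    "benchmark_a": [0.90, 0.92, 0.88],
    "benchmark_b": [0.60, 0.62, 0.58],
}
# The same six scores, regrouped by scaffold instead of by benchmark.
by_scaffold = {
    "direct": [0.90, 0.60],
    "react": [0.92, 0.62],
    "map_reduce": [0.88, 0.58],
}

benchmark_share = eta_squared(by_benchmark)
scaffold_share = eta_squared(by_scaffold)
print(f"benchmark: {benchmark_share:.3f}, scaffold: {scaffold_share:.3f}")
```

In this toy table almost all of the spread comes from which benchmark is used, which mirrors the direction of the reported finding. Taking the abstract's figures at face value, 0.4% times 45 puts the benchmark share at roughly 18% of outcome variance.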

Possible limitations

Potential biases in the selection of models and benchmarks are not addressed in the abstract.
The generalizability coefficient is G = 0.000 with a wide bootstrap 95% CI ([0.000, 0.752]), which raises questions about how far the findings transfer to other contexts or models.

Abstract

arXiv:2603.10044v2

A safety score earned on a benchmark need not predict how the same model behaves once it is wrapped in an agentic scaffold the benchmark never tested. We ran six frontier models through four deployment configurations (direct API, ReAct, multi-agent critic, map-reduce delegation): N = 62,808 blinded, pre-registered, equivalence-tested evaluations across four safety benchmarks (BBQ, TruthfulQA, XSTest/OR-Bench, sycophancy), plus three supporting analyses. ReAct and multi-agent scaffolds stay within a pre-registered +/-2 pp equivalence margin; map-reduce delegation degrades measured safety (NNH = 14), though that loss is largely a measurement artifact: on identical items, multiple-choice versus open-ended phrasing shifts the measured safety rate by 5-20 pp, and decomposition silently strips the multiple-choice options. Roughly 40-89% of the per-model map-reduce loss is this format conversion rather than reasoning disruption, and an option-preserving variant recovers most of it. Pooled effects also mask sharp model-by-scaffold heterogeneity: under map-reduce, on identical items, Opus loses 16.8 pp while Llama 4 gains 18.8 pp. Structurally, scaffold architecture explains only 0.4% of outcome variance (benchmark choice explains 45x more), and the generalizability coefficient is G = 0.000 (bootstrap 95% CI [0.000, 0.752]). An interval that wide is enough on its own to undermine the utility of any single composite safety number as a deployment criterion. These are the "easy cases"; consequential properties like scheming and CBRN uplift have no obvious reason to be less format- or scaffold-sensitive. Code, data, and prompts are released as ScaffoldSafety.
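Two of the statistics in the abstract can be made concrete with a short sketch. Number needed to harm (NNH) is conventionally the reciprocal of the absolute risk increase, so NNH = 14 corresponds to a drop of about 7.1 pp in the measured safe-response rate. The equivalence check below is a generic two one-sided tests (TOST) procedure for two proportions with a normal approximation. It is an assumption chosen for illustration, not the paper's stated procedure, and the function names, sample sizes, and rates are hypothetical.

```python
import math


def number_needed_to_harm(safe_rate_baseline, safe_rate_scaffold):
    """NNH = 1 / absolute risk increase; infinite if safety did not drop."""
    absolute_risk_increase = safe_rate_baseline - safe_rate_scaffold
    if absolute_risk_increase <= 0:
        return math.inf
    return 1.0 / absolute_risk_increase


def tost_equivalent(p1, n1, p2, n2, margin=0.02, z_crit=1.6448536269514722):
    """Two one-sided z-tests for equivalence of two proportions.

    Returns True when the difference p2 - p1 is shown to lie inside
    (-margin, +margin) at one-sided alpha = 0.05 (the default z_crit).
    Uses an unpooled normal-approximation standard error.
    """
    diff = p2 - p1
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    if se == 0:
        return abs(diff) < margin
    z_lower = (diff + margin) / se  # tests H0: diff <= -margin
    z_upper = (diff - margin) / se  # tests H0: diff >= +margin
    return z_lower > z_crit and z_upper < -z_crit


# Hypothetical rates: a drop of 1/14 in the safe-response rate gives NNH = 14.
nnh = number_needed_to_harm(0.90, 0.90 - 1 / 14)
print(f"NNH: {nnh:.1f}")

# Hypothetical scaffold that differs from the baseline by 0.5 pp.
print(tost_equivalent(0.90, 10_000, 0.905, 10_000))
```

Note that equivalence is a claim that needs enough data: with the same 0.5 pp difference but only 100 items per arm, the test above cannot establish equivalence within the 2 pp margin.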