AlloSpatial: Agentic Harness Framework for Spatial Reasoning in Foundation Models
Shouwei Ruan, Bin Wang, Zhenyu Wu, Qihui Zhu, Yuxiang Zhang, Jingzhi Li, Yubin Wang, Xingxing Wei
Why It Matters
Robotics, navigation, and interactive systems all depend on reliable spatial reasoning about the physical world, an area where the abstract describes multimodal foundation models as still fragile.
AlloSpatial targets the bottleneck the authors identify: converting local egocentric observations into a global allocentric spatial representation.
Summary
The paper introduces AlloSpatial, an agentic framework that enhances spatial reasoning in multimodal foundation models by converting egocentric observations into structured allocentric representations. It reports gains on VSI-Bench and MindCube in both a training-free setting and with trained agents.
Key contributions
- Development of the AlloSpatial framework for allocentric spatial cognition.
- Introduction of World2Mind for converting egocentric observations into structured allocentric priors.
- Demonstration of improved performance on the VSI-Bench and MindCube spatial reasoning benchmarks, including 5%-18% gains for proprietary models in a training-free setting.
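
To make the egocentric-to-allocentric idea concrete, here is a minimal 2D sketch. It is not World2Mind's method, which the abstract does not specify; it only shows the standard rigid transform that places a camera-relative observation into a shared world frame. The function name `ego_to_allo`, the `(forward, left)` convention, and the example coordinates are all assumptions made for illustration.

```python
import math

def ego_to_allo(camera_xy, camera_yaw, point_ego):
    """Map a point from the camera (egocentric) frame to the world (allocentric) frame.

    camera_xy:  (x, y) camera position in the world frame (assumed known)
    camera_yaw: camera heading in radians, counter-clockwise from world +x
    point_ego:  (forward, left) offset of the object as seen from the camera
    """
    fwd, left = point_ego
    cos_t, sin_t = math.cos(camera_yaw), math.sin(camera_yaw)
    # Rotate the egocentric offset by the camera heading, then translate
    # by the camera position.
    x = camera_xy[0] + cos_t * fwd - sin_t * left
    y = camera_xy[1] + sin_t * fwd + cos_t * left
    return (x, y)

# Hypothetical example: the same chair observed from two viewpoints.
chair_from_a = ego_to_allo((0.0, 0.0), 0.0, (2.0, 1.0))
chair_from_b = ego_to_allo((4.0, 1.0), math.pi, (2.0, 0.0))
```

Both observations resolve to approximately the same world position, which is what lets view-dependent descriptions ("2 m ahead, 1 m to the left") be merged into one global map.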
Notable insights
- Allocentric-Spatial Trees (ASTs) and route maps serve as structured representations that support queries on object topology, geometric relations, passability, and trajectories.
- Cold-start reinforcement learning with a harness-gated trajectory-level reward is used to internalize the harness process in Qwen3-VL.
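
The abstract does not define the harness-gated reward, so the sketch below is only one plausible reading: a trajectory earns the task reward only if every harness check along it passed. The `Trajectory` fields, the `gated_reward` function, and the binary reward value are hypothetical and not taken from the paper.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Trajectory:
    # Hypothetical fields; the paper's actual trajectory format is not given.
    answer_correct: bool         # final answer matches the ground truth
    harness_checks: List[bool]   # one pass/fail flag per harness check

def gated_reward(traj: Trajectory, correct_reward: float = 1.0) -> float:
    """Return the task reward only when all harness checks passed (the gate)."""
    gate_open = all(traj.harness_checks)
    if gate_open and traj.answer_correct:
        return correct_reward
    return 0.0

# A correct answer reached through a failed check earns nothing under this reading.
r_clean = gated_reward(Trajectory(True, [True, True]))
r_failed_check = gated_reward(Trajectory(True, [True, False]))
r_wrong = gated_reward(Trajectory(False, [True, True]))
```

Under this assumption the reward is assigned once per trajectory rather than per step, which matches the "trajectory-level" wording, and the gate discourages correct answers reached through a process the harness would reject.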
Possible limitations
- Not stated in the abstract.
Abstract
arXiv:2606.08952v1
Multimodal Foundation Models (MFMs) have made substantial progress, yet remain fragile in spatial reasoning over the physical world. A key bottleneck lies in their inability to transform local egocentric observations into a global allocentric spatial representation. To address this, we propose AlloSpatial, an agentic framework for allocentric spatial cognition in foundation models. AlloSpatial introduces World2Mind, a plug-and-play cognitive mapping sandbox that converts egocentric observations into structured allocentric priors, including Allocentric-Spatial Trees and route maps that support querying object topology, geometric relations, passability, and trajectories. To utilize these priors reliably under noisy reconstruction and ambiguous visual evidence, AlloSpatial introduces a Spatial Reasoning Harness for tool-use judgment, modality-decoupled cue collection, and geometry-semantic arbitration. We further internalize this process in Qwen3-VL through cold-start reinforcement learning with a harness-gated trajectory-level reward. Experiments on VSI-Bench and MindCube show that AlloSpatial improves proprietary models by 5%-18% in a training-free setting, while ASTs alone support strong spatial reasoning even when visual inputs are removed. The trained AlloSpatial agents further outperform larger general-purpose models and competitive spatial baselines, suggesting that structured allocentric representations, active tool use, and verifiable reasoning offer a promising route toward spatially capable foundation models.