JoyAI-Image: Awaking Spatial Intelligence in Unified Multimodal Understanding and Generation

Lin Song, Wenbo Li, Guoqing Ma, Wei Tang, Bo Wang, Yuan Zhang, Yijun Yang, Yicheng Xiao, Jianhui Liu, Yanbing Zhang, Guohui Zhang, Wenhu Zhang, Hang Xu, Nan Jiang, Xin Han, Haoze Sun, Maoquan Zhang, Haoyang Huang, Nan Duan

Published May 22, 2026 · Featured #3 · In the daily list May 8, 2026

Daily score: 72.7

Editorial review: 7.5

Relevance: 0.470

Freshness: 0.722

Why It Matters

What makes this one worth your time

The abstract positions unified visual models of this kind as a promising basis for downstream applications such as vision-language-action systems and world models, which makes the paper relevant to engineers and researchers working on multimodal AI.

JoyAI-Image advances multimodal capabilities with enhanced spatial intelligence.

Summary

The paper introduces JoyAI-Image, a multimodal foundation model that unifies visual understanding, text-to-image generation, and instruction-guided image editing. Its architecture couples a spatially enhanced MLLM with an MMDiT, and the authors report state-of-the-art or highly competitive results across understanding, generation, long-text rendering, and editing benchmarks.

Key contributions

Development of JoyAI-Image, a unified multimodal foundation model.
Introduction of a scalable training recipe that incorporates spatially grounded data and editing signals.
Reported state-of-the-art or highly competitive performance across understanding, generation, long-text rendering, and editing benchmarks.

Notable insights

Combining spatially grounded data with unified instruction tuning is the paper's route to stronger geometry-aware reasoning.
The bidirectional loop between enhanced understanding, controllable spatial editing, and novel-view-assisted reasoning is presented as the mechanism that moves the model from general visual competence toward spatial intelligence.

Possible limitations

Not stated in the abstract.

Abstract

arXiv:2605.04128v2

We present JoyAI-Image, a unified multimodal foundation model for visual understanding, text-to-image generation, and instruction-guided image editing. JoyAI-Image couples a spatially enhanced Multimodal Large Language Model (MLLM) with a Multimodal Diffusion Transformer (MMDiT), allowing perception and generation to interact through a shared multimodal interface. Around this architecture, we build a scalable training recipe that combines unified instruction tuning, long-text rendering supervision, spatially grounded data, and both general and spatial editing signals. This design gives the model broad multimodal capability while strengthening geometry-aware reasoning and controllable visual synthesis. Experiments across understanding, generation, long-text rendering, and editing benchmarks show that JoyAI-Image achieves state-of-the-art or highly competitive performance. More importantly, the bidirectional loop between enhanced understanding, controllable spatial editing, and novel-view-assisted reasoning enables the model to move beyond general visual competence toward stronger spatial intelligence. These results suggest a promising path for unified visual models in downstream applications such as vision-language-action systems and world models.
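The training recipe in the abstract mixes several supervision sources. As a rough illustration only, the sketch below shows how such a mixture could be sampled per batch. The source names paraphrase the abstract; the weights, the function names `sample_sources` and `batch_composition`, and the sampling scheme itself are assumptions made for this example, since the abstract gives no proportions or implementation details.

```python
import random
from collections import Counter

# Hypothetical mixture weights: the abstract names these data sources
# but does not state their proportions.
MIXTURE = {
    "unified_instruction_tuning": 0.4,
    "long_text_rendering": 0.2,
    "spatially_grounded": 0.2,
    "general_editing": 0.1,
    "spatial_editing": 0.1,
}


def sample_sources(mixture, batch_size, rng):
    """Draw one source name per example, with probability proportional to its weight."""
    names = list(mixture)
    weights = [mixture[name] for name in names]
    return rng.choices(names, weights=weights, k=batch_size)


def batch_composition(mixture, batch_size, seed=0):
    """Count how many examples of each source land in one sampled batch."""
    rng = random.Random(seed)  # seeded so the composition is reproducible
    return Counter(sample_sources(mixture, batch_size, rng))


if __name__ == "__main__":
    for source, count in batch_composition(MIXTURE, batch_size=1000).most_common():
        print(f"{source}: {count}")
```

Weighted sampling like this is one common way to balance heterogeneous data sources; the paper may use a different schedule, such as staged training, which the abstract does not specify.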