Revisiting Model Stitching In the Foundation Model Era

Zheda Mai, Ke Zhang, Fu-En Wang, Zixiao Ken Wang, Albert Y. C. Chen, Lu Xia, Min Sun, Wei-Lun Chao, Cheng-Hao Kuo

Published Jun 5, 2026

Editorial review: 7.2

Relevance: 0.504

Freshness: 0.000

Why It Matters

What makes this one worth your time

Multimodal LLMs often rely on several VFMs at once. Knowing how to stitch heterogeneous VFMs reliably could make such systems more efficient, because early layers can be shared across models, while still combining the complementary strengths of each model.

The paper explores model stitching for heterogeneous Vision Foundation Models, proposing methods to enhance compatibility and performance.

Summary

The paper revisits model stitching for Vision Foundation Models (VFMs) that differ in training objectives, data, and modality mix. It introduces a systematic protocol covering stitch points, stitch layer families, training losses, and downstream tasks, and uses it to identify what determines stitching performance. The study finds that training the stitch layer with a feature-matching loss at the target model's penultimate layer makes heterogeneous VFMs reliably stitchable across vision tasks. It also proposes the VFM Stitch Tree (VST), which shares early layers across VFMs while keeping their later layers, giving a controllable accuracy-latency trade-off for multimodal LLMs.
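To make the two kinds of training signal concrete, here is a minimal sketch of a stitched model and the two feature-matching losses the summary contrasts. Everything in it is an illustrative assumption rather than the authors' implementation: the "models" are toy stacks of random tanh layers, the stitch layer is a single linear map, the final layer of each toy stack stands in for the penultimate layer, and the names `make_layers`, `run`, `stitched_features`, `stitch_point_loss`, and `final_feature_loss` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_layers(depth, dim):
    """Toy stand-in for a frozen model: a list of (dim x dim) weight matrices."""
    return [rng.normal(scale=dim ** -0.5, size=(dim, dim)) for _ in range(depth)]

def run(layers, x, start=0, stop=None):
    """Apply layers[start:stop], each followed by tanh, and return the features."""
    for w in layers[start:stop]:
        x = np.tanh(x @ w)
    return x

def stitched_features(source, target, stitch_w, x, k):
    """Source layers [0, k) -> linear stitch layer -> target layers [k, end)."""
    h = run(source, x, 0, k) @ stitch_w
    return run(target, h, k)

def stitch_point_loss(source, target, stitch_w, x, k):
    """Conventional option: match the target's intermediate features at layer k."""
    diff = run(source, x, 0, k) @ stitch_w - run(target, x, 0, k)
    return np.mean(diff ** 2)

def final_feature_loss(source, target, stitch_w, x, k):
    """Option the summary highlights: match the target's own final features
    (the toy analogue of its penultimate layer) after the stitched forward pass."""
    diff = stitched_features(source, target, stitch_w, x, k) - run(target, x)
    return np.mean(diff ** 2)

# Example: two unrelated toy models stitched at layer k with an untrained stitch layer.
dim, depth, k = 8, 4, 2
source, target = make_layers(depth, dim), make_layers(depth, dim)
x = rng.normal(size=(5, dim))
stitch_w = np.eye(dim)  # identity initialisation, purely for illustration
print(stitch_point_loss(source, target, stitch_w, x, k))
print(final_feature_loss(source, target, stitch_w, x, k))
```

In a real setup only `stitch_w` would be trained, with both backbones frozen. The difference between the two losses is where the error is measured: at the stitch point itself, or after the target's remaining layers have processed the stitched features.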

Key contributions

Introduces a systematic protocol for stitching heterogeneous VFMs.
Proposes training the stitch layer with a feature-matching loss at the target model's penultimate layer, which makes heterogeneous VFMs reliably stitchable.
Develops the VFM Stitch Tree for optimizing accuracy-latency trade-offs.
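The accuracy-latency trade-off of a stitch tree can be illustrated with a rough cost model. The sketch below counts layer evaluations as a proxy for latency. It is not taken from the paper: it assumes the shared trunk consists of the first `shared_depth` layers of one VFM, that every other VFM attaches to the trunk through its own stitch layer, and that a stitch layer costs a fixed fraction `stitch_cost` of a regular layer. The function names are hypothetical.

```python
def independent_cost(depths):
    """Layer evaluations when every VFM runs end to end on its own."""
    return sum(depths)

def stitch_tree_cost(depths, shared_depth, stitch_cost=0.1):
    """Layer evaluations for a tree with one shared trunk.

    Assumptions (illustrative only): the trunk is the first `shared_depth`
    layers of the first VFM, each remaining VFM needs one stitch layer, and
    a stitch layer costs `stitch_cost` of a regular layer.
    """
    if not 0 <= shared_depth <= min(depths):
        raise ValueError("shared_depth must fit inside every model")
    # Later layers are kept separately for each VFM.
    branches = sum(d - shared_depth for d in depths)
    # With no shared trunk there is nothing to stitch.
    n_stitch = len(depths) - 1 if shared_depth > 0 else 0
    return shared_depth + branches + n_stitch * stitch_cost

# Sweep the sharing depth for three hypothetical 24-layer VFMs.
depths = [24, 24, 24]
for shared in (0, 6, 12, 18):
    print(shared, stitch_tree_cost(depths, shared), independent_cost(depths))
```

Under these assumptions, a deeper shared trunk lowers the cost roughly linearly. Whether accuracy holds up at a given sharing depth is the empirical question the paper studies, and this cost model says nothing about it.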

Notable insights

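How the stitch layer is trained matters: matching intermediate features at the stitch point, or optimizing the task loss end-to-end, struggles to retain accuracy, especially at shallow stitch points, whereas a feature-matching loss at the target model's penultimate layer makes stitching reliable.
At deep stitch points, the stitched model can surpass either constituent model, with the stitch layer as the only added inference overhead.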

Possible limitations

Not stated in the abstract

Abstract

arXiv:2603.12433v3

Model stitching, connecting early layers of one model (source) to later layers of another (target) via a light stitch layer, has served as a probe of representational compatibility. Prior work finds that models trained on the same dataset remain stitchable (negligible accuracy drop) despite different initializations or objectives. We revisit stitching for Vision Foundation Models (VFMs) that vary in objectives, data, and modality mix (e.g., CLIP, DINOv2, SigLIP 2) and ask: Are heterogeneous VFMs stitchable? We introduce a systematic protocol spanning the stitch points, stitch layer families, training losses, and downstream tasks. Three findings emerge. (1) Stitch layer training matters: conventional approaches that match the intermediate features at the stitch point or optimize the task loss end-to-end struggle to retain accuracy, especially at shallow stitch points. (2) With a simple feature-matching loss at the target model's penultimate layer, heterogeneous VFMs become reliably stitchable across vision tasks. (3) For deep stitch points, the stitched model can surpass either constituent model at only a small inference overhead (for the stitch layer). Building on these findings, we further propose the VFM Stitch Tree (VST), which shares early layers across VFMs while retaining their later layers, yielding a controllable accuracy-latency trade-off for multimodal LLMs that often leverage multiple VFMs. Taken together, our study elevates stitching from a diagnostic probe to a practical recipe for integrating complementary VFM strengths and pinpointing where their representations align or diverge.