TorchUMM: A Unified Multimodal Model Codebase for Evaluation, Analysis, and Post-training
Yinyi Luo, Wenwen Wang, Hayes Bai, Hongyu Zhu, Hao Chen, Pan He, Marios Savvides, Sharon Li, Jindong Wang
Why It Matters
What makes this one worth your time
Unified multimodal models differ widely in architecture, training paradigm, and implementation details, which makes their published results hard to compare. A single codebase with one interface and one set of evaluation protocols lets researchers compare architectures on equal terms.
TorchUMM provides that standardized framework for evaluating, analyzing, and post-training diverse unified multimodal models.
Summary
The paper introduces TorchUMM, a unified codebase designed to evaluate, analyze, and facilitate post-training for various unified multimodal models. It aims to standardize evaluation protocols across different models and tasks, enabling fair comparisons and deeper insights into model capabilities.
Key contributions
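- A unified codebase for evaluation, analysis, and post-training of unified multimodal models (UMMs), which the authors present as the first of its kind.
- Support for models spanning a wide range of scales and design paradigms.
- A benchmark covering three task dimensions (multimodal understanding, generation, and editing) that integrates established and novel datasets to assess perception, reasoning, compositionality, and instruction following.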
Notable insights
- The integration of both established and novel datasets allows for comprehensive evaluation of multimodal models.
- Standardized evaluation protocols can lead to more reproducible and fair comparisons across different models.
Possible limitations
- Not stated in the abstract
Abstract
arXiv:2604.10784v2
Recent advances in unified multimodal models (UMMs) have led to a proliferation of architectures capable of understanding, generating, and editing across visual and textual modalities. However, developing a unified framework for UMMs remains challenging due to the diversity of model architectures and the heterogeneity of training paradigms and implementation details. In this paper, we present TorchUMM, the first unified codebase for comprehensive evaluation, analysis, and post-training across diverse UMM backbones, tasks, and datasets. TorchUMM supports a broad spectrum of models covering a wide range of scales and design paradigms. Our benchmark encompasses three core task dimensions: multimodal understanding, generation, and editing, and integrates both established and novel datasets to evaluate perception, reasoning, compositionality, and instruction-following abilities. By providing a unified interface and standardized evaluation protocols, TorchUMM enables fair and reproducible comparisons across heterogeneous models and fosters deeper insights into their strengths and limitations, facilitating the development of more capable unified multimodal systems. Code is available at: https://github.com/AIFrontierLab/TorchUMM.
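Illustrative sketch
The abstract's idea of a unified interface with a standardized evaluation protocol can be pictured with a small example. This is only a sketch under stated assumptions: `UnifiedModel`, `Example`, `exact_match`, `evaluate`, and the two toy models are hypothetical names invented here and are not TorchUMM's actual API. It covers only the understanding task with a simple exact-match metric, and the point is that every model is scored on the same data with the same metric.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Protocol


class UnifiedModel(Protocol):
    """Hypothetical common interface (an assumption, not TorchUMM's API)."""

    name: str

    def understand(self, image: str, question: str) -> str:
        """Answer a question about an image, given here as a file path."""
        ...


@dataclass
class Example:
    image: str
    question: str
    answer: str


def exact_match(prediction: str, reference: str) -> float:
    """Return 1.0 if the strings match after trimming and lowercasing."""
    return float(prediction.strip().lower() == reference.strip().lower())


def evaluate(
    models: List[UnifiedModel],
    dataset: List[Example],
    metric: Callable[[str, str], float] = exact_match,
) -> Dict[str, float]:
    """Score every model on the same examples with the same metric."""
    if not dataset:
        raise ValueError("dataset must not be empty")
    scores: Dict[str, float] = {}
    for model in models:
        total = sum(
            metric(model.understand(ex.image, ex.question), ex.answer)
            for ex in dataset
        )
        scores[model.name] = total / len(dataset)
    return scores


# Two toy stand-ins for heterogeneous backbones (purely illustrative).
class AlwaysYes:
    name = "always-yes"

    def understand(self, image: str, question: str) -> str:
        return "Yes"


class EchoLastWord:
    name = "echo-last-word"

    def understand(self, image: str, question: str) -> str:
        return question.rstrip("?").split()[-1]


dataset = [
    Example("img_0.png", "Is there a cat?", "yes"),
    Example("img_1.png", "Is the sky green?", "no"),
]
scores = evaluate([AlwaysYes(), EchoLastWord()], dataset)
print(scores)
```

Because `evaluate` depends only on the shared interface, adding another model means writing one adapter class rather than a new evaluation script. That is the kind of fair, reproducible comparison the abstract describes, although the real codebase may organize this quite differently.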