UniVidX: A Unified Multimodal Framework for Versatile Video Generation via Diffusion Priors

Houyuan Chen, Hong Li, Xianghao Kong, Tianrui Zhu, Shaocong Xu, Weiqing Xiao, Yuwei Guo, Chongjie Ye, Lvmin Zhang, Hao Zhao, Anyi Rao

Published May 4, 2026 | Featured #6 | In the daily list May 5, 2026

Daily score: 64.0

Editorial review: 7.2

Relevance: 0.457

Freshness: 0.722

Why It Matters

What makes this one worth your time

This framework could streamline the development of multimodal video applications by reducing the need for separate models for each task, potentially saving resources and improving cross-modal consistency.

UniVidX unifies multimodal video generation using diffusion priors for versatile and consistent synthesis.

Summary

The paper introduces UniVidX, a unified multimodal framework that leverages video diffusion model (VDM) priors for versatile, cross-modal video synthesis. It combines three techniques, Stochastic Condition Masking, Decoupled Gated LoRA, and Cross-Modal Self-Attention, to enable omni-directional conditional generation and inter-modal alignment. The framework is instantiated in two domains: RGB videos with intrinsic maps (albedo, irradiance, and normal), and blended RGB videos with their constituent RGBA layers. Both instantiations perform competitively with state-of-the-art methods.

Key contributions

Introduction of a unified multimodal framework for video generation using diffusion priors.
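Development of Stochastic Condition Masking for flexible conditional generation.
Introduction of Decoupled Gated LoRA, which adds per-modality LoRAs while preserving the backbone's priors.
Implementation of Cross-Modal Self-Attention for improved inter-modal alignment.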
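
As a rough illustration of the Stochastic Condition Masking idea listed above, the following minimal sketch randomly splits modalities into clean conditions and noisy targets for one training step. The function names, the 0.5 sampling probability, the rule that at least one target is kept, and the simple additive Gaussian noise are all assumptions made for illustration; the abstract does not specify how the partition is sampled or how noise is applied.

```python
import random


def stochastic_condition_mask(modalities, rng, target_prob=0.5):
    """Randomly split modality names into clean conditions and noisy targets.

    Assumption: each modality independently becomes a target with
    probability target_prob, and at least one target is always kept.
    """
    names = list(modalities)
    targets = [name for name in names if rng.random() < target_prob]
    if not targets:
        targets = [rng.choice(names)]
    conditions = [name for name in names if name not in targets]
    return conditions, targets


def add_noise_to_targets(latents, targets, noise_level, rng):
    """Perturb only the target modalities; conditions are passed through clean.

    Assumption: plain additive Gaussian noise stands in for whatever
    noising process the actual diffusion backbone uses.
    """
    noisy = {}
    for name, values in latents.items():
        if name in targets:
            noisy[name] = [v + noise_level * rng.gauss(0.0, 1.0) for v in values]
        else:
            noisy[name] = list(values)
    return noisy


# Toy latents for the four modalities named in the abstract.
rng = random.Random(0)
latents = {
    "rgb": [0.1, 0.2],
    "albedo": [0.3, 0.4],
    "irradiance": [0.5, 0.6],
    "normal": [0.7, 0.8],
}
conditions, targets = stochastic_condition_mask(latents, rng)
noisy_latents = add_noise_to_targets(latents, targets, 0.5, rng)
```

Because the partition is resampled at every step, a single model sees many input-output mappings during training instead of one fixed direction.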

Notable insights

Stochastic Condition Masking enables omni-directional conditional generation by randomly partitioning modalities into clean conditions and noisy targets during training.
Cross-Modal Self-Attention facilitates inter-modal alignment by sharing keys and values across modalities while maintaining modality-specific queries.
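
The Cross-Modal Self-Attention insight above can be sketched in a few lines. This toy version takes one plausible reading of "sharing keys and values": the keys and values of all modalities are concatenated into one pool, and each modality's own queries attend over that pool. The function names, the single-head layout, and the plain-list tensors are assumptions for illustration, not the paper's implementation.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(queries, keys, values):
    """Single-head scaled dot-product attention on lists of vectors."""
    dim = len(keys[0])
    outputs = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in keys]
        weights = softmax(scores)
        outputs.append(
            [sum(w * v[j] for w, v in zip(weights, values)) for j in range(len(values[0]))]
        )
    return outputs


def cross_modal_self_attention(q_by_mod, k_by_mod, v_by_mod):
    """Modality-specific queries attend over keys/values pooled from all modalities."""
    # Assumption: "sharing" is modeled as concatenating every modality's tokens.
    shared_keys = [k for name in k_by_mod for k in k_by_mod[name]]
    shared_values = [v for name in k_by_mod for v in v_by_mod[name]]
    return {name: attend(q, shared_keys, shared_values) for name, q in q_by_mod.items()}


queries = {"rgb": [[1.0, 0.0], [0.0, 1.0]], "normal": [[0.5, 0.5]]}
keys = {"rgb": [[1.0, 0.0], [0.0, 1.0]], "normal": [[1.0, 1.0]]}
values = {"rgb": [[1.0, 0.0], [0.0, 1.0]], "normal": [[2.0, 2.0]]}
out = cross_modal_self_attention(queries, keys, values)
```

In this sketch the output for the "normal" token mixes in RGB values, which is the information exchange the shared key-value pool is meant to provide.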

Possible limitations

Not stated in the abstract

Abstract

arXiv:2605.00658v1

Recent progress has shown that video diffusion models (VDMs) can be repurposed for diverse multimodal graphics tasks. However, existing methods often train separate models for each problem setting, which fixes the input-output mapping and limits the modeling of correlations across modalities. We present UniVidX, a unified multimodal framework that leverages VDM priors for versatile video generation. UniVidX formulates pixel-aligned tasks as conditional generation in a shared multimodal space, adapts to modality-specific distributions while preserving the backbone's native priors, and promotes cross-modal consistency during synthesis. It is built on three key designs. Stochastic Condition Masking (SCM) randomly partitions modalities into clean conditions and noisy targets during training, enabling omni-directional conditional generation instead of fixed mappings. Decoupled Gated LoRA (DGL) introduces per-modality LoRAs that are activated when a modality serves as the generation target, preserving the strong priors of the VDM. Cross-Modal Self-Attention (CMSA) shares keys and values across modalities while keeping modality-specific queries, facilitating information exchange and inter-modal alignment. We instantiate UniVidX in two domains: UniVid-Intrinsic, for RGB videos and intrinsic maps including albedo, irradiance, and normal; and UniVid-Alpha, for blended RGB videos and their constituent RGBA layers. Experiments show that both models achieve performance competitive with state-of-the-art methods across distinct tasks and generalize robustly to in-the-wild scenarios, even when trained on fewer than 1,000 videos. Project page: https://houyuanchen111.github.io/UniVidX.github.io/
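
To make the Decoupled Gated LoRA description in the abstract more concrete, here is a minimal sketch of a linear layer with one low-rank adapter per modality, applied only when that modality is the generation target. The class name, the hard on/off gate, and the tiny hand-picked matrices are hypothetical; the abstract states only that per-modality LoRAs are activated for target modalities, not how the gate is implemented.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


class GatedLoRALinear:
    """Frozen base weight plus one low-rank adapter per modality (hypothetical)."""

    def __init__(self, base_weight, lora_down, lora_up):
        self.base_weight = base_weight  # stands in for the frozen backbone weight
        self.lora_down = lora_down      # modality name -> (rank x in) matrix
        self.lora_up = lora_up          # modality name -> (out x rank) matrix

    def forward(self, x, modality, is_target):
        out = matvec(self.base_weight, x)
        # Assumption: a hard binary gate. Condition modalities use the
        # untouched backbone path; targets add their own low-rank update.
        if is_target:
            hidden = matvec(self.lora_down[modality], x)
            delta = matvec(self.lora_up[modality], hidden)
            out = [o + d for o, d in zip(out, delta)]
        return out


layer = GatedLoRALinear(
    base_weight=[[1.0, 0.0], [0.0, 1.0]],
    lora_down={"albedo": [[1.0, 1.0]]},
    lora_up={"albedo": [[0.5], [0.0]]},
)
x = [1.0, 2.0]
as_condition = layer.forward(x, "albedo", is_target=False)
as_target = layer.forward(x, "albedo", is_target=True)
```

With the gate closed, the layer behaves exactly like the frozen backbone, which is one way to read the abstract's claim that the design preserves the VDM's priors.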