AnyMo: Scaling Any-Modality Conditional Motion Generation with Masked Modeling
Yiheng Li, Zhuo Li, Ruibing Hou, Yingjie Chen, Hong Chang, Hao Liu, Shiguang Shan
Why It Matters
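Conditional human motion generation is a core problem in computer vision and robotics, yet current methods are often tied to fixed modality configurations and task-specific architectures. AnyMo targets generation under arbitrary combinations of control signals such as text, speech, music, and trajectory.
It is trained on OmniHuMo, a dataset of over 5,000 hours of motion (3.2 million sequences) with aligned multimodal annotations, which addresses the scarcity of large-scale modality-aligned motion data.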
Summary
The paper introduces AnyMo, a unified multimodal framework for conditional human motion generation, together with OmniHuMo, a large-scale dataset of over 5,000 hours of motion with aligned annotations such as text, speech, music, and trajectory. The authors report high-fidelity motion synthesis under arbitrary combinations of these control signals.
Key contributions
- Introduction of the OmniHuMo dataset, with over 5,000 hours of motion, 3.2 million sequences, and precisely aligned multimodal annotations.
- Development of the AnyMo framework, which combines a Residual FSQ-based motion tokenizer with a scalable masked modeling transformer.
- Demonstration of flexible control over spatial and stylistic attributes in motion generation.
Notable insights
- The core design choice is pairing a Residual FSQ-based (finite scalar quantization) motion tokenizer with a masked modeling transformer; the abstract credits this combination with enabling synthesis under arbitrary modality combinations.
- OmniHuMo's scale and modality diversity may help models generalize across control signals; the abstract names the scarcity of such data as a key bottleneck.
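For readers unfamiliar with the tokenizer idea, here is a minimal sketch of residual finite scalar quantization in plain Python. It is not the authors' implementation: the abstract gives no tokenizer details, so the function names, the number of levels, the number of stages, the clipping to [-1, 1], and the per-stage rescaling are all illustrative assumptions.

```python
def fsq_index(x, levels):
    """Map x (clipped to [-1, 1]) to an integer grid index.

    Assumes an odd number of levels, so indices run from
    -(levels - 1) // 2 to (levels - 1) // 2.
    """
    half = (levels - 1) // 2
    clipped = max(-1.0, min(1.0, x))
    return round(clipped * half)


def residual_fsq(z, levels=5, num_stages=3):
    """Quantize each value of z in several coarse-to-fine stages.

    Every stage rounds the current residual onto a fixed scalar grid,
    then hands the remaining error to the next, finer stage.
    Returns the per-stage integer codes and the reconstruction.
    """
    half = (levels - 1) // 2
    residual = list(z)
    recon = [0.0] * len(z)
    codes = []
    scale = 1.0  # hypothetical schedule: shrink by (levels - 1) per stage
    for _ in range(num_stages):
        idx = [fsq_index(r / scale, levels) for r in residual]
        quant = [i / half * scale for i in idx]
        codes.append(idx)
        recon = [a + q for a, q in zip(recon, quant)]
        residual = [r - q for r, q in zip(residual, quant)]
        scale /= levels - 1
    return codes, recon


latent = [0.3, -0.8, 0.05]
codes, recon = residual_fsq(latent, levels=5, num_stages=3)
max_error = max(abs(a - b) for a, b in zip(latent, recon))
```

Under these assumptions, each extra stage shrinks the worst-case per-dimension error by a factor of `levels - 1` for inputs in [-1, 1], which is the usual motivation for stacking residual stages instead of enlarging a single codebook.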
Possible limitations
- The abstract states no limitations and reports no quantitative results. Likely concerns are the quality and consistency of the multimodal annotations and the compute required to train on a dataset of this size.
Abstract
arXiv:2605.29488v1
Conditional human motion generation remains a fundamental challenge in computer vision and robotics. Despite significant progress, current methods are often constrained by fixed modality configurations and task-specific architectures, leaving cross-modal interactions and the scaling laws of multimodal-conditioned synthesis largely underexplored. A key bottleneck is the scarcity of large-scale modality-aligned motion data, limiting generalization across diverse control signals. In this work, we introduce OmniHuMo, a large-scale, high-quality dataset comprising over 5,000 hours of motion and 3.2 million sequences with precisely aligned multimodal annotations (e.g., text, speech, music, and trajectory). Leveraging OmniHuMo, we propose AnyMo, a unified multimodal framework combining a Residual FSQ-based motion tokenizer with a scalable masked modeling transformer, enabling high-quality motion synthesis under arbitrary modality combinations. Extensive experiments show that AnyMo achieves high-fidelity synthesis while offering flexible control over both spatial and stylistic attributes.
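The abstract does not describe how the masked modeling transformer is trained. As a rough illustration of generic masked token modeling with optional conditions, the sketch below corrupts a motion token sequence and randomly drops conditioning modalities. Everything here is an assumption made for illustration, not the paper's procedure: the `MASK_ID` value, the cosine mask-ratio schedule, the `drop_prob` parameter, and the function name are hypothetical.

```python
import math
import random

MASK_ID = -1  # hypothetical id reserved for the [MASK] token


def corrupt_for_masked_training(tokens, conditions, rng, drop_prob=0.3):
    """Build one masked-modeling training example.

    tokens:     non-empty list of discrete motion token ids
    conditions: dict mapping modality name -> features (e.g. "text", "music")
    Returns (inputs, targets, kept_conditions). Targets are None at
    positions the model is not asked to predict.
    """
    # Sample a mask ratio in (0, 1] from a cosine schedule (an assumption).
    ratio = math.cos(rng.random() * math.pi / 2)
    num_mask = max(1, round(ratio * len(tokens)))
    masked = set(rng.sample(range(len(tokens)), num_mask))

    inputs = [MASK_ID if i in masked else t for i, t in enumerate(tokens)]
    targets = [t if i in masked else None for i, t in enumerate(tokens)]

    # Randomly drop modalities so a model sees every subset of conditions.
    kept = {
        name: feat
        for name, feat in conditions.items()
        if rng.random() >= drop_prob
    }
    return inputs, targets, kept


rng = random.Random(0)
inputs, targets, kept = corrupt_for_masked_training(
    tokens=[12, 7, 7, 30, 4, 19],
    conditions={"text": "a person waves", "music": "clip_features"},
    rng=rng,
)
```

Dropping modalities at random during training is one common way to let a single model accept any subset of conditions at inference time; whether AnyMo does this is not stated in the abstract.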