Towards Safe Mobility: A Unified Transportation Foundation Model enabled by Open-Ended Vision-Language Dataset

Wenhui Huang, Songyan Zhang, Collister Chua, Yang Liang, Zhiqi Mao, Heng Yang, Chen Lv

Published Apr 27, 2026

Editorial review: 7.0

Why It Matters

What makes this one worth your time

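Most foundation-model research in intelligent transportation centers on microscopic autonomous driving and pays limited attention to city-scale traffic analysis. This work addresses that gap with a roadside-camera dataset (LTD) and a unified model (UniVLT) that could enhance safety in urban transportation systems.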

A new dataset and model for improving safety in urban transportation through vision-language reasoning.

Summary

The paper introduces the Land Transportation Dataset (LTD), a large-scale vision-language dataset for urban traffic environments, and proposes UniVLT, a transportation foundation model that unifies autonomous driving reasoning and city-scale traffic analysis.

Key contributions

Introduction of the Land Transportation Dataset (LTD) for urban traffic environments.
Development of UniVLT, a unified transportation foundation model for reasoning across diverse traffic scenarios.

Notable insights

Dataset annotation combines multi-model vision-language generation with cross-validation and human-in-the-loop refinement.
Curriculum-based knowledge transfer is used to train a unified model for both microscopic and macroscopic traffic analysis.
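
As a rough illustration of the annotation insight above, the idea of generating answers with several models, cross-validating them, and sending disagreements to a human reviewer could be sketched as follows. The function name, the agreement threshold, and the exact-match comparison are assumptions made for illustration; the abstract does not describe how the authors compare model outputs, and open-ended answers would likely need a semantic comparison rather than string equality.

```python
from collections import Counter


def cross_validate_answers(candidates, min_agreement=2):
    """Pick a consensus answer and flag whether a human should review it.

    candidates: answer strings, one per generator model (hypothetical setup).
    min_agreement: assumed number of matching answers needed to skip review.
    Returns (answer, needs_human_review).
    """
    # Normalize lightly so trivial differences do not count as disagreement.
    normalized = [c.strip().lower() for c in candidates]
    if not normalized:
        return None, True
    answer, votes = Counter(normalized).most_common(1)[0]
    # Too few matching answers: route the sample to human refinement.
    return answer, votes < min_agreement
```

In this sketch, a sample whose generators mostly agree is accepted automatically, and everything else is queued for human refinement.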

Possible limitations

Not stated in the abstract

Abstract

arXiv:2604.22260v1

Urban transportation systems face growing safety challenges that require scalable intelligence for emerging smart mobility infrastructures. While recent advances in foundation models and large-scale multimodal datasets have strengthened perception and reasoning in intelligent transportation systems (ITS), existing research remains largely centered on microscopic autonomous driving (AD), with limited attention to city-scale traffic analysis. In particular, open-ended safety-oriented visual question answering (VQA) and corresponding foundation models for reasoning over heterogeneous roadside camera observations remain underexplored. To address this gap, we introduce the Land Transportation Dataset (LTD), a large-scale open-source vision-language dataset for open-ended reasoning in urban traffic environments. LTD contains 11.6K high-quality VQA pairs collected from heterogeneous roadside cameras, spanning diverse road geometries, traffic participants, illumination conditions, and adverse weather. The dataset integrates three complementary tasks: fine-grained multi-object grounding, multi-image camera selection, and multi-image risk analysis, requiring joint reasoning over minimally correlated views to infer hazardous objects, contributing factors, and risky road directions. To ensure annotation fidelity, we combine multi-model vision-language generation with cross-validation and human-in-the-loop refinement. Building upon LTD, we further propose UniVLT, a transportation foundation model trained via curriculum-based knowledge transfer to unify microscopic AD reasoning and macroscopic traffic analysis within a single architecture. Extensive experiments on LTD and multiple AD benchmarks demonstrate that UniVLT achieves SOTA performance on open-ended reasoning tasks across diverse domains, while exposing limitations of existing foundation models in complex multi-view traffic scenarios.
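
To make the three task types named in the abstract more concrete, one sample might be represented roughly as below. This is a minimal sketch: the class names, field names, and validation rules are hypothetical, since the abstract does not describe the released schema. The rule that the two "multi-image" tasks need at least two views is an assumption drawn only from their names.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class Task(Enum):
    # The three tasks listed in the abstract; the string values are invented.
    GROUNDING = "multi_object_grounding"
    CAMERA_SELECTION = "multi_image_camera_selection"
    RISK_ANALYSIS = "multi_image_risk_analysis"


# Assumption: tasks named "multi-image" take two or more camera views.
MULTI_IMAGE_TASKS = {Task.CAMERA_SELECTION, Task.RISK_ANALYSIS}


@dataclass
class VQASample:
    task: Task
    image_paths: List[str]  # one path per roadside camera view
    question: str
    answer: str             # open-ended free text, not a multiple-choice label


def validate(sample: VQASample) -> List[str]:
    """Return a list of problems found in a sample (empty if none)."""
    problems = []
    if not sample.image_paths:
        problems.append("no images")
    elif sample.task in MULTI_IMAGE_TASKS and len(sample.image_paths) < 2:
        problems.append("multi-image task needs at least two views")
    if not sample.question.strip() or not sample.answer.strip():
        problems.append("empty question or answer")
    return problems
```

A check like `validate` could serve as a basic sanity pass before samples are used for training or evaluation.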