Chronicle: A Multimodal Foundation Model for Joint Language and Time Series Understanding
Paul Quinlan, Jeremy Levasseur, Qingguo Li, Xiaodan Zhu
Why It Matters
What makes this one worth your time
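This work matters to researchers building models over mixed text and numerical data. It tests whether joint pretraining on text and time series is needed at all by comparing a single model against the strongest unimodal foundation models in each domain, not only against other multimodal baselines.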
Chronicle is a unified model for language and time series that achieves competitive results across both domains.
Summary
The paper introduces Chronicle, a 324M-parameter decoder-only transformer trained from scratch on both natural language and time series data within a unified architecture. It is evaluated against both multimodal and unimodal models: it matches Gemma-3-270M-PT on 19 NLU tasks, sets a new bar for frozen-embedding time series classification on 24 UCR/UEA datasets, and beats every supervised fusion baseline in multimodal forecasting on Time-MMD.
Key contributions
- Development of a unified transformer model for joint language and time series understanding.
- Demonstration of competitive performance on both language and time series tasks.
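- Multimodal forecasts on the Time-MMD benchmark that beat every supervised fusion baseline, produced by the same backbone used for the language and classification tasks.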
Notable insights
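- Both modalities share the same transformer blocks, attention mechanism, and residual stream. Most of pretraining uses unimodal batches, so cross-modal capability emerges from shared parameters, followed by a short alignment stage that interleaves the two modalities.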
- The model is evaluated against both multimodal and unimodal baselines, providing a more comprehensive assessment of its capabilities.
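The training recipe described in the first insight (mostly unimodal batches, plus a short interleaved alignment stage) can be pictured as a simple batch schedule. The sketch below is only an illustration: the function name `modality_schedule`, the 5% alignment fraction, the random choice between modalities, and the placement of the alignment stage at the end of training are all assumptions, since the abstract gives none of these details.

```python
import random
from typing import Iterator


def modality_schedule(
    total_steps: int, alignment_frac: float = 0.05, seed: int = 0
) -> Iterator[str]:
    """Yield one batch type per training step (hypothetical schedule).

    Bulk phase: every batch is purely "text" or purely "series".
    Alignment phase: the final `alignment_frac` of steps use "interleaved"
    batches. Both the fraction and its position are assumptions.
    """
    if not 0.0 <= alignment_frac <= 1.0:
        raise ValueError("alignment_frac must be between 0 and 1")
    rng = random.Random(seed)
    alignment_steps = int(round(total_steps * alignment_frac))
    bulk_steps = total_steps - alignment_steps
    for _ in range(bulk_steps):
        # Unimodal batch: the shared parameters see one modality at a time.
        yield rng.choice(["text", "series"])
    for _ in range(alignment_steps):
        # Short alignment stage: both modalities appear in the same batch.
        yield "interleaved"


schedule = list(modality_schedule(total_steps=100, alignment_frac=0.1))
```

In this sketch the same model would consume every batch type, so the only thing that changes across phases is what the data loader emits.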
Possible limitations
- Not stated in the abstract
Abstract
arXiv:2605.20268v1
Real-world time series come with text: metadata, descriptions, news, reports. Yet time series foundation models process numerical sequences in isolation, and the multimodal text-and-time-series models that attempt to bridge the two all adapt a pretrained language model post hoc, inheriting representations shaped without ever seeing temporal data. These models are also evaluated almost exclusively against other multimodal baselines, not against the strongest unimodal foundation models in either domain, leaving open whether joint training is needed at all. We present Chronicle, a compact 324M-parameter decoder-only transformer trained from scratch on natural language and time series within a single unified architecture. Both modalities share the same transformer blocks, attention mechanism, and residual stream; the bulk of pretraining uses unimodal batches so cross-modal capability emerges purely from shared parameters, with a short alignment stage that interleaves the two. To our knowledge, Chronicle is the first model jointly pretrained on text and time series from scratch, and the first multimodal model evaluated against dedicated foundation models in both domains. It matches Gemma-3-270M-PT on 19 NLU tasks, sets a new bar for frozen-embedding time series classification on 24 UCR/UEA datasets, and produces multimodal forecasts on Time-MMD that beat every supervised fusion baseline, all from a single backbone.
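For readers unfamiliar with the term, frozen-embedding classification generally means the pretrained backbone is left untouched, each series is mapped to a fixed vector, and only a lightweight classifier is fit on those vectors. The sketch below illustrates that protocol under loud assumptions: `embed` is a hypothetical stand-in (mean and standard deviation) for a frozen model's embedding, and the nearest-centroid classifier is a swapped-in choice, since the abstract does not say which classifier the authors fit.

```python
import math
from typing import Dict, List, Sequence


def embed(series: Sequence[float]) -> List[float]:
    """Hypothetical stand-in for a frozen backbone: returns [mean, std]."""
    n = len(series)
    mean = sum(series) / n
    var = sum((x - mean) ** 2 for x in series) / n
    return [mean, math.sqrt(var)]


def fit_centroids(
    embeddings: Sequence[Sequence[float]], labels: Sequence[str]
) -> Dict[str, List[float]]:
    """Fit the lightweight head: one mean embedding per class."""
    sums: Dict[str, List[float]] = {}
    counts: Dict[str, int] = {}
    for vec, label in zip(embeddings, labels):
        if label not in sums:
            sums[label] = [0.0] * len(vec)
            counts[label] = 0
        sums[label] = [s + v for s, v in zip(sums[label], vec)]
        counts[label] += 1
    return {lab: [s / counts[lab] for s in sums[lab]] for lab in sums}


def predict(centroids: Dict[str, List[float]], embedding: Sequence[float]) -> str:
    """Assign the class whose centroid is closest in Euclidean distance."""
    return min(centroids, key=lambda lab: math.dist(centroids[lab], embedding))


# Toy data: flat series versus oscillating series.
train = [[1, 1, 1, 1], [2, 2, 2, 2], [0, 4, 0, 4], [1, 5, 1, 5]]
train_labels = ["flat", "flat", "wavy", "wavy"]

# The embedding function is never updated; only the centroids are fit.
centroids = fit_centroids([embed(s) for s in train], train_labels)
flat_pred = predict(centroids, embed([3, 3, 3, 3]))
wavy_pred = predict(centroids, embed([0, 6, 0, 6]))
```

Because the embedding function is fixed, any accuracy difference between two backbones under this protocol reflects the quality of their representations and not of task-specific fine-tuning.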