Chronicle: A Multimodal Foundation Model for Joint Language and Time Series Understanding

Paul Quinlan, Jeremy Levasseur, Qingguo Li, Xiaodan Zhu

Published May 22, 2026 | Featured #8 | In the daily list May 23, 2026

Daily score: 64.3

Editorial review: 7.2

Relevance: 0.483

Freshness: 0.722

Why It Matters

What makes this one worth your time

This work asks whether joint pretraining on text and time series is needed at all, and answers by comparing one compact model against the strongest unimodal foundation models in each domain rather than only against other multimodal baselines.

Chronicle is a unified model for language and time series that achieves competitive results across both domains.

Summary

The paper introduces Chronicle, a 324M-parameter decoder-only transformer trained from scratch on both natural language and time series within a single unified architecture. Evaluated against multimodal and unimodal models, it matches Gemma-3-270M-PT on 19 NLU tasks, sets a new bar for frozen-embedding time series classification on 24 UCR/UEA datasets, and beats every supervised fusion baseline on Time-MMD multimodal forecasting.

Key contributions

Development of a unified transformer model for joint language and time series understanding.
Demonstration of competitive performance on both language and time series tasks.
Multimodal forecasting on Time-MMD that beats every supervised fusion baseline.

Notable insights

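Both modalities share the same transformer blocks, attention mechanism, and residual stream. Most of pretraining uses unimodal batches, so cross-modal capability emerges from shared parameters, followed by a short alignment stage that interleaves the two modalities.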
The model is evaluated against both multimodal and unimodal baselines, providing a more comprehensive assessment of its capabilities.
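
The abstract describes a pretraining recipe in which most steps use unimodal batches and a short final stage interleaves text and time series. A minimal sketch of such a schedule follows. The function name `modality_schedule`, the strict alternation between modalities, and the `alignment_fraction` value are illustrative assumptions; the abstract does not give the actual proportions or ordering.

```python
from itertools import cycle


def modality_schedule(total_steps, alignment_fraction=0.1):
    """Yield one batch label per training step.

    Hypothetical schedule: unimodal batches alternate between text and
    time series, then a short final stage uses interleaved batches.
    The alternation and the fraction are assumptions for illustration.
    """
    if not 0.0 <= alignment_fraction <= 1.0:
        raise ValueError("alignment_fraction must be in [0, 1]")
    alignment_steps = int(total_steps * alignment_fraction)
    unimodal_steps = total_steps - alignment_steps

    # Bulk of training: each batch contains a single modality.
    unimodal = cycle(["text", "time_series"])
    for _ in range(unimodal_steps):
        yield next(unimodal)

    # Short alignment stage: each batch mixes both modalities.
    for _ in range(alignment_steps):
        yield "interleaved"


schedule = list(modality_schedule(total_steps=10, alignment_fraction=0.2))
print(schedule)
```

In this sketch the two modalities only ever meet in the last few steps, which mirrors the abstract's claim that cross-modal capability comes mainly from shared parameters rather than from mixed batches.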

Possible limitations

Not stated in the abstract

Abstract

arXiv:2605.20268v1

Real-world time series come with text: metadata, descriptions, news, reports. Yet time series foundation models process numerical sequences in isolation, and the multimodal text-and-time-series models that attempt to bridge the two all adapt a pretrained language model post hoc, inheriting representations shaped without ever seeing temporal data. These models are also evaluated almost exclusively against other multimodal baselines, not against the strongest unimodal foundation models in either domain, leaving open whether joint training is needed at all. We present Chronicle, a compact 324M-parameter decoder-only transformer trained from scratch on natural language and time series within a single unified architecture. Both modalities share the same transformer blocks, attention mechanism, and residual stream; the bulk of pretraining uses unimodal batches so cross-modal capability emerges purely from shared parameters, with a short alignment stage that interleaves the two. To our knowledge, Chronicle is the first model jointly pretrained on text and time series from scratch, and the first multimodal model evaluated against dedicated foundation models in both domains. It matches Gemma-3-270M-PT on 19 NLU tasks, sets a new bar for frozen-embedding time series classification on 24 UCR/UEA datasets, and produces multimodal forecasts on Time-MMD that beat every supervised fusion baseline, all from a single backbone.
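
Frozen-embedding classification, as the term is generally used, means the encoder stays fixed and only a lightweight classifier is fit on its outputs. The sketch below illustrates that protocol under loud assumptions: `frozen_embed` is a hypothetical stand-in that computes two summary statistics, not Chronicle's backbone, and the nearest-centroid classifier is a swapped-in choice because the abstract does not say which classifier the authors use.

```python
from statistics import mean, pstdev


def frozen_embed(series):
    """Hypothetical stand-in for a frozen encoder.

    Maps a series to a fixed-length vector (mean, population std).
    A real setup would call the pretrained backbone here instead.
    """
    return (mean(series), pstdev(series))


def fit_centroids(embeddings, labels):
    """Average the embeddings of each class; the encoder is never updated."""
    grouped = {}
    for vec, label in zip(embeddings, labels):
        grouped.setdefault(label, []).append(vec)
    return {
        label: tuple(mean(dim) for dim in zip(*vecs))
        for label, vecs in grouped.items()
    }


def predict(centroids, vec):
    """Return the label whose centroid is nearest in squared Euclidean distance."""
    def dist(centroid):
        return sum((a - b) ** 2 for a, b in zip(vec, centroid))
    return min(centroids, key=lambda label: dist(centroids[label]))


# Toy data, invented for illustration only.
train_series = [
    [0.0, 0.1, 0.0, 0.1],
    [0.1, 0.0, 0.1, 0.0],
    [0.0, 2.0, 0.0, 2.0],
    [2.0, 0.0, 2.0, 0.0],
]
train_labels = ["flat", "flat", "spiky", "spiky"]

centroids = fit_centroids([frozen_embed(s) for s in train_series], train_labels)
prediction = predict(centroids, frozen_embed([0.0, 1.8, 0.0, 1.8]))
print(prediction)
```

Because only the centroids are fit, any difference in accuracy between two encoders under this protocol reflects the quality of their embeddings rather than task-specific fine-tuning.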