Data Language Models: A New Foundation Model Class for Tabular Data

Eda Erol, Giuliano Pezzoli, Ozer Cem Kelahmet

Published May 8, 2026. Featured #6. In the daily list May 9, 2026.


Daily score: 70.1

Editorial review: 7.5

Relevance: 0.469

Freshness: 0.722

Why It Matters

What makes this one worth your time

This work targets a gap in the foundation-model landscape: tabular data, the modality behind many consequential real-world decisions, has no model that consumes raw tables natively.

It proposes a new foundation model class for tabular data that, according to the authors, removes the preprocessing pipeline and improves prediction accuracy on established benchmarks.

Summary

The paper introduces the Data Language Model (DLM), a new class of foundation model for tabular data that processes raw cell values without serialization or preprocessing. It also presents Schema-1, the first DLM, which the authors report outperforms gradient-boosted ensembles, AutoML stacks, and the tabular foundation models they evaluate on established row-level prediction benchmarks.

Key contributions

Introduction of the Data Language Model (DLM) for tabular data.
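Development of Schema-1, a 140M parameter model trained on more than 2.3M synthetic and real-world tabular datasets.
Reported gains over existing models in row-level prediction and in missing value reconstruction, where it is compared against classical statistical methods and frontier large language models.
Identification of an unseen dataset's industry sector from raw cell values alone, a task the authors say no prior tabular model can perform.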

Notable insights

The DLM's ability to understand tabular data natively could streamline workflows by removing the need for complex preprocessing pipelines.
The reported missing value reconstruction results suggest that modeling a dataset's own distributional structure is more useful for imputation than world knowledge encoded in language.
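
To make the missing value reconstruction comparison concrete, the following is a minimal sketch of how reconstruction error for a classical baseline could be measured: hide some cells, impute them, and score only the hidden cells. It is not the paper's evaluation protocol, which the abstract does not describe. The column values, the masking fraction, the choice of mean imputation, and the RMSE metric are all assumptions made for illustration.

```python
import math
import random


def mask_cells(column, fraction, rng):
    """Hide a random subset of cells; return the masked column and hidden indices."""
    n_hidden = max(1, int(len(column) * fraction))
    hidden = sorted(rng.sample(range(len(column)), n_hidden))
    masked = [None if i in hidden else v for i, v in enumerate(column)]
    return masked, hidden


def mean_impute(masked):
    """Classical baseline: fill each missing cell with the mean of the observed cells."""
    observed = [v for v in masked if v is not None]
    fill = sum(observed) / len(observed)
    return [fill if v is None else v for v in masked]


def reconstruction_rmse(truth, imputed, hidden):
    """Root-mean-square error computed over the hidden cells only."""
    squared = [(truth[i] - imputed[i]) ** 2 for i in hidden]
    return math.sqrt(sum(squared) / len(squared))


rng = random.Random(0)  # fixed seed so the masking is repeatable
ages = [23.0, 35.0, 41.0, 29.0, 52.0, 38.0, 47.0, 31.0]  # hypothetical column
masked, hidden = mask_cells(ages, fraction=0.25, rng=rng)
imputed = mean_impute(masked)
rmse = reconstruction_rmse(ages, imputed, hidden)
print(f"hidden cells: {hidden}, RMSE of mean imputation: {rmse:.2f}")
```

Any imputation method, whether a statistical baseline or a learned model, can be swapped in for `mean_impute` and scored with the same function, which is what makes a lower-is-better comparison across methods possible.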

Possible limitations

Not stated in the abstract.

Abstract

arXiv:2605.06290v1

Every major data modality now has a foundation model that understands it natively: text has language models, images have vision models, audio has audio models. Tabular data, the modality on which many consequential real-world AI decisions are made, does not. Every approach to tabular AI today, from gradient-boosted trees to the latest tabular foundation models, requires a preprocessing pipeline before any model can consume the data. None of them understand tabular data as a modality. We introduce the Data Language Model (DLM), the missing foundation model for tabular data. A DLM understands tables the way a language model understands sentences: natively, without serialization or preprocessing, directly from raw cell values. It is the tabular data layer on which AI models, agents, and vertical AI applications can be built, eliminating the preprocessing pipelines that currently stand between raw data and every AI system that consumes it. We present Schema-1, the first DLM: a 140M parameter model trained on more than 2.3M synthetic and real-world tabular datasets. Schema-1 outperforms gradient-boosted ensembles, AutoML stacks, and the tabular foundation models we evaluate on established row-level prediction benchmarks. On missing value reconstruction it achieves lower reconstruction error than all classical statistical methods and frontier large language models on mean performance across conditions, establishing that structural understanding of a dataset's own distributional geometry is more useful for imputation than world knowledge encoded in language. It identifies the industry sector of any unseen dataset from raw cell values alone, reliably across any domain, a task no prior tabular model can perform. It is the native tabular understanding layer that has been missing from the AI stack.