Mind the Gap? A Distributional Comparison of Real and Synthetic Priors for Tabular Foundation Models
Alex O. Davies, Telmo de Menezes e Silva Filho, Nirav Ajmeri
Why It Matters
What makes this one worth your time
Tabular foundation models are pre-trained on curated, web-scraped, or synthetic tables, yet little is known about how these corpora relate to one another in distribution or how that relationship affects downstream performance.
This study finds that the TabICL synthetic prior covers only a narrow region of the space of real tables, and that this gap has no clearly detectable effect on performance.
Summary
The paper investigates the distributional relationships between the real and synthetic corpora used to pre-train tabular foundation models. It finds a substantial distributional gap between the TabICL synthetic prior and real tables, but no clearly detectable effect of that gap on downstream performance.
Key contributions
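- Characterization of three archetypal pre-training corpora (T4 for web-scraped tables, TabFM for curated Kaggle tables, TabICL for synthetic tables) using aggregate features over whole tables, columns and correlations.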
- Comparison of real and synthetic datasets using discriminator AUCs and k-NN coverage metrics.
- Evidence that curated and web-scraped corpora are broadly interchangeable at a distributional level in feature space.
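The discriminator-AUC comparison named above can be illustrated with a minimal sketch. It assumes each table has already been reduced to a numeric feature vector, and it uses a small logistic-regression discriminator as a stand-in; the paper's actual features, classifier and evaluation protocol are not given in the abstract, so every function name here is hypothetical. An AUC near 0.5 means the two corpora are hard to tell apart in feature space, while an AUC near 1.0 means they are easily separated.

```python
import math
import random


def roc_auc(labels, scores):
    """AUC via the pairwise (Mann-Whitney) identity; ties count as 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def fit_logistic(X, y, lr=0.1, epochs=200):
    """Plain batch gradient descent on the logistic loss (illustrative only)."""
    d = len(X[0])
    w, b = [0.0] * d, 0.0
    n = len(X)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            z = b + sum(wj * xj for wj, xj in zip(w, xi))
            z = max(-30.0, min(30.0, z))  # clamp to avoid overflow in exp
            err = 1.0 / (1.0 + math.exp(-z)) - yi
            for j in range(d):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def discriminator_auc(real, synth, seed=0):
    """Train on one half of the pooled data, report AUC on the held-out half."""
    rng = random.Random(seed)
    data = [(x, 1) for x in real] + [(x, 0) for x in synth]
    rng.shuffle(data)
    split = len(data) // 2
    train, test = data[:split], data[split:]
    w, b = fit_logistic([x for x, _ in train], [y for _, y in train])
    scores = [b + sum(wj * xj for wj, xj in zip(w, x)) for x, _ in test]
    return roc_auc([y for _, y in test], scores)
```

Calling `discriminator_auc(real_features, synthetic_features)` on two lists of equal-length feature vectors returns a single held-out AUC. A stronger discriminator (for example a tree ensemble) would give a tighter estimate of separability; the linear model is used here only to keep the sketch dependency-free.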
Notable insights
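- The TabICL synthetic prior occupies a narrow region of the space of real tables, and this mismatch is not closed by optimising the prior's hyper-parameters across more than 86 thousand configurations.
- The distributional gap has no clearly detectable effect on performance under either feature-based proximity measures or TabICL's own internal representations, suggesting that coverage of the real-data distribution is not the primary driver of TabICL's generalization.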
Possible limitations
- Not stated in the abstract.
Abstract
arXiv:2605.06343v1
Tabular foundation models are pre-trained on one of three classes of corpus: curated datasets drawn from benchmark repositories, tables harvested at scale from the web, or synthetic tables sampled from a parametric generative prior. Despite the centrality of pre-training data to model performance, little is known about how these corpora relate to one another in distribution, and the impact this has on downstream performance. In this work we take three canonical, archetypal datasets used to train tabular foundation models: the T4 dataset represents web-scraped corpora, the TabFM dataset curated tables from Kaggle, and the TabICL dataset as the only well-used synthetic prior with publicly available parameters. We characterise each corpus using aggregate features over whole tables, columns and correlations, and compare them using discriminator AUCs and k-NN coverage metrics. We find that the TabICL synthetic prior occupies a narrow region of the space of real tables, that this mismatch cannot be closed by optimising prior hyper-parameters across more than 86 thousand configurations, and that curated and web-scraped corpora are broadly interchangeable on a distributional level in feature space. Surprisingly, the distributional gap between synthetic pre-training data and real tables has a clearly detectable effect on performance under neither feature-based proximity measures nor TabICL's own internal representations, suggesting that coverage of the real-data distribution is not the primary driver of TabICL's generalisation.
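To make the notion of k-NN coverage concrete, the sketch below uses one common definition: a real table counts as covered when at least one synthetic table lies within the distance to that real table's k-th nearest real neighbour. This definition is an assumption; the abstract does not say which coverage variant, distance or value of k the authors use, and the function names are hypothetical. Inputs are again assumed to be per-table feature vectors.

```python
import math


def kth_nn_radius(points, k):
    """Distance from each point to its k-th nearest neighbour in the same set."""
    radii = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        radii.append(dists[k - 1])
    return radii


def knn_coverage(real, synth, k=5):
    """Fraction of real points with a synthetic point inside their k-NN radius."""
    radii = kth_nn_radius(real, k)
    covered = 0
    for p, r in zip(real, radii):
        nearest_synth = min(math.dist(p, s) for s in synth)
        if nearest_synth <= r:
            covered += 1
    return covered / len(real)


if __name__ == "__main__":
    # Toy data: real points on a 10 x 10 grid, synthetic points in one corner.
    real = [(float(i), float(j)) for i in range(10) for j in range(10)]
    narrow = [(0.0, 0.0), (0.5, 0.5), (1.0, 0.0)]
    print("coverage by a narrow synthetic set:", knn_coverage(real, narrow, k=1))
    print("coverage by a copy of the real set:", knn_coverage(real, real, k=1))
```

Under this definition, a synthetic prior that "occupies a narrow region of the space of real tables" would show up as a low coverage value, because most real tables have no synthetic neighbour nearby. The brute-force distance computation is quadratic in the number of tables and is only meant for small illustrative inputs.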