
Foundation Models to Unlock Real-World Evidence from Nationwide Medical Claims

Fan Ma, Yuntian Liu, Xiang Lan, Weipeng Zhou, Jun Ni, Mauro Giuffrè, Lingfei Qian, Xueqing Peng, Yujia Zhou, Ruey-Ling Weng, Huan He, Lu Li, Huiyuan Wang, Qingyu Chen, Andrew Loza, Laila Rasmy, Degui Zhi, Yuan Lu, Chenjie Zeng, Joshua C Denny, Lee Schwamm, Daniella Meeker, Lucila Ohno-Machado, Yong Chen, Hua Xu

Published May 7, 2026
Featured #2
In the daily list May 7, 2026
Daily score: 77.9
Editorial review: 8.2
Relevance: 0.463
Freshness: 0.722

Why It Matters

What makes this one worth your time

This research could change how real-world evidence is generated for regulatory evaluation and healthcare decision-making: in a target trial emulation, ReClaim reduced systematic bias by 72% on average relative to the Delphi model.

ReClaim is trained on 43.8 billion medical events from more than 200 million MarketScan enrollees (2008-2022) to predict disease onset and forecast healthcare expenditure.

Summary

The paper introduces ReClaim, a generative transformer trained from scratch on MarketScan medical claims data to predict disease onset and forecast healthcare expenditures. Across over 1,000 disease-onset prediction tasks it reaches a mean AUC of 75.6%, compared with 66.3% for disease-specific LightGBM and 69.4% for the transformer-based Delphi model.

Key contributions

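  • Development of ReClaim, a generative transformer trained from scratch on medical claims data and scaled to 140 million, 700 million, and 1.7 billion parameters.
  • Gains over existing models in disease prediction (mean AUC of 75.6% versus 66.3% for LightGBM and 69.4% for Delphi) and in healthcare expenditure forecasting (explained variance of 0.37 versus 0.28 for LightGBM).
  • Validation in retrospective and prospective evaluations and in external validation on two independent datasets.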

Notable insights

  • Performance improved monotonically with model scale, and post-training added 13.8 percentage points over pre-training alone.
  • The largest gains were for rare diseases, and the learned representations generalized across time periods and data sources.

Possible limitations

  • Not stated in the abstract.

Abstract

arXiv:2605.02740v2

Evidence derived from large-scale real-world data (RWD) is increasingly informing regulatory evaluation and healthcare decision-making. Administrative claims provide population-scale, longitudinal records of healthcare utilization, expenditure, and detailed coding of diagnoses, procedures, and medications, yet their potential as a substrate for healthcare foundation models remains largely unexplored. Here we present ReClaim, a generative transformer trained from scratch on 43.8 billion medical events from more than 200 million enrollees in the MarketScan claims data spanning 2008-2022. ReClaim models longitudinal trajectories across diagnoses, procedures, medications, and expenditure, and was scaled to 140 million, 700 million, and 1.7 billion parameters. Across over 1,000 disease-onset prediction tasks, ReClaim achieved a mean AUC of 75.6%, substantially outperforming disease-specific LightGBM (66.3%) and the transformer-based Delphi model (69.4%), with the largest gains for rare diseases. These advantages held across retrospective and prospective evaluations and in external validation on two independent datasets. Performance improved monotonically with scale, and post-training added 13.8 percentage points over pre-training alone. Beyond disease prediction, ReClaim captured financial outcomes and improved real-world evidence (RWE) analyses: for healthcare expenditure forecasting it increased explained variance from 0.28 to 0.37 relative to LightGBM, and in a target trial emulation it reduced systematic bias by 72% on average relative to Delphi. Together, these results establish administrative claims as a scalable substrate for healthcare foundation models and show that learned representations generalize across time periods and data sources, supporting disease surveillance, expenditure forecasting, and RWE generation.
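For readers unfamiliar with the two headline metrics, the following is a minimal Python sketch of how a mean AUC across tasks and an explained-variance score are commonly computed. It is not the authors' evaluation code: the function names and toy inputs are hypothetical, and the abstract does not say which exact definition of explained variance was used, so the formula below (one minus the ratio of residual variance to target variance) is an assumption.

```python
from statistics import mean, pvariance


def roc_auc(labels, scores):
    """AUC as the chance that a random positive case scores above a
    random negative case, with ties counted as half a win."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


def mean_auc(tasks):
    """Unweighted average of per-task AUCs (assumed aggregation).
    `tasks` is an iterable of (labels, scores) pairs, one per task."""
    return mean(roc_auc(labels, scores) for labels, scores in tasks)


def explained_variance(y_true, y_pred):
    """Assumed definition: 1 - Var(y_true - y_pred) / Var(y_true)."""
    residuals = [t - p for t, p in zip(y_true, y_pred)]
    return 1.0 - pvariance(residuals) / pvariance(y_true)


# Hypothetical toy data: two prediction tasks and one cost forecast.
toy_tasks = [
    ([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]),
    ([1, 0, 1, 0], [0.9, 0.2, 0.6, 0.3]),
]
toy_costs = [1, 2, 3, 4]
```

Calling `mean_auc(toy_tasks)` averages the per-task AUCs, and `explained_variance(toy_costs, predictions)` returns 1.0 for perfect predictions and 0.0 for a forecast that always predicts the mean cost.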