OphMAE: Bridging Volumetric and Planar Imaging with a Foundation Model for Adaptive Ophthalmological Diagnosis

Tienyu Chang, Zhen Chen, Renjie Liang, Jinyu Ding, Jie Xu, Sunu Mathew, Amir Reza Hajrasouliha, Andrew J. Saykin, Ruogu Fang, Yu Huang, Jiang Bian, Qingyu Chen

Published May 6, 2026 | Featured #1 | In the daily list May 7, 2026


Daily score: 78.6

Editorial review: 8.2

Relevance: 0.468

Freshness: 0.722

Why It Matters

What makes this one worth your time

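This research addresses a gap in ophthalmic AI: most current models infer from a single imaging modality, whereas clinical diagnosis combines complementary modalities. OphMAE supports multi-modality analysis and still performs well when only 2D imaging is available, which matters in settings that lack 3D imaging hardware.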

OphMAE combines 3D volumetric OCT and 2D en face OCT in a single foundation model for ophthalmological diagnosis.

Summary

The paper introduces OphMAE, a foundation model that integrates volumetric (3D OCT) and planar (2D en face OCT) imaging for ophthalmological diagnosis. It was pre-trained on 183,875 paired OCT images from 32,765 patients and reports state-of-the-art performance across 17 diagnostic tasks.

Key contributions

Development of the Ophthalmic multimodal Masked Autoencoder (OphMAE) for integrating 3D OCT and 2D en face OCT imaging.
A novel adaptive inference mechanism; the model retains 93.7% AUC for AMD even when restricted to single-modality 2D inputs.
Reported AUCs of 96.9% for AMD and 97.2% for DME on a benchmark of 17 diagnostic tasks, surpassing existing single-modal and multimodal foundation models.
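
For readers unfamiliar with masked autoencoders, the core pre-training idea can be sketched as follows: hide a random subset of image patches, encode only the visible ones, and score the reconstruction on the hidden ones. This is a generic sketch of the masked-autoencoder recipe, not OphMAE's actual implementation; the function names, the 0.75 mask ratio, and the flat patch-token layout are all assumptions made for illustration.

```python
import numpy as np

def random_mask_patches(tokens, mask_ratio=0.75, rng=None):
    """Split patch tokens into a visible subset and a boolean mask.

    tokens: array of shape (num_patches, dim).
    Returns (visible_tokens, mask), where mask[i] is True if patch i is hidden.
    The default mask_ratio is an assumed value, not taken from the paper.
    """
    rng = np.random.default_rng() if rng is None else rng
    num_patches = tokens.shape[0]
    num_keep = int(round(num_patches * (1.0 - mask_ratio)))
    # A random permutation decides which patches stay visible.
    keep_idx = np.sort(rng.permutation(num_patches)[:num_keep])
    mask = np.ones(num_patches, dtype=bool)
    mask[keep_idx] = False
    return tokens[keep_idx], mask

def masked_mse(reconstruction, target, mask):
    """Mean squared error computed on the hidden patches only."""
    diff = (reconstruction - target) ** 2
    return float(diff[mask].mean())
```

In a full model, an encoder would see only `visible_tokens` and a decoder would predict all patches; `masked_mse` would then serve as the training loss.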

Notable insights

The cross-modal fusion architecture allows for effective integration of different imaging modalities, which is often a challenge in medical AI.
The model retains 95.7% AUC with as few as 500 labeled samples and 93.7% AUC for AMD on 2D inputs alone, which suggests it could be deployed where labeled data or 3D imaging hardware are scarce.
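
The abstract does not describe how inference adapts to a missing modality. As a purely illustrative sketch, suppose each modality produces its own embedding and fusion simply averages whichever embeddings are present. Both of those are assumptions, and `fuse_available`, `emb_3d`, and `emb_2d` are hypothetical names rather than anything from the paper.

```python
import numpy as np

def fuse_available(emb_3d=None, emb_2d=None):
    """Hypothetical fusion: average whichever modality embeddings are present.

    This stands in for the paper's cross-modal fusion, whose details are
    not given in the abstract.
    """
    available = [e for e in (emb_3d, emb_2d) if e is not None]
    if not available:
        raise ValueError("at least one modality embedding is required")
    return np.mean(np.stack(available), axis=0)
```

With both inputs the result is the element-wise mean of the two embeddings; with only `emb_2d` the function returns that embedding unchanged, so a downstream classifier can run without a 3D scan.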

Possible limitations

Not stated in the abstract.

Abstract

arXiv:2605.02714v1. The advent of foundation models has heralded a new era in medical artificial intelligence (AI), enabling the extraction of generalizable representations from large-scale unlabeled datasets. However, current ophthalmic AI paradigms are predominantly constrained to single-modality inference, thereby creating a dissonance with clinical practice where diagnosis relies on the synthesis of complementary imaging modalities. Furthermore, the deployment of high-performance AI in resource-limited settings is frequently impeded by the unavailability of advanced three-dimensional imaging hardware. Here, we present the Ophthalmic multimodal Masked Autoencoder (OphMAE), a multi-imaging foundation model engineered to synergize the volumetric depth of 3D Optical Coherence Tomography (OCT) with the planar context of 2D en face OCT. By implementing a novel cross-modal fusion architecture and a unique adaptive inference mechanism, OphMAE was pre-trained on a massive dataset of 183,875 paired OCT images derived from 32,765 patients. In a rigorous benchmark encompassing 17 diverse diagnostic tasks with 48,340 paired OCT images from 8,191 patients, the model demonstrated state-of-the-art performance, achieving an Area Under the Curve (AUC) of 96.9% for Age-related Macular Degeneration (AMD) and 97.2% for Diabetic Macular Edema (DME), consistently surpassing existing single-modal and multimodal foundation models. Crucially, OphMAE exhibits robust engineering adaptability: it maintains high diagnostic accuracy, such as 93.7% AUC for AMD, even when restricted to single-modality 2D inputs, and demonstrates exceptional data efficiency by retaining 95.7% AUC with as few as 500 labeled samples. This work establishes a scalable and adaptable framework for ophthalmic AI, ensuring robust performance across different tasks.
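
Since every headline result above is an AUC, it may help to recall what the metric measures: the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case, with ties counted as half. The sketch below computes it by direct pairwise comparison. The function name and the toy labels and scores are illustrative assumptions, not data from the paper.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison.

    labels: iterable of 0/1 ground-truth labels.
    scores: iterable of model scores, higher meaning "more likely positive".
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive correctly ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(pos) * len(neg))

# Toy example: 3 of the 4 positive/negative pairs are ranked correctly.
print(roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
```

An AUC of 96.9% therefore means that about 97 of every 100 such positive/negative pairs are ordered correctly, independent of any particular decision threshold.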