GeoMeld: Toward Semantically Grounded Foundation Models for Remote Sensing

Maram Hasan, Md Aminur Hossain, Savitra Roy, Souparna Bhowmik, Ayush V. Patel, Mainak Singha, Subhasis Chaudhuri, Muhammad Haris Khan, Biplab Banerjee

Score: 8.500
LLM: n/a
Embedding: 0.497
Recency: n/a

Feedback

Why It Matters

This paper tackles semantically grounded foundation modeling in remote sensing by pairing a large-scale multimodal dataset with a novel pretraining framework, which could improve the accuracy, transferability, and cross-sensor robustness of remote sensing models.

Contributions

  • Introduction of GeoMeld, a large-scale multimodal dataset for remote sensing.
  • Development of GeoMeld-FM, a pretraining framework that combines multi-pretext masked autoencoding, JEPA representation learning, and caption-vision contrastive alignment (a sketch of this joint objective follows below).
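
The paper does not spell out the exact form of this joint objective, so the following is only a minimal PyTorch-style sketch of how the three components might be combined; the module name, loss weights, and the CLIP-style InfoNCE formulation of the contrastive term are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical sketch of GeoMeld-FM's joint objective. The paper names three
# components (multi-pretext masked autoencoding, JEPA representation learning,
# caption-vision contrastive alignment) but not their exact formulation or
# weighting; everything below is an illustrative assumption.

class JointPretrainingLoss(nn.Module):
    def __init__(self, lambda_jepa: float = 1.0, lambda_align: float = 1.0,
                 temperature: float = 0.07):
        super().__init__()
        self.lambda_jepa = lambda_jepa
        self.lambda_align = lambda_align
        self.temperature = temperature

    def forward(self, mae_pred, mae_target, jepa_pred, jepa_target,
                img_emb, txt_emb):
        # 1) Masked-autoencoding reconstruction over the aligned modalities.
        l_mae = F.mse_loss(mae_pred, mae_target)

        # 2) JEPA: predict target-encoder features in latent space; the
        #    target branch is detached, as in standard JEPA-style training.
        l_jepa = F.mse_loss(jepa_pred, jepa_target.detach())

        # 3) Caption-vision contrastive alignment, written here as a
        #    CLIP-style symmetric InfoNCE loss (an assumption, not the
        #    authors' stated formulation).
        img = F.normalize(img_emb, dim=-1)
        txt = F.normalize(txt_emb, dim=-1)
        logits = img @ txt.t() / self.temperature
        labels = torch.arange(logits.size(0), device=logits.device)
        l_align = (F.cross_entropy(logits, labels)
                   + F.cross_entropy(logits.t(), labels)) / 2

        return l_mae + self.lambda_jepa * l_jepa + self.lambda_align * l_align
```

Under this reading, the reconstruction term encourages cross-sensor physical consistency, the JEPA term shapes a predictive latent space, and the contrastive term ties visual features to the grounded captions.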

Insights

  • The integration of semantically grounded language supervision with multimodal datasets can significantly improve cross-sensor robustness and transferability in remote sensing models.

Limitations

  • The dataset and framework may require significant computational resources for training and deployment.

Tags

  • alignment
  • data
  • multimodal
  • vision_language

Abstract

arXiv:2604.10591v1

Effective foundation modeling in remote sensing requires spatially aligned heterogeneous modalities coupled with semantically grounded supervision, yet such resources remain limited at scale. We present GeoMeld, a large-scale multimodal dataset with approximately 2.5 million spatially aligned samples. The dataset spans diverse modalities and resolutions and is constructed under a unified alignment protocol for modality-aware representation learning. GeoMeld provides semantically grounded language supervision through an agentic captioning framework that synthesizes and verifies annotations from spectral signals, terrain statistics, and structured geographic metadata, encoding measurable cross-modality relationships within textual descriptions. To leverage this dataset, we introduce GeoMeld-FM, a pretraining framework that combines multi-pretext masked autoencoding over aligned modalities, JEPA representation learning, and caption-vision contrastive alignment. This joint objective enables the learned representation space to capture both reliable cross-sensor physical consistency and grounded semantics. Experiments demonstrate consistent gains in downstream transfer and cross-sensor robustness. Together, GeoMeld and GeoMeld-FM establish a scalable reference framework for semantically grounded multi-modal foundation modeling in remote sensing.
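
To picture the dataset side of the abstract, here is a purely hypothetical sketch of what a single spatially aligned record with grounded language supervision might look like; the paper does not publish a schema, so all field names and modality choices below are assumptions.

```python
from dataclasses import dataclass, field
import numpy as np

# Purely illustrative: the paper does not publish a sample schema, so the
# field names and modality choices below are assumptions about what one of
# the ~2.5M spatially aligned, caption-supervised GeoMeld records might hold.

@dataclass
class AlignedSample:
    optical: np.ndarray    # multispectral patch, shape (C, H, W)
    sar: np.ndarray        # co-registered SAR patch, shape (C, H, W)
    elevation: np.ndarray  # terrain/DEM patch, shape (1, H, W)
    caption: str           # agentically synthesized, verified description
    metadata: dict = field(default_factory=dict)  # structured geographic metadata
```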