FLORO: A Multimodal Geospatial Foundation Model for Ecological Remote Sensing Across Sensors and Scales
Jorge L. Rodriguez, Victor Angulo Morales, Areej Alwahas, Mariana Elias Lara, Fida Mohammad Thoker, Kasper Johansen, Bernard Ghanem, Fernando T. Maestre, Matthew F. McCabe
Why It Matters
This work is relevant to AI researchers and engineers building remote sensing applications. It shows that a foundation model pretrained on a small but diverse corpus can still transfer well, which matters for ecological and environmental monitoring, where observations vary across platforms, resolutions, and available modalities.
FLORO is a multimodal geospatial foundation model that transfers well to ecological remote sensing tasks despite a comparatively small pretraining corpus.
Summary
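The paper introduces FLORO, a multimodal geospatial foundation model designed to learn transferable representations from a small but highly diverse remote sensing corpus. It is pretrained with masked autoencoding on Sentinel-1, Sentinel-2, SkySAT imagery, elevation, and UAV-derived data, and it handles sensor variability through availability-aware inputs that indicate which bands and modalities each sample contains. Evaluated on the PANGAEA benchmark under a frozen-encoder protocol, FLORO ranks second in average segmentation performance across six benchmarks, remains competitive on scene classification, and is robust on regression, despite being pretrained on a smaller corpus than competing models.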
Key contributions
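- Introduction of FLORO, a multimodal geospatial foundation model pretrained with masked autoencoding on a heterogeneous mix of Sentinel-1, Sentinel-2, SkySAT, elevation, and UAV-derived data.
- Second-best average segmentation performance across six PANGAEA benchmarks under a frozen-encoder protocol, with a smaller pretraining corpus than competing foundation models.
- Use of availability-aware inputs to manage sensor variability.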
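One way to picture availability-aware inputs is as a fixed channel layout plus a presence flag per channel. The sketch below is illustrative only: the channel names in `CANONICAL_CHANNELS`, the zero fill for missing channels, and the function `build_unified_input` are assumptions made for this example, not details taken from the paper.

```python
from typing import Dict, List, Tuple

Grid = List[List[float]]

# Hypothetical canonical channel order (assumption; the abstract does not list bands).
CANONICAL_CHANNELS = ["S2_B02", "S2_B03", "S2_B04", "S2_B08", "S1_VV", "S1_VH", "DEM"]


def build_unified_input(
    sample: Dict[str, Grid], height: int, width: int
) -> Tuple[List[Grid], List[int]]:
    """Map a sample with any subset of channels onto the canonical layout.

    Returns the channel stack (missing channels zero-filled, an assumption)
    and a 0/1 availability vector marking which channels were present.
    """
    channels: List[Grid] = []
    availability: List[int] = []
    for name in CANONICAL_CHANNELS:
        if name in sample:
            grid = sample[name]
            if len(grid) != height or any(len(row) != width for row in grid):
                raise ValueError(f"channel {name} does not match {height}x{width}")
            channels.append([list(row) for row in grid])
            availability.append(1)
        else:
            # Placeholder values; the availability flag tells the model to ignore them.
            channels.append([[0.0] * width for _ in range(height)])
            availability.append(0)
    return channels, availability


# Usage: an optical-only sample with three bands still yields a 7-channel stack.
rgb_only = {
    "S2_B02": [[0.1, 0.2], [0.3, 0.4]],
    "S2_B03": [[0.2, 0.2], [0.2, 0.2]],
    "S2_B04": [[0.5, 0.4], [0.3, 0.2]],
}
stack, mask = build_unified_input(rgb_only, height=2, width=2)
```

With this layout, samples from different sensors share one tensor shape, and the availability vector distinguishes a genuinely dark pixel from a channel that was never observed.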
Notable insights
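- The availability-aware inputs flag which spectral bands and auxiliary modalities are present in each sample, so heterogeneous sensor configurations share a unified input space.
- In a separate controlled experiment on EuroSAT-MS, geo-positional encoding improved classification relative to absolute positional encoding.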
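The difference between absolute and geo-positional encoding can be sketched with a standard sine/cosine encoding. This is a minimal illustration under stated assumptions: the abstract does not describe FLORO's encoding, so the sinusoidal form, the use of latitude and longitude in radians, and the function names below are all hypothetical.

```python
import math
from typing import List


def sinusoidal_encoding(value: float, dim: int, max_period: float = 10000.0) -> List[float]:
    """Encode a scalar as interleaved sine/cosine features (dim must be even)."""
    if dim % 2 != 0:
        raise ValueError("dim must be even")
    half = dim // 2
    features: List[float] = []
    for i in range(half):
        freq = 1.0 / (max_period ** (i / half))
        features.append(math.sin(value * freq))
        features.append(math.cos(value * freq))
    return features


def absolute_position_encoding(row: int, col: int, dim: int) -> List[float]:
    """Encode a patch by its grid index inside the image (dim divisible by 4)."""
    half = dim // 2
    return sinusoidal_encoding(float(row), half) + sinusoidal_encoding(float(col), half)


def geo_position_encoding(lat_deg: float, lon_deg: float, dim: int) -> List[float]:
    """Encode a patch by its location on Earth (assumed form, dim divisible by 4)."""
    half = dim // 2
    return (
        sinusoidal_encoding(math.radians(lat_deg), half)
        + sinusoidal_encoding(math.radians(lon_deg), half)
    )


# The top-left patch of any image gets the same absolute encoding...
abs_a = absolute_position_encoding(0, 0, dim=8)
abs_b = absolute_position_encoding(0, 0, dim=8)
# ...while its geographic encoding depends on where the image was acquired.
geo_a = geo_position_encoding(22.3, 39.1, dim=8)
geo_b = geo_position_encoding(40.4, -3.7, dim=8)
```

In this sketch, `abs_a` and `abs_b` are identical while `geo_a` and `geo_b` differ, which is the extra signal a geographic encoding can offer a classifier.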
Possible limitations
- Not stated in the abstract
Abstract
arXiv:2605.28174v1
Foundation models offer a promising route to transferable remote sensing representations, but many current approaches depend on very large pretraining datasets and fixed sensor configurations, limiting their suitability for ecological and environmental applications, where observations often vary across platforms, spatial and spectral resolutions, and available modalities. We introduce FLORO, a multimodal geospatial foundation model designed to learn transferable representations from a small but highly diverse remote sensing corpus. FLORO is pretrained using masked autoencoding on a heterogeneous combination of Sentinel-1, Sentinel-2, SkySAT imagery, elevation, and UAV-derived data. To accommodate sensor variability, FLORO incorporates availability-aware inputs that indicate which spectral bands and auxiliary modalities are present in each sample, enabling a unified input space across heterogeneous sensor configurations. We evaluated FLORO on the PANGAEA benchmark under a frozen-encoder protocol across scene classification, segmentation, and regression tasks. Despite being pretrained on a smaller corpus than competing foundation models, FLORO achieved strong and stable transfer across optical, optical-SAR, and optical-elevation benchmarks spanning medium-resolution satellite, airborne, and ultra-high-resolution UAV imagery. FLORO obtained the second-best average segmentation performance across six PANGAEA benchmarks, trailing only a recently introduced foundation model pretrained on over two orders of magnitude more images, remained competitive on scene classification, and was robust in regression tasks, while qualitative results showed improved preservation of spatial structure in flood, urban, biomass, and canopy-height prediction settings. In a separate controlled experiment on EuroSAT-MS, geo-positional encoding further improved classification relative to absolute positional encoding.
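For readers unfamiliar with masked autoencoding, the pretraining objective named in the abstract, the core idea is to hide a random subset of image patches and score the reconstruction only on the hidden ones. The sketch below shows that generic recipe in plain Python. The 0.75 mask ratio, the fixed seed, and the helper names are assumptions for illustration; the abstract does not state FLORO's masking settings.

```python
import random
from typing import List, Sequence, Tuple


def random_patch_mask(
    num_patches: int, mask_ratio: float, seed: int = 0
) -> Tuple[List[int], List[int]]:
    """Split patch indices into (visible, masked) by random shuffling."""
    if not 0.0 <= mask_ratio < 1.0:
        raise ValueError("mask_ratio must be in [0, 1)")
    rng = random.Random(seed)
    indices = list(range(num_patches))
    rng.shuffle(indices)
    num_masked = int(num_patches * mask_ratio)
    masked = sorted(indices[:num_masked])
    visible = sorted(indices[num_masked:])
    return visible, masked


def masked_mse(
    target: Sequence[Sequence[float]],
    prediction: Sequence[Sequence[float]],
    masked: Sequence[int],
) -> float:
    """Mean squared error computed over the masked patches only."""
    total = 0.0
    count = 0
    for idx in masked:
        for t, p in zip(target[idx], prediction[idx]):
            total += (t - p) ** 2
            count += 1
    return total / count if count else 0.0


# Usage: 16 patches, three quarters hidden from the encoder (ratio is an assumption).
visible, masked = random_patch_mask(num_patches=16, mask_ratio=0.75)
patches = [[float(i), float(i) + 0.5] for i in range(16)]
zeros = [[0.0, 0.0] for _ in range(16)]
loss = masked_mse(patches, zeros, masked)
```

The encoder would see only the `visible` patches, and a decoder would be trained to drive `masked_mse` toward zero on the rest.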