Detailed analysis of production-scale vector database deployments for climate science applications, including system architecture, performance metrics, and research outcomes.
Earth Genome deployed a production vector database system to enable similarity-based retrieval across petabyte-scale satellite imagery archives for detecting illegal mining operations in protected regions.
Data Volume: 2.5 PB
Vector Count: 10⁹+
Query Latency: <50ms
Detection Accuracy: 94%
Organisation: Earth Genome, a non-profit environmental monitoring organisation
Objective: Rapid identification of mining activity patterns across multi-temporal satellite imagery
Technology: Qdrant vector database with ResNet-50 embeddings (d=512)
Illegal mining operations in protected rainforest regions exhibit characteristic spatial signatures in optical satellite imagery: vegetation removal, exposed soil, access roads, and sediment plumes in nearby waterways. Traditional change detection algorithms suffer from high false-positive rates due to natural disturbances (fires, floods) and require manual tuning of spectral thresholds for each sensor and geographic region.
Earth Genome required a system capable of: (1) ingesting multi-temporal Sentinel-2 and Landsat imagery at 10-30m resolution covering 5×10⁶ km² of tropical forest; (2) enabling analysts to query with example mining sites and retrieve visually similar locations; (3) processing new imagery acquisitions within 24 hours of satellite overpass; (4) achieving >90% precision to minimise false alerts requiring ground verification.
System Pipeline:
The system processes 2×10⁴ new tiles per day (Sentinel-2 5-day revisit × coverage area). Embedding generation requires 0.8 GPU-hours per 10⁴ tiles on NVIDIA A100. HNSW index updates occur nightly via incremental insertion, maintaining query latency <50ms despite corpus growth.
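As a back-of-envelope check on the figures above (an illustrative sketch, not production code), the stated throughput and per-tile cost imply a small nightly GPU budget:

```python
# Figures taken from the text: 2x10^4 new tiles/day,
# 0.8 A100 GPU-hours per 10^4 tiles.
TILES_PER_DAY = 2e4
GPU_HOURS_PER_10K_TILES = 0.8

daily_gpu_hours = TILES_PER_DAY / 1e4 * GPU_HOURS_PER_10K_TILES
print(daily_gpu_hours)  # 1.6 GPU-hours/day, comfortably within a nightly batch window
```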
Validation against ground-truth data from field surveys and very-high-resolution commercial imagery (0.5m) demonstrated 94% precision and 96% recall for mining sites >1 hectare. The system reduced analyst workload by 85% compared to manual image interpretation, enabling monitoring of 10× larger geographic area with equivalent staffing.
Domain-Specific Fine-Tuning Critical
Pre-trained ImageNet embeddings achieved only 78% precision. Fine-tuning on 50k domain-specific examples improved precision to 94%, demonstrating the necessity of climate- and Earth-observation-specific training data.
Geographic Sharding Reduces Latency
Initial single-shard deployment exhibited 200ms p95 latency. Geographic sharding (8 regions) with query routing reduced latency to 48ms by limiting search space and improving cache locality.
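The routing idea can be illustrated with a minimal sketch. The shard names, two-dimensional vectors, and brute-force scan below are hypothetical stand-ins for the deployed per-region HNSW indexes:

```python
import math

def cosine(a, b):
    # Cosine similarity between two dense vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

# Hypothetical shard layout: one index per geographic region.
# A brute-force scan stands in for each shard's HNSW index.
shards = {
    "amazon_west": [("site_a", [0.9, 0.1]), ("site_b", [0.8, 0.2])],
    "amazon_east": [("site_c", [0.1, 0.9])],
}

def search(region, query, k=1):
    # Route the query to the single shard covering its region,
    # shrinking the search space (~8x here versus a global index).
    candidates = shards[region]
    scored = sorted(candidates, key=lambda p: cosine(query, p[1]), reverse=True)
    return scored[:k]

print(search("amazon_west", [1.0, 0.0]))  # site_a scores highest
```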
Similarity Threshold Calibration Required
Cosine similarity scores required empirical calibration against a validation set. A threshold of 0.85 balanced precision (minimising false alerts) against recall (detecting emerging sites early).
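This calibration step can be sketched as a threshold sweep over a labelled validation set. The scores and labels below are illustrative toy data, not Earth Genome's:

```python
def precision_recall(scores, labels, threshold):
    # Count hits above the similarity threshold against ground truth.
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and not y)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y)
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Toy validation set: (cosine similarity to query, is_true_mining_site).
scores = [0.95, 0.91, 0.88, 0.84, 0.80, 0.72]
labels = [True, True, True, False, True, False]

for t in (0.70, 0.85, 0.90):
    p, r = precision_recall(scores, labels, t)
    print(f"threshold={t:.2f} precision={p:.2f} recall={r:.2f}")
```

Lower thresholds catch emerging sites earlier at the cost of more false alerts; the sweep makes that trade-off explicit.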
Research Publication
Technical implementation details and evaluation methodology are documented in an Earth Genome technical report (2023).
NASA developed a self-supervised learning system enabling similarity-based retrieval across their multi-petabyte Earth observation archives, facilitating rapid discovery of analogous atmospheric and surface phenomena.
Archive Size: 45 PB
Instruments: 20+
Embedding Dimension: d=768
User Queries (2024): 15k+
NASA's Earth Observing System Data and Information System (EOSDIS) archives 45 PB of data from 20+ satellite instruments spanning atmospheric composition, land surface, ocean, and cryosphere observations. Traditional metadata-based search (temporal, spatial, instrument filters) requires users to know precise data product names and acquisition parameters, creating barriers for interdisciplinary research and serendipitous discovery.
The similarity search tool enables content-based retrieval: researchers upload or select a reference image (e.g., a cyclone, wildfire smoke plume, or sea ice pattern) and retrieve visually similar phenomena across the entire archive, regardless of instrument, time period, or geographic location. This facilitates comparative analysis, analogue identification for forecasting, and discovery of previously unrecognised patterns.
The system employs self-supervised contrastive learning (SimCLR framework) to train a vision transformer (ViT-B/16) encoder without requiring manual labels. Training data comprises 10⁷ image patches extracted from MODIS, VIIRS, and Landsat archives. Positive pairs are generated via temporal augmentation (same location, different times) and spatial augmentation (adjacent tiles), whilst negative pairs are randomly sampled from distant locations.
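The pair-construction logic can be sketched as follows. The patch metadata and grid-adjacency test are simplified assumptions for illustration, not NASA's implementation:

```python
import random

# Toy patch metadata: (patch_id, grid location, acquisition month).
patches = [
    ("p1", (10, 20), "2021-01"),
    ("p2", (10, 20), "2021-06"),  # same location, different time
    ("p3", (10, 21), "2021-01"),  # adjacent tile
    ("p4", (55, 80), "2021-01"),  # distant location
]

def is_adjacent(loc_a, loc_b):
    # Hypothetical adjacency: Manhattan distance of 1 on the tile grid.
    return abs(loc_a[0] - loc_b[0]) + abs(loc_a[1] - loc_b[1]) == 1

def positive_pairs(patches):
    pairs = []
    for i, (ida, la, ta) in enumerate(patches):
        for idb, lb, tb in patches[i + 1:]:
            if la == lb and ta != tb:      # temporal augmentation
                pairs.append((ida, idb))
            elif is_adjacent(la, lb):      # spatial augmentation
                pairs.append((ida, idb))
    return pairs

def negative_pair(anchor, patches, rng, min_dist=10):
    # Negatives: randomly sampled patches far from the anchor.
    la = anchor[1]
    far = [p for p in patches
           if abs(p[1][0] - la[0]) + abs(p[1][1] - la[1]) >= min_dist]
    return (anchor[0], rng.choice(far)[0])

print(positive_pairs(patches))  # [('p1', 'p2'), ('p1', 'p3'), ('p2', 'p3')]
print(negative_pair(patches[0], patches, random.Random(0)))  # ('p1', 'p4')
```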
Model Architecture:
Atmospheric Science
Researchers queried with Hurricane Katrina (2005) infrared imagery to retrieve 150 morphologically similar cyclones from the 2000-2024 archive. Clustering in embedding space identified distinct structural categories (symmetric, asymmetric, eyewall replacement) correlating with intensity change rates.
Air Quality
A query with a reference smoke plume from the 2020 Australian bushfires retrieved similar plumes across 20 years of MODIS data. Temporal analysis revealed an increasing frequency of extreme smoke events in Northern Hemisphere mid-latitudes (2010-2024 vs 2000-2009).
Cryosphere
Similarity search on Arctic sea ice imagery identified recurring polynya (open water) patterns associated with specific atmospheric forcing. Cross-modal queries (optical → radar) enabled validation of ice type classifications across sensor modalities.
Land Use
Queries with deforestation examples from the Amazon retrieved similar patterns in Southeast Asia and Central Africa, enabling rapid assessment of global forest loss hotspots without region-specific algorithm tuning.
Since its public release in 2023, the tool has processed 15,000+ user queries from 2,500+ researchers across 45 countries. User surveys indicate 68% of queries led to the discovery of previously unknown analogous phenomena, and 42% of users reported the tool enabled research questions that would have been infeasible with traditional metadata search.
The system has been integrated into NASA's Earthdata Search interface and is being extended to support multi-modal queries (e.g., query with atmospheric profile to find similar satellite-observed cloud patterns) and temporal sequence retrieval (find similar evolution trajectories for weather systems).
Public Access
The NASA Earth Observation Similarity Search tool is publicly accessible through the Earthdata Dashboard.
A research consortium applied vector embeddings and density-based clustering to the CMIP6 multi-model ensemble, identifying distinct scenario families and improving probabilistic climate projection skill scores.
Models Analysed: 52
Trajectories: 1,248
Scenario Clusters: 7
Skill Improvement: +22%
CMIP6 comprises 52 independent climate models from international modelling centres, each with multiple ensemble members, yielding 1,248 individual projection trajectories (2015-2100) under SSP scenarios. Traditional ensemble analysis treats models as exchangeable, assigning equal weights or simple performance-based weights. However, models exhibit structural dependencies (shared parameterisations, common ancestry) and varying degrees of independence.
The research objective was to: (1) identify coherent scenario families representing genuinely distinct futures; (2) detect outlier models warranting investigation for systematic biases; (3) develop data-driven weighting schemes that account for model interdependencies whilst preserving ensemble diversity; (4) improve probabilistic projection skill relative to equal-weight ensembles.
Each model trajectory (monthly mean surface temperature and precipitation fields, 2015-2100, 1,032 timesteps) was encoded into a fixed-dimension vector via a temporal convolutional network (TCN) trained with triplet loss. Positive pairs comprised different ensemble members from the same model; negative pairs comprised trajectories from different models. The resulting embeddings (d=512) preserve temporal dynamics whilst mapping to a common latent space.
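Triplet sampling under this scheme can be sketched as follows; the trajectory and model identifiers are hypothetical stand-ins for the 1,248 CMIP6 trajectories:

```python
import random

# Toy ensemble: trajectory_id -> source climate model.
trajectories = {
    "t1": "ModelA", "t2": "ModelA",  # two ensemble members of ModelA
    "t3": "ModelB", "t4": "ModelC",
}

def sample_triplet(anchor_id, rng):
    # Positive: another ensemble member of the same model.
    # Negative: a trajectory from a different model.
    anchor_model = trajectories[anchor_id]
    positives = [t for t, m in trajectories.items()
                 if m == anchor_model and t != anchor_id]
    negatives = [t for t, m in trajectories.items() if m != anchor_model]
    return anchor_id, rng.choice(positives), rng.choice(negatives)

a, p, n = sample_triplet("t1", random.Random(0))
assert trajectories[a] == trajectories[p]   # same model
assert trajectories[a] != trajectories[n]   # different model
```

The triplet loss then pulls same-model members together and pushes different models apart in the d=512 latent space.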
Analysis Pipeline:
Low climate sensitivity (ECS 2.5-3.2°C): gradual warming, stable AMOC; 2100: +2.1°C
Mid-range sensitivity (ECS 3.2-4.1°C): ensemble consensus; 2100: +2.8°C
High sensitivity (ECS 4.5-5.8°C): accelerated Arctic amplification; 2100: +3.9°C
Seven distinct clusters were identified, representing scenarios ranging from conservative (low climate sensitivity, gradual change) to high-impact (accelerated warming, potential tipping points). Twenty-three outlier trajectories were flagged for investigation; subsequent analysis revealed parameterisation errors in three models and unrealistic aerosol forcing in two models.
Cluster-aware weighting (inverse within-cluster variance, preserving between-cluster diversity) improved hindcast CRPS by 22% relative to equal-weight ensemble and 15% relative to performance-based weighting. The approach successfully down-weighted outliers whilst maintaining representation of high-sensitivity scenarios critical for risk assessment.
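One plausible reading of the weighting scheme can be sketched on toy 2100-warming values (the cluster names and numbers below are illustrative, not the paper's data): each cluster's total weight is proportional to the inverse of its within-cluster variance, and that weight is split evenly among the cluster's members, so even small high-sensitivity clusters stay represented:

```python
from statistics import pvariance

# Toy clusters of projected 2100 warming (degrees C) per trajectory.
clusters = {
    "low_sensitivity":  [2.0, 2.1, 2.2],
    "mid_range":        [2.7, 2.8, 2.9, 2.8],
    "high_sensitivity": [3.7, 4.1],
}

def cluster_weights(clusters):
    # Cluster-level weight: inverse within-cluster (population) variance.
    inv_var = {c: 1.0 / pvariance(v) for c, v in clusters.items()}
    total = sum(inv_var.values())
    weights = {}
    for c, members in clusters.items():
        share = inv_var[c] / total          # cluster's total weight
        for i, _ in enumerate(members):
            weights[(c, i)] = share / len(members)  # split among members
    return weights

w = cluster_weights(clusters)
assert abs(sum(w.values()) - 1.0) < 1e-9  # weights form a distribution
```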
The clustering analysis revealed that traditional model independence assumptions are violated: 68% of ensemble variance is concentrated in 3 of 7 clusters, indicating structural dependencies. The data-driven weighting scheme has been adopted by regional climate services for generating probabilistic projections, improving calibration of uncertainty estimates for stakeholder decision-making.
The methodology is being extended to CMIP7 preparation, with plans to incorporate process-based constraints (observational emergent relationships) into the embedding space to further improve physical realism and reduce spurious scenario diversity.
Research Publication
Methodology and results were published in Geophysical Research Letters (2024), with open-source code and pre-computed embeddings available.
All three case studies required domain-specific training or fine-tuning. Pre-trained models (ImageNet, CLIP) provided useful initialisation but achieved 15-25% lower performance than climate-specific encoders. Investment in curated training datasets yields substantial returns.
HNSW consistently outperformed IVF-PQ on the recall-latency trade-off when memory constraints permitted. Geographic or temporal sharding improved query performance 3-5× by reducing the search space. Quantisation remains viable for memory-constrained deployments, with acceptable recall degradation.
Learned embeddings lack physical interpretability, complicating error diagnosis and limiting mechanistic insight. Hybrid approaches combining vector similarity with physics-based constraints show promise for improving trustworthiness in operational systems.
Successful adoption required intuitive query interfaces (map-based selection, example uploads) and interpretable result displays (spatial clustering, confidence scores). Technical sophistication must be hidden behind accessible user experiences for broad research community impact.