Detailed analysis of production-scale vector database deployments for climate science applications, including system architecture, performance metrics, and research outcomes.
Earth Genome deployed a production vector database system to enable similarity-based retrieval across petabyte-scale satellite imagery archives for detecting illegal mining operations in protected regions.
Data Volume: 2.5 PB
Vector Count: 10⁹+
Query Latency: <50ms
Detection Accuracy: 94%
Organisation: Earth Genome, a non-profit environmental monitoring organisation
Objective: Rapid identification of mining activity patterns across multi-temporal satellite imagery
Technology: Qdrant vector database with ResNet-50 embeddings (d=512)
Illegal mining operations in protected rainforest regions exhibit characteristic spatial signatures in optical satellite imagery: vegetation removal, exposed soil, access roads, and sediment plumes in nearby waterways. Traditional change detection algorithms suffer from high false-positive rates due to natural disturbances (fires, floods) and require manual tuning of spectral thresholds for each sensor and geographic region.
Earth Genome required a system capable of: (1) ingesting multi-temporal Sentinel-2 and Landsat imagery at 10-30m resolution covering 5×10⁶ km² of tropical forest; (2) enabling analysts to query with example mining sites and retrieve visually similar locations; (3) processing new imagery acquisitions within 24 hours of satellite overpass; (4) achieving >90% precision to minimise false alerts requiring ground verification.
System Pipeline:
The system processes 2×10⁴ new tiles per day (Sentinel-2 5-day revisit × coverage area). Embedding generation requires 0.8 GPU-hours per 10⁴ tiles on NVIDIA A100. HNSW index updates occur nightly via incremental insertion, maintaining query latency <50ms despite corpus growth.
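As a back-of-envelope check on the figures above (an illustrative sketch, not production code), the stated throughput and per-tile cost imply a small nightly GPU budget:

```python
# Figures taken from the text: 2x10^4 new tiles/day,
# 0.8 A100 GPU-hours per 10^4 tiles.
TILES_PER_DAY = 2e4
GPU_HOURS_PER_10K_TILES = 0.8

daily_gpu_hours = TILES_PER_DAY / 1e4 * GPU_HOURS_PER_10K_TILES
print(daily_gpu_hours)  # 1.6 GPU-hours/day, comfortably within a nightly batch window
```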
Validation against ground-truth data from field surveys and very-high-resolution commercial imagery (0.5m) demonstrated 94% precision and 96% recall for mining sites >1 hectare. The system reduced analyst workload by 85% compared to manual image interpretation, enabling monitoring of 10× larger geographic area with equivalent staffing.
Domain-Specific Fine-Tuning Critical
Pre-trained ImageNet embeddings achieved only 78% precision. Fine-tuning on 50k domain-specific examples improved precision to 94%, demonstrating the necessity of climate- and Earth-observation-specific training data.
Geographic Sharding Reduces Latency
Initial single-shard deployment exhibited 200ms p95 latency. Geographic sharding (8 regions) with query routing reduced latency to 48ms by limiting search space and improving cache locality.
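The routing idea can be illustrated with a minimal sketch. The shard names, two-dimensional vectors, and brute-force scan below are hypothetical stand-ins for the deployed per-region HNSW indexes:

```python
import math

def cosine(a, b):
    # Cosine similarity between two dense vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

# Hypothetical shard layout: one index per geographic region.
# A brute-force scan stands in for each shard's HNSW index.
shards = {
    "amazon_west": [("site_a", [0.9, 0.1]), ("site_b", [0.8, 0.2])],
    "amazon_east": [("site_c", [0.1, 0.9])],
}

def search(region, query, k=1):
    # Route the query to the single shard covering its region,
    # shrinking the search space (~8x here versus a global index).
    candidates = shards[region]
    scored = sorted(candidates, key=lambda p: cosine(query, p[1]), reverse=True)
    return scored[:k]

print(search("amazon_west", [1.0, 0.0]))  # site_a scores highest
```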
Similarity Threshold Calibration Required
Cosine similarity scores required empirical calibration against a validation set. A threshold of 0.85 balanced precision (minimising false alerts) against recall (detecting emerging sites early).
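This calibration step can be sketched as a threshold sweep over a labelled validation set. The scores and labels below are illustrative toy data, not Earth Genome's:

```python
def precision_recall(scores, labels, threshold):
    # Count hits above the similarity threshold against ground truth.
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and not y)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y)
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Toy validation set: (cosine similarity to query, is_true_mining_site).
scores = [0.95, 0.91, 0.88, 0.84, 0.80, 0.72]
labels = [True, True, True, False, True, False]

for t in (0.70, 0.85, 0.90):
    p, r = precision_recall(scores, labels, t)
    print(f"threshold={t:.2f} precision={p:.2f} recall={r:.2f}")
```

Lower thresholds catch emerging sites earlier at the cost of more false alerts; the sweep makes that trade-off explicit.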
Research Publication
Technical implementation details and evaluation methodology are documented in an Earth Genome technical report (2023).
NASA developed a self-supervised learning system enabling similarity-based retrieval across their multi-petabyte Earth observation archives, facilitating rapid discovery of analogous atmospheric and surface phenomena.
Archive Size: 45 PB
Instruments: 20+
Embedding Dimension: d=768
User Queries (2024): 15k+
NASA's Earth Observing System Data and Information System (EOSDIS) archives 45 PB of data from 20+ satellite instruments spanning atmospheric composition, land surface, ocean, and cryosphere observations. Traditional metadata-based search (temporal, spatial, instrument filters) requires users to know precise data product names and acquisition parameters, creating barriers for interdisciplinary research and serendipitous discovery.
The similarity search tool enables content-based retrieval: researchers upload or select a reference image (e.g., a cyclone, wildfire smoke plume, or sea ice pattern) and retrieve visually similar phenomena across the entire archive, regardless of instrument, time period, or geographic location. This facilitates comparative analysis, analogue identification for forecasting, and discovery of previously unrecognised patterns.
The system employs self-supervised contrastive learning (SimCLR framework) to train a vision transformer (ViT-B/16) encoder without requiring manual labels. Training data comprises 10⁷ image patches extracted from MODIS, VIIRS, and Landsat archives. Positive pairs are generated via temporal augmentation (same location, different times) and spatial augmentation (adjacent tiles), whilst negative pairs are randomly sampled from distant locations.
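The pair-construction logic can be sketched as follows. The patch metadata and grid-adjacency test are simplified assumptions for illustration, not NASA's implementation:

```python
import random

# Toy patch metadata: (patch_id, grid location, acquisition month).
patches = [
    ("p1", (10, 20), "2021-01"),
    ("p2", (10, 20), "2021-06"),  # same location, different time
    ("p3", (10, 21), "2021-01"),  # adjacent tile
    ("p4", (55, 80), "2021-01"),  # distant location
]

def is_adjacent(loc_a, loc_b):
    # Hypothetical adjacency: Manhattan distance of 1 on the tile grid.
    return abs(loc_a[0] - loc_b[0]) + abs(loc_a[1] - loc_b[1]) == 1

def positive_pairs(patches):
    pairs = []
    for i, (ida, la, ta) in enumerate(patches):
        for idb, lb, tb in patches[i + 1:]:
            if la == lb and ta != tb:      # temporal augmentation
                pairs.append((ida, idb))
            elif is_adjacent(la, lb):      # spatial augmentation
                pairs.append((ida, idb))
    return pairs

def negative_pair(anchor, patches, rng, min_dist=10):
    # Negatives: randomly sampled patches far from the anchor.
    la = anchor[1]
    far = [p for p in patches
           if abs(p[1][0] - la[0]) + abs(p[1][1] - la[1]) >= min_dist]
    return (anchor[0], rng.choice(far)[0])

print(positive_pairs(patches))  # [('p1', 'p2'), ('p1', 'p3'), ('p2', 'p3')]
print(negative_pair(patches[0], patches, random.Random(0)))  # ('p1', 'p4')
```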
Model Architecture:
Atmospheric Science
Researchers queried with Hurricane Katrina (2005) infrared imagery to retrieve 150 morphologically similar cyclones from the 2000-2024 archive. Clustering in embedding space identified distinct structural categories (symmetric, asymmetric, eyewall replacement) correlating with intensity change rates.
Air Quality
A query with a reference smoke plume from the 2020 Australian bushfires retrieved similar plumes across 20 years of MODIS data. Temporal analysis revealed an increasing frequency of extreme smoke events in Northern Hemisphere mid-latitudes (2010-2024 vs 2000-2009).
Cryosphere
Similarity search on Arctic sea ice imagery identified recurring polynya (open water) patterns associated with specific atmospheric forcing. Cross-modal queries (optical → radar) enabled validation of ice type classifications across sensor modalities.
Land Use
Queries with deforestation examples from the Amazon retrieved similar patterns in Southeast Asia and Central Africa, enabling rapid assessment of global forest loss hotspots without region-specific algorithm tuning.
Since its public release in 2023, the tool has processed 15,000+ user queries from 2,500+ researchers across 45 countries. User surveys indicate 68% of queries led to the discovery of previously unknown analogous phenomena, and 42% of users reported the tool enabled research questions that would have been infeasible with traditional metadata search.
The system has been integrated into NASA's Earthdata Search interface and is being extended to support multi-modal queries (e.g., query with atmospheric profile to find similar satellite-observed cloud patterns) and temporal sequence retrieval (find similar evolution trajectories for weather systems).
Public Access
The NASA Earth Observation Similarity Search tool is publicly accessible through the Earthdata Dashboard.
A research consortium applied vector embeddings and density-based clustering to the CMIP6 multi-model ensemble, identifying distinct scenario families and improving probabilistic climate projection skill scores.
Models Analysed: 52
Trajectories: 1,248
Scenario Clusters: 7
Skill Improvement: +22%
CMIP6 comprises 52 independent climate models from international modelling centres, each with multiple ensemble members, yielding 1,248 individual projection trajectories (2015-2100) under SSP scenarios. Traditional ensemble analysis treats models as exchangeable, assigning equal weights or simple performance-based weights. However, models exhibit structural dependencies (shared parameterisations, common ancestry) and varying degrees of independence.
The research objective was to: (1) identify coherent scenario families representing genuinely distinct futures; (2) detect outlier models warranting investigation for systematic biases; (3) develop data-driven weighting schemes that account for model interdependencies whilst preserving ensemble diversity; (4) improve probabilistic projection skill relative to equal-weight ensembles.
Each model trajectory (monthly mean surface temperature and precipitation fields, 2015-2100, 1,032 timesteps) was encoded into a fixed-dimension vector via a temporal convolutional network (TCN) trained with triplet loss. Positive pairs comprised different ensemble members from the same model; negative pairs comprised trajectories from different models. The resulting embeddings (d=512) preserve temporal dynamics whilst mapping to a common latent space.
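Triplet sampling under this scheme can be sketched as follows; the trajectory and model identifiers are hypothetical stand-ins for the 1,248 CMIP6 trajectories:

```python
import random

# Toy ensemble: trajectory_id -> source climate model.
trajectories = {
    "t1": "ModelA", "t2": "ModelA",  # two ensemble members of ModelA
    "t3": "ModelB", "t4": "ModelC",
}

def sample_triplet(anchor_id, rng):
    # Positive: another ensemble member of the same model.
    # Negative: a trajectory from a different model.
    anchor_model = trajectories[anchor_id]
    positives = [t for t, m in trajectories.items()
                 if m == anchor_model and t != anchor_id]
    negatives = [t for t, m in trajectories.items() if m != anchor_model]
    return anchor_id, rng.choice(positives), rng.choice(negatives)

a, p, n = sample_triplet("t1", random.Random(0))
assert trajectories[a] == trajectories[p]   # same model
assert trajectories[a] != trajectories[n]   # different model
```

The triplet loss then pulls same-model members together and pushes different models apart in the d=512 latent space.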
Analysis Pipeline:
Low climate sensitivity (ECS 2.5-3.2°C): gradual warming, stable AMOC; 2100: +2.1°C
Mid-range sensitivity (ECS 3.2-4.1°C): ensemble consensus; 2100: +2.8°C
High sensitivity (ECS 4.5-5.8°C): accelerated Arctic amplification; 2100: +3.9°C
Seven distinct clusters were identified, representing scenarios ranging from conservative (low climate sensitivity, gradual change) to high-impact (accelerated warming, potential tipping points). Twenty-three outlier trajectories were flagged for investigation; subsequent analysis revealed parameterisation errors in three models and unrealistic aerosol forcing in two models.
Cluster-aware weighting (inverse within-cluster variance, preserving between-cluster diversity) improved hindcast CRPS by 22% relative to equal-weight ensemble and 15% relative to performance-based weighting. The approach successfully down-weighted outliers whilst maintaining representation of high-sensitivity scenarios critical for risk assessment.
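One plausible reading of the weighting scheme can be sketched on toy 2100-warming values (the cluster names and numbers below are illustrative, not the paper's data): each cluster's total weight is proportional to the inverse of its within-cluster variance, and that weight is split evenly among the cluster's members, so even small high-sensitivity clusters stay represented:

```python
from statistics import pvariance

# Toy clusters of projected 2100 warming (degrees C) per trajectory.
clusters = {
    "low_sensitivity":  [2.0, 2.1, 2.2],
    "mid_range":        [2.7, 2.8, 2.9, 2.8],
    "high_sensitivity": [3.7, 4.1],
}

def cluster_weights(clusters):
    # Cluster-level weight: inverse within-cluster (population) variance.
    inv_var = {c: 1.0 / pvariance(v) for c, v in clusters.items()}
    total = sum(inv_var.values())
    weights = {}
    for c, members in clusters.items():
        share = inv_var[c] / total          # cluster's total weight
        for i, _ in enumerate(members):
            weights[(c, i)] = share / len(members)  # split among members
    return weights

w = cluster_weights(clusters)
assert abs(sum(w.values()) - 1.0) < 1e-9  # weights form a distribution
```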
The clustering analysis revealed that traditional model independence assumptions are violated: 68% of ensemble variance is concentrated in 3 of 7 clusters, indicating structural dependencies. The data-driven weighting scheme has been adopted by regional climate services for generating probabilistic projections, improving calibration of uncertainty estimates for stakeholder decision-making.
The methodology is being extended to CMIP7 preparation, with plans to incorporate process-based constraints (observational emergent relationships) into the embedding space to further improve physical realism and reduce spurious scenario diversity.
Research Publication
Methodology and results were published in Geophysical Research Letters (2024), with open-source code and pre-computed embeddings available.
All three case studies required domain-specific training or fine-tuning. Pre-trained models (ImageNet, CLIP) provided useful initialisation but achieved 15-25% lower performance than climate-specific encoders. Investment in curated training datasets yields substantial returns.
HNSW consistently outperformed IVF-PQ on the recall-latency trade-off when memory constraints permitted. Geographic or temporal sharding improved query performance 3-5× by reducing the search space. Quantisation remains viable for memory-constrained deployments, with acceptable recall degradation.
Learned embeddings lack physical interpretability, complicating error diagnosis and limiting mechanistic insight. Hybrid approaches combining vector similarity with physics-based constraints show promise for improving trustworthiness in operational systems.
Successful adoption required intuitive query interfaces (map-based selection, example uploads) and interpretable result displays (spatial clustering, confidence scores). Technical sophistication must be hidden behind accessible user experiences for broad research community impact.