Technical Reference

Glossary

Comprehensive reference of technical terms, algorithms, and concepts in vector databases and climate modelling.

Approximate Nearest Neighbour (ANN)

Algorithm

Search algorithm that finds k vectors approximately closest to a query vector in high-dimensional space, sacrificing perfect accuracy for computational efficiency. Achieves sub-linear query time in practice (roughly O(log n) for graph-based indexes such as HNSW), compared with O(nd) for exhaustive search over n vectors of dimension d.

Related Terms:

HNSW, IVF-PQ, k-NN

CMIP6

Climate

Coupled Model Intercomparison Project Phase 6. International framework coordinating climate model experiments from 50+ modelling centres worldwide. Provides multi-model ensemble projections (2015-2100) under Shared Socioeconomic Pathway (SSP) scenarios for IPCC assessment reports.

Related Terms:

Ensemble, GCM, SSP

Contrastive Learning

Machine Learning

Self-supervised training paradigm that learns representations by maximising similarity between positive pairs (augmented views of same instance) whilst minimising similarity to negative pairs (different instances). Common frameworks include SimCLR, MoCo, and CLIP.

Related Terms:

Embedding, Self-Supervised Learning, SimCLR

Cosine Similarity

Distance Metric

cos(θ) = (v₁ · v₂) / (||v₁|| ||v₂||)

Measure of similarity between two non-zero vectors based on the cosine of the angle between them. Scale-invariant and bounded in [-1, 1], with 1 indicating identical direction. Preferred for normalised embeddings.
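The formula above can be sketched in a few lines of standard-library Python. The function name `cosine_similarity` and the sample vectors are illustrative assumptions, not part of any particular library.

```python
import math

def cosine_similarity(v1, v2):
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(v1, v2))
    norm1 = math.sqrt(sum(a * a for a in v1))
    norm2 = math.sqrt(sum(b * b for b in v2))
    if norm1 == 0.0 or norm2 == 0.0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm1 * norm2)

# Illustrative vectors (assumed values).
same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])  # scaled copy
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])      # right angle
opposite = cosine_similarity([1.0, 0.0], [-3.0, 0.0])       # reversed direction
```

The scaled copy scores the same as an identical vector would, which is the scale invariance the definition refers to.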

Related Terms:

Euclidean Distance, Inner Product, Distance Metric

DBSCAN

Algorithm

Density-Based Spatial Clustering of Applications with Noise. Clustering algorithm that groups points with many nearby neighbours whilst marking low-density points as outliers. Requires two parameters: ε (neighbourhood radius) and min_samples (minimum number of points within ε for a point to count as a core point).
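A minimal sketch of the procedure, assuming Euclidean distance and a brute-force neighbourhood search (O(n²), suitable only for small inputs). The names `dbscan` and `NOISE` and the sample points are illustrative; production code would normally use a library implementation with a spatial index.

```python
import math

NOISE = -1  # label for low-density points (outliers)

def dbscan(points, eps, min_samples):
    """Return one cluster label per point; NOISE marks outliers."""
    def neighbours(i):
        # Indices of all points within eps of point i (including i itself).
        return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_samples:
            labels[i] = NOISE  # may be relabelled later as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: reachable but not core
            if labels[j] is not None:
                continue
            labels[j] = cluster
            reachable = neighbours(j)
            if len(reachable) >= min_samples:  # j is a core point: keep expanding
                queue.extend(reachable)
    return labels

# Illustrative data: two dense groups and one isolated point.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (50, 50)]
labels = dbscan(points, eps=1.5, min_samples=3)
```

With these assumed settings the two dense groups become clusters 0 and 1, and the isolated point is labelled as noise.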

Related Terms:

HDBSCAN, Clustering, Outlier Detection

Downscaling

Climate

Process of deriving high-resolution climate information from coarse-resolution global climate model (GCM) output. Dynamical downscaling uses regional climate models (RCMs); statistical downscaling uses empirical relationships between large-scale and local variables.

Related Terms:

GCM, RCM, Statistical Downscaling

Embedding

Machine Learning

Dense vector representation of data in continuous space ℝᵈ that preserves semantic similarity. Generated by encoder networks f_θ: X → ℝᵈ trained to map raw inputs (images, text, time-series) to fixed-dimension vectors where similar inputs have small distances.

Related Terms:

Encoder, Latent Space, Dimensionality

Encoder

Machine Learning

Neural network component that transforms high-dimensional input data into lower-dimensional embedding vectors. Common architectures include convolutional neural networks (CNNs) for images, recurrent networks (RNNs/LSTMs) for sequences, and transformers for both.

Related Terms:

Embedding, CNN, Transformer

Ensemble

Climate

Collection of multiple climate model simulations used to characterise uncertainty. Single-model ensembles vary initial conditions; multi-model ensembles (MMEs) combine outputs from different models. Ensemble spread quantifies projection uncertainty.
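As a rough illustration of ensemble mean and spread, the sketch below summarises a handful of ensemble members with the standard library. The warming values are made-up placeholders, not model output, and using the sample standard deviation as the spread measure is an assumption; ranges or percentiles are also common.

```python
import statistics

def ensemble_summary(members):
    """Return (mean, spread) where spread is the sample standard deviation."""
    return statistics.mean(members), statistics.stdev(members)

# Hypothetical end-of-century warming (°C) from five ensemble members.
warming = [2.1, 2.6, 3.0, 3.4, 3.9]
ensemble_mean, ensemble_spread = ensemble_summary(warming)
```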

Related Terms:

CMIP6, Uncertainty Quantification, Multi-Model Ensemble

Euclidean Distance (L₂)

Distance Metric

||v₁ - v₂||₂ = √(Σᵢ(v₁ᵢ - v₂ᵢ)²)

Straight-line distance between two points in Euclidean space. For vectors v₁, v₂ ∈ ℝᵈ, computed as the square root of the sum of squared component differences. Sensitive to scale and magnitude; for vectors with independent components of similar variance, typical distances grow in proportion to √d in high dimensions.

Related Terms:

Cosine Similarity, Distance Metric, Manhattan Distance

FAISS

Implementation

Facebook AI Similarity Search. Open-source library by Meta for efficient similarity search and clustering of dense vectors. Supports multiple index types (Flat, IVF, HNSW, PQ) with CPU and GPU implementations. Optimised for billion-scale datasets.

Related Terms:

HNSW, IVF-PQ, Vector Database

GCM (Global Climate Model)

Climate

Numerical model representing physical processes in atmosphere, ocean, land surface, and cryosphere. Solves discretised equations on 3D grid with typical horizontal resolution 50-100 km. Also called Earth System Model (ESM) when including biogeochemical cycles.

Related Terms:

CMIP6, Downscaling, RCM

HDBSCAN

Algorithm

Hierarchical Density-Based Spatial Clustering of Applications with Noise. Extension of DBSCAN that builds a hierarchy of clusters and extracts stable clusters across multiple density thresholds. More robust to varying density and requires fewer parameters than DBSCAN.

Related Terms:

DBSCAN, Clustering, Density-Based

HNSW (Hierarchical Navigable Small World)

Algorithm

Graph-based approximate nearest neighbour search algorithm. Constructs a multi-layer graph in which each layer is a proximity graph and higher layers contain progressively fewer nodes. Query routing starts at the sparse top layer and refines through the denser lower layers, giving approximately O(log n) search complexity.

Related Terms:

ANN, Graph Index, k-NN

Inner Product

Distance Metric

v₁ · v₂ = Σᵢ(v₁ᵢ × v₂ᵢ)

Dot product of two vectors: sum of products of corresponding components. Computationally efficient but sensitive to vector magnitudes. For normalised vectors, equivalent to cosine similarity. Used in maximum inner product search (MIPS).

Related Terms:

Cosine Similarity, Distance Metric, Normalisation

IVF-PQ (Inverted File with Product Quantisation)

Algorithm

Two-stage approximate nearest neighbour index. Coarse quantisation via k-means clustering (IVF) narrows the search space; product quantisation (PQ) compresses vectors into compact codes. Typically reduces memory 32-64× relative to uncompressed float32 vectors, at the cost of higher query latency and lower recall than HNSW.

Related Terms:

Product Quantisation, k-means, Quantisation

k-ANN (k-Approximate Nearest Neighbours)

Algorithm

Problem of finding k vectors approximately closest to query vector, allowing bounded error in distance computation. Approximate methods trade perfect accuracy (recall < 1.0) for speed, enabling sub-linear query time in high-dimensional spaces.

Related Terms:

ANN, k-NN, Recall

k-NN (k-Nearest Neighbours)

Algorithm

Problem of finding k vectors exactly closest to query vector according to specified distance metric. Exact k-NN requires O(nd) distance computations for n vectors in d dimensions, becoming intractable for large n. Approximate methods (k-ANN) enable practical solutions.
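The exhaustive search described here can be sketched as follows, assuming Euclidean distance as the metric. The function name `knn_exact` and the sample vectors are illustrative. Each query computes one distance per stored vector, which is where the O(nd) cost comes from.

```python
import heapq
import math

def knn_exact(query, vectors, k):
    """Exhaustive k-NN: returns [(distance, index), ...] sorted nearest first."""
    scored = ((math.dist(query, v), i) for i, v in enumerate(vectors))
    return heapq.nsmallest(k, scored)

# Illustrative 2-D corpus (assumed values).
vectors = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0), (0.5, 0.0)]
nearest = knn_exact((0.0, 0.1), vectors, k=2)
nearest_ids = [index for _, index in nearest]
```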

Related Terms:

k-ANN, Exhaustive Search, Distance Metric

Latency

Performance

Time required to process a single query, typically measured in milliseconds. Key performance metric for vector databases. Sub-10ms latency enables interactive applications; 50-100ms acceptable for batch processing. Increases with corpus size and dimensionality.

Related Terms:

Throughput, Query Time, Performance

Latent Space

Machine Learning

Abstract vector space, usually of lower dimension than the raw input, in which data is represented as embeddings. Points close in latent space correspond to semantically similar inputs. Learned by encoder networks to capture meaningful structure from raw data.

Related Terms:

Embedding, Encoder, Representation Learning

Milvus

Implementation

Open-source distributed vector database designed for billion-scale similarity search. Supports multiple index types (HNSW, IVF, DiskANN), horizontal scaling across nodes, GPU acceleration, and hybrid search (vector + metadata filtering). Cloud-native architecture.

Related Terms:

Vector Database, HNSW, Distributed System

Multi-Model Ensemble (MME)

Climate

Ensemble combining outputs from multiple independent climate models to characterise structural uncertainty. CMIP6 MME includes 50+ models. Ensemble mean often outperforms individual models; spread quantifies inter-model disagreement.

Related Terms:

CMIP6, Ensemble, Uncertainty Quantification

Normalisation

Preprocessing

v_norm = v / ||v||

Transformation of vectors to unit length (||v|| = 1) by dividing each component by the vector magnitude. After normalisation, the inner product of two vectors equals their cosine similarity, so cosine search can be served by an inner product index. Removes magnitude information, preserving only directional relationships.
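A short sketch of L₂ normalisation, showing that the inner product of two unit vectors equals the cosine similarity of the originals. The helper names `normalise` and `dot` and the sample vectors are illustrative assumptions.

```python
import math

def normalise(v):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot normalise a zero vector")
    return [x / norm for x in v]

def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

# Illustrative vectors: the cosine of the angle between them is 3/5 = 0.6.
v1, v2 = [3.0, 4.0], [1.0, 0.0]
u1, u2 = normalise(v1), normalise(v2)
unit_length = math.sqrt(dot(u1, u1))
inner_product_of_units = dot(u1, u2)
```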

Related Terms:

Cosine Similarity, Inner Product, Preprocessing

Outlier Detection

Analysis

Identification of data points significantly different from majority of dataset. In vector databases, outliers have low density in embedding space (few nearby neighbours). Density-based methods (DBSCAN, HDBSCAN) or distance-based thresholds used for detection.

Related Terms:

DBSCAN, Anomaly Detection, Clustering

pgvector

Implementation

PostgreSQL extension adding vector similarity search capabilities to a relational database. Supports exact and approximate (HNSW, IVFFlat) search with multiple distance metrics. Enables hybrid queries combining vector similarity with SQL predicates. Version 0.5.0+ includes HNSW support.

Related Terms:

Vector Database, HNSW, PostgreSQL

Product Quantisation (PQ)

Algorithm

Lossy compression technique for high-dimensional vectors. Splits vector into m subvectors, quantises each independently via k-means, stores centroid indices. Reduces memory from 4d bytes (float32) to m×log₂(k) bits whilst enabling approximate distance computation.
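The memory arithmetic can be checked with a small worked example. The settings d = 128, m = 8, k = 256 are assumed for illustration, and rounding log₂(k) up to whole bits is an assumption for values of k that are not powers of two. Codebook storage, which is shared across all vectors, is ignored.

```python
import math

def pq_code_bits(m, k):
    """Bits per PQ code: m centroid indices of ceil(log2(k)) bits each."""
    return m * math.ceil(math.log2(k))

def pq_compression_ratio(d, m, k):
    """Ratio of raw float32 storage (4d bytes) to PQ code size."""
    raw_bits = 4 * d * 8
    return raw_bits / pq_code_bits(m, k)

# Assumed settings: 128-dimensional vectors, 8 subvectors, 256 centroids each.
code_bits = pq_code_bits(m=8, k=256)               # 8 indices x 8 bits = 64 bits
ratio = pq_compression_ratio(d=128, m=8, k=256)    # 4096 bits / 64 bits = 64.0
```

Under these assumptions a 512-byte vector shrinks to an 8-byte code, a 64× reduction.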

Related Terms:

IVF-PQ, Quantisation, Compression

Qdrant

Implementation

Open-source vector database written in Rust, optimised for memory efficiency and query performance. Supports HNSW indexing, payload filtering, quantisation, and distributed deployment. Designed for production workloads with strong consistency guarantees.

Related Terms:

Vector Database, HNSW, Rust

Quantisation

Algorithm

Process of mapping continuous values to discrete set, reducing memory and computation. Scalar quantisation reduces precision (float32 → int8); product quantisation clusters subvectors. Introduces approximation error but enables larger-scale deployments.
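Scalar quantisation can be sketched as below, assuming a symmetric scheme with a single scale factor per vector. This is one of several possible schemes, and the function names and sample values are illustrative. Because each value is rounded to the nearest step, the reconstruction error per component is at most half a step.

```python
def quantise_int8(values):
    """Symmetric scalar quantisation of floats to integer codes in [-127, 127]."""
    max_abs = max(abs(x) for x in values)
    scale = max_abs / 127 if max_abs > 0 else 1.0  # size of one quantisation step
    codes = [round(x / scale) for x in values]
    return codes, scale

def dequantise(codes, scale):
    """Approximate reconstruction of the original floats."""
    return [c * scale for c in codes]

# Illustrative vector (assumed values).
vector = [0.5, -1.0, 0.25, 0.0]
codes, scale = quantise_int8(vector)
restored = dequantise(codes, scale)
max_error = max(abs(a - b) for a, b in zip(vector, restored))
```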

Related Terms:

Product Quantisation, Compression, Memory Optimisation

Query Time

Performance

Computational time required to retrieve k nearest neighbours for a single query vector. Includes distance computations, index traversal, and result ranking. Scales with corpus size (n), dimensionality (d), and k. Target: <10ms for interactive systems.

Related Terms:

Latency, Performance, Complexity

RCM (Regional Climate Model)

Climate

High-resolution climate model covering limited geographic domain (e.g., Europe, North America). Driven at boundaries by GCM output. Typical resolution 10-50 km enables representation of mesoscale processes (orography, land-sea contrasts) absent in GCMs.

Related Terms:

GCM, Downscaling, CORDEX

Recall

Performance

Fraction of true nearest neighbours successfully retrieved by approximate search algorithm. Recall@k measures proportion of true k-NN found in approximate k-NN results. Typical targets: 0.95-0.99. Higher recall requires more computation (larger ef_search in HNSW).
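Recall@k can be computed by comparing an approximate result list against the exact one. In this sketch the function name `recall_at_k` and the ID lists are hypothetical; in practice the exact list would come from an exhaustive search over the same corpus.

```python
def recall_at_k(true_ids, approx_ids, k):
    """Fraction of the true top-k neighbours present in the approximate top-k."""
    truth = set(true_ids[:k])
    if not truth:
        raise ValueError("need at least one true neighbour")
    found = truth.intersection(approx_ids[:k])
    return len(found) / len(truth)

# Hypothetical result lists: the approximate search missed ID 4.
exact = [7, 2, 9, 4, 1]
approx = [7, 9, 2, 8, 1]
recall = recall_at_k(exact, approx, k=5)  # 4 of 5 true neighbours found
```

Order within the top-k does not affect the score; only membership counts.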

Related Terms:

Precision, Performance, k-ANN

Self-Supervised Learning

Machine Learning

Training paradigm that learns representations from unlabelled data by solving pretext tasks. Contrastive methods (SimCLR, MoCo) learn by distinguishing positive pairs from negatives. Enables training on large datasets without manual annotation.

Related Terms:

Contrastive Learning, SimCLR, Unsupervised Learning

Sharding

System Design

Horizontal partitioning of vector corpus across multiple nodes or indices. Geographic sharding divides by spatial region; temporal sharding by time period; hash sharding by vector ID. Reduces per-shard search space, improving query latency at cost of coordination overhead.
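Hash sharding and the coordination step can be sketched as follows. The names `shard_for` and `scatter_gather` are hypothetical, and a real system would run the per-shard searches in parallel across nodes. A cryptographic digest is used rather than Python's built-in `hash`, which is salted per process for strings and so would not give stable shard assignments.

```python
import hashlib

def shard_for(vector_id, num_shards):
    """Stable hash sharding: map a vector ID to a shard index."""
    digest = hashlib.sha256(str(vector_id).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards

def scatter_gather(per_shard_results, k):
    """Merge per-shard top-k lists of (distance, vector_id) into a global top-k."""
    merged = [hit for hits in per_shard_results for hit in hits]
    return sorted(merged)[:k]

# Hypothetical per-shard results for one query.
shard_hits = [
    [(0.1, "a"), (0.4, "b")],  # results from shard 0
    [(0.2, "c"), (0.9, "d")],  # results from shard 1
]
top2 = scatter_gather(shard_hits, k=2)
assigned = shard_for("vec-42", num_shards=4)
```

The merge step is the coordination overhead mentioned above: every shard must return its own top-k before the global top-k is known.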

Related Terms:

Distributed System, Scalability, Partitioning

SimCLR

Machine Learning

Simple Framework for Contrastive Learning of Visual Representations. Self-supervised method that learns embeddings by maximising agreement between differently augmented views of same image. Uses NT-Xent loss with large batch sizes and strong augmentations.

Related Terms:

Contrastive Learning, Self-Supervised Learning, MoCo

Similarity Search

Algorithm

Retrieval of items most similar to query according to distance metric in embedding space. Fundamental operation in vector databases. Applications include recommendation systems, duplicate detection, and analogue retrieval in climate science.

Related Terms:

k-NN, ANN, Vector Database

SSP (Shared Socioeconomic Pathway)

Climate

Scenario framework describing plausible future socioeconomic development and greenhouse gas emissions. Scenarios range from SSP1-1.9 (low emissions) to SSP5-8.5 (high emissions). Used in CMIP6 to drive climate model projections for the 21st century.

Related Terms:

CMIP6, RCP, Scenario

Statistical Downscaling

Climate

Empirical method deriving high-resolution climate variables from coarse GCM output using statistical relationships. Includes regression methods, weather typing, and machine learning approaches. Computationally cheaper than dynamical downscaling but assumes stationarity of relationships.

Related Terms:

Downscaling, GCM, RCM

Transformer

Machine Learning

Neural network architecture based on self-attention mechanism. Processes sequences in parallel (vs sequential RNNs), enabling efficient training on long sequences. Vision Transformer (ViT) applies to images via patch embeddings. Foundation of modern language and vision models.

Related Terms:

Attention, ViT, Encoder

Uncertainty Quantification

Climate

Characterisation of confidence in climate projections. Sources include internal variability (initial condition uncertainty), model uncertainty (structural differences), and scenario uncertainty (future emissions). Ensemble spread provides a practical, though incomplete, estimate of total uncertainty.

Related Terms:

Ensemble, Multi-Model Ensemble, Confidence Interval

Vector Database

System

Specialised database optimised for storing, indexing, and querying high-dimensional vector embeddings. Supports approximate nearest neighbour search with sub-linear complexity. Examples include Milvus, Qdrant, Pinecone, Weaviate, and pgvector.

Related Terms:

HNSW, ANN, Embedding

ViT (Vision Transformer)

Machine Learning

Transformer architecture adapted for image processing. Splits image into fixed-size patches, linearly embeds each patch, adds positional encoding, and processes with transformer encoder. Achieves state-of-the-art performance on image classification when trained on large datasets.

Related Terms:

Transformer, CNN, Encoder