TerraFlow Architecture Overview

TerraFlow is a config-driven geospatial pipeline for agricultural suitability modeling. Given a raster (land-cover GeoTIFF) and a climate CSV, it produces scored, location-stamped cell features with full provenance tracking.

Data Flow

YAML config → PipelineConfig (Pydantic v2)
DataCatalog  ← ingest.py (metadata only; no pixel reads)
run_fingerprint ← core/run_identity.py (SHA256 of config + inputs)
geo.py  → clip raster to ROI bbox; reproject to EPSG:4326
climate.py → ClimateInterpolator (linear | kriging | IDW)
model.py → suitability_score() + suitability_label()
pipeline.py → write artifacts to output_dir/runs/<fingerprint>/

Module Map

Module                Responsibility
cli.py                Typer CLI: run, sensitivity, validate, export subcommands
config.py             Pydantic v2 models: PipelineConfig, ModelParams, ROI, SensitivityConfig, ValidationConfig, ExportConfig
ingest.py             RasterLayer, ClimateLayer, DataCatalog; metadata only, no pixel I/O
geo.py                ROI clipping, CRS reprojection; invariant: all output in EPSG:4326
climate.py            ClimateInterpolator: linear / kriging / IDW; LOOCV variogram selection
model.py              suitability_score(), suitability_label(), suitability_score_array()
pipeline.py           Orchestration: fingerprint → load → clip → interpolate → score → write
sensitivity.py        Sobol' / Morris via SALib; triggered by terraflow sensitivity
validation.py         Spatial block CV, Cohen's kappa, Moran's I; triggered by terraflow validate
export.py             H3 hexagonal re-index; triggered by terraflow export --format h3
core/run_identity.py  Deterministic SHA256 fingerprint of canonicalized config + input files

Output Artifacts

All artifacts land under <output_dir>/runs/<run_fingerprint>/:

File                     Schema  Contents
features.parquet         v1      run_id, cell_id, lat, lon, v_index, mean_temp, total_rain, score, label (+ kriging std + CI columns when configured)
manifest.json            v1      Config snapshot, input fingerprints, code version, git SHA
report.json              v1      Coverage fractions, raster/climate stats, score stats, timings; kriging_loocv, kriging_diagnostics, uncertainty, and validation blocks appended when the relevant features are enabled
results.csv              -       Same data as features.parquet in CSV format (backward compatibility)
sensitivity_report.json  -       Sobol' and/or Morris indices per ModelParams weight (written by terraflow sensitivity)
h3_resolution_N.parquet  -       H3-indexed features at resolution N (written by terraflow export --format h3)

Key Invariants

  • CRS: always EPSG:4326 in output. geo.py reprojects any input that differs.
  • Determinism: identical inputs always produce the same run_fingerprint. Cell sampling is seeded from the fingerprint SHA256 so that the same config yields the same cell set across independent runs.
  • Cache: if all three required artifacts exist in the run directory, the pipeline returns immediately without re-running.
  • Coverage: runs fail if no valid raster cells are found in the ROI; coverage fractions are always reported in report.json.
  • Atomicity: all artifacts are written with a write-to-temp + rename pattern to prevent partial writes.
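
The cache and atomicity invariants can be illustrated with a minimal, standard-library-only sketch. It is not TerraFlow's actual code: the helper names `atomic_write_bytes` and `run_is_cached` are hypothetical, and because the text above does not name the three required artifacts, the caller supplies them explicitly.

```python
import os
import tempfile
from pathlib import Path
from typing import Iterable


def atomic_write_bytes(path: Path, data: bytes) -> None:
    """Write to a temp file next to `path`, then rename it into place.

    Hypothetical helper illustrating the write-to-temp + rename pattern.
    """
    path.parent.mkdir(parents=True, exist_ok=True)
    # Create the temp file in the target directory so the final rename
    # stays on one filesystem.
    fd, tmp_name = tempfile.mkstemp(
        dir=path.parent, prefix=path.name + ".", suffix=".tmp"
    )
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())  # push bytes to disk before renaming
        os.replace(tmp_name, path)  # readers see the old file or the new one
    except BaseException:
        # A failed write must not leave a partial artifact behind.
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise


def run_is_cached(run_dir: Path, required: Iterable[str]) -> bool:
    """Return True only if every required artifact exists in `run_dir`.

    Which files count as required is an assumption left to the caller.
    """
    return all((run_dir / name).is_file() for name in required)
```

With this shape, an orchestrator would check `run_is_cached(...)` first and return early on a hit. Because each artifact appears only once fully written, a crashed run is unlikely to leave a truncated file that passes the check.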

Reproducibility Model

The run_fingerprint is a SHA256 over:

  1. Canonicalized (key-sorted) YAML config
  2. A stable ROI geometry hash (bbox dict or GeoJSON file hash)
  3. SHA256 fingerprints of all input files (raster + CSV)

This makes each run directory immutable. Re-running with identical inputs is a no-op; changing any input or config parameter produces a new directory.
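
One way to picture the fingerprint is the sketch below. It makes several assumptions: the config is already parsed into a plain dict, the ROI is a bbox dict, and the three components are fed into the digest in the order listed. The function names, the JSON canonicalization, and the seed derivation in `sampling_rng` are all illustrative and are not taken from core/run_identity.py.

```python
import hashlib
import json
import random
from pathlib import Path
from typing import List


def sha256_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large rasters are not read into memory at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def canonical_json(obj) -> bytes:
    """Key-sorted, whitespace-free JSON: equal dicts give equal bytes."""
    return json.dumps(obj, sort_keys=True, separators=(",", ":")).encode("utf-8")


def run_fingerprint(config: dict, roi_bbox: dict, input_paths: List[Path]) -> str:
    """Combine config, ROI hash, and input-file hashes into one SHA256 hex digest.

    Assumption: callers pass input_paths in a fixed order (raster, then CSV),
    since the order affects the result.
    """
    digest = hashlib.sha256()
    digest.update(canonical_json(config))
    digest.update(hashlib.sha256(canonical_json(roi_bbox)).digest())
    for path in input_paths:
        digest.update(sha256_file(path).encode("ascii"))
    return digest.hexdigest()


def sampling_rng(fingerprint: str) -> random.Random:
    """Derive a deterministic RNG from the fingerprint (illustrative scheme)."""
    return random.Random(int(fingerprint[:16], 16))
```

Because the config is serialized with sorted keys, reordering keys in the YAML leaves the fingerprint unchanged. Editing a value or any byte of an input file changes it, which is what maps each distinct run to its own directory.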

Geospatial Correctness

  • ROI bounds given in any CRS are reprojected to the raster CRS before windowing: all four corners are transformed and their axis-aligned bounding box is used, which handles non-linear projections.
  • Degenerate windows (NaN dimensions after reprojection) raise ValueError with diagnostic information.
  • NoData cells are masked and excluded from sampling; coverage fractions are reported.
  • Climate station coordinates are validated against [-90, 90] / [-180, 180] ranges at load time.
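
The corner-based reprojection and the coordinate checks above can be sketched as follows. This is an illustration, not geo.py itself. The names `reproject_bounds`, `validate_station`, and `lonlat_to_web_mercator` are hypothetical, and the point transform is passed in as a plain callable so the example stays dependency-free. In practice something like pyproj's `Transformer.from_crs(src, dst, always_xy=True).transform` could fill that role.

```python
import math
from typing import Callable, Tuple

Bounds = Tuple[float, float, float, float]  # (xmin, ymin, xmax, ymax)
PointTransform = Callable[[float, float], Tuple[float, float]]


def reproject_bounds(bounds: Bounds, transform: PointTransform) -> Bounds:
    """Transform all four corners, then take their axis-aligned bounding box."""
    xmin, ymin, xmax, ymax = bounds
    corners = [(xmin, ymin), (xmin, ymax), (xmax, ymin), (xmax, ymax)]
    projected = [transform(x, y) for x, y in corners]
    # Reject NaN/inf before min/max, whose results with NaN depend on order.
    if not all(math.isfinite(v) for point in projected for v in point):
        raise ValueError(
            f"degenerate ROI after reprojection: {bounds} -> {projected}"
        )
    xs, ys = zip(*projected)
    return (min(xs), min(ys), max(xs), max(ys))


def validate_station(lat: float, lon: float) -> None:
    """Raise ValueError for coordinates outside [-90, 90] / [-180, 180]."""
    if not -90.0 <= lat <= 90.0:  # also rejects NaN
        raise ValueError(f"latitude out of range: {lat!r}")
    if not -180.0 <= lon <= 180.0:
        raise ValueError(f"longitude out of range: {lon!r}")


def lonlat_to_web_mercator(lon: float, lat: float) -> Tuple[float, float]:
    """Spherical Web Mercator forward projection, used here only as a demo."""
    radius = 6378137.0
    x = radius * math.radians(lon)
    y = radius * math.log(math.tan(math.pi / 4 + math.radians(lat) / 2))
    return x, y


if __name__ == "__main__":
    print(reproject_bounds((-10.0, 40.0, 10.0, 50.0), lonlat_to_web_mercator))
    validate_station(lat=45.0, lon=7.5)  # passes silently
```

Taking the bounding box of the transformed corners matters because a rectangle in one CRS is generally not a rectangle in another. Using only two opposite corners can under-cover the ROI.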

Non-goals

  • Remote dataset downloads or cloud-hosted rasters.
  • Real-time or streaming data ingestion.
  • GUI or web application layer.
  • General-purpose raster processing (use rioxarray or rasterstats instead).