TerraFlow Architecture Overview
TerraFlow is a config-driven geospatial pipeline for agricultural suitability modeling. Given a raster (land-cover GeoTIFF) and a climate CSV, it produces scored, location-stamped cell features with full provenance tracking.
Data Flow
YAML config → PipelineConfig (Pydantic v2)
↓
DataCatalog ← ingest.py (metadata only; no pixel reads)
↓
run_fingerprint ← core/run_identity.py (SHA256 of config + inputs)
↓
geo.py → clip raster to ROI bbox; reproject to EPSG:4326
↓
climate.py → ClimateInterpolator (linear | kriging | IDW)
↓
model.py → suitability_score() + suitability_label()
↓
pipeline.py → write artifacts to output_dir/runs/<fingerprint>/
Module Map
| Module | Responsibility |
|---|---|
| `cli.py` | Typer CLI: `run`, `sensitivity`, `validate`, `export` subcommands |
| `config.py` | Pydantic v2 models: `PipelineConfig`, `ModelParams`, `ROI`, `SensitivityConfig`, `ValidationConfig`, `ExportConfig` |
| `ingest.py` | `RasterLayer`, `ClimateLayer`, `DataCatalog`; metadata only, no pixel I/O |
| `geo.py` | ROI clipping, CRS reprojection; invariant: all output in EPSG:4326 |
| `climate.py` | `ClimateInterpolator`: linear / kriging / IDW; LOOCV variogram selection |
| `model.py` | `suitability_score()`, `suitability_label()`, `suitability_score_array()` |
| `pipeline.py` | Orchestration: fingerprint → load → clip → interpolate → score → write |
| `sensitivity.py` | Sobol' / Morris via SALib; triggered by `terraflow sensitivity` |
| `validation.py` | Spatial block CV, Cohen's kappa, Moran's I; triggered by `terraflow validate` |
| `export.py` | H3 hexagonal re-index; triggered by `terraflow export --format h3` |
| `core/run_identity.py` | Deterministic SHA256 fingerprint of canonicalized config + input files |
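To make the IDW option listed for `climate.py` concrete, here is a minimal sketch of inverse-distance weighting in plain Python. It is not the `ClimateInterpolator` implementation: the function name `idw_interpolate`, the `power` parameter, and the use of Euclidean distance in degrees are all assumptions made for illustration.

```python
import math
from typing import Sequence


def idw_interpolate(
    stations: Sequence[tuple[float, float, float]],
    lat: float,
    lon: float,
    power: float = 2.0,
) -> float:
    """Inverse-distance-weighted estimate at (lat, lon).

    stations holds (lat, lon, value) triples. Distances are Euclidean in
    degrees, a simplification that is only reasonable for small ROIs.
    """
    if not stations:
        raise ValueError("at least one station is required")
    weighted_sum = 0.0
    weight_total = 0.0
    for s_lat, s_lon, value in stations:
        dist = math.hypot(lat - s_lat, lon - s_lon)
        if dist == 0.0:
            # The query point coincides with a station: return its value
            # directly instead of dividing by zero.
            return value
        weight = 1.0 / dist**power
        weighted_sum += weight * value
        weight_total += weight
    return weighted_sum / weight_total


# Two hypothetical stations on the equator, 2 degrees apart.
stations = [(0.0, 0.0, 10.0), (0.0, 2.0, 20.0)]
midpoint = idw_interpolate(stations, 0.0, 1.0)   # equal weights
near_first = idw_interpolate(stations, 0.0, 0.5)  # pulled toward 10.0
```

At the midpoint both stations get the same weight, so the estimate is their plain average; moving the query point toward one station pulls the estimate toward that station's value.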
Output Artifacts
All artifacts land under <output_dir>/runs/<run_fingerprint>/:
| File | Schema | Contents |
|---|---|---|
| `features.parquet` | v1 | `run_id`, `cell_id`, `lat`, `lon`, `v_index`, `mean_temp`, `total_rain`, `score`, `label` (+ kriging std + CI columns when configured) |
| `manifest.json` | v1 | Config snapshot, input fingerprints, code version, git SHA |
| `report.json` | v1 | Coverage fractions, raster/climate stats, score stats, timings; `kriging_loocv`, `kriging_diagnostics`, `uncertainty`, and `validation` blocks appended when the relevant features are enabled |
| `results.csv` | n/a | Same data as `features.parquet` in CSV format (backward compatibility) |
| `sensitivity_report.json` | n/a | Sobol' and/or Morris indices per `ModelParams` weight (written by `terraflow sensitivity`) |
| `h3_resolution_N.parquet` | n/a | H3-indexed features at resolution N (written by `terraflow export --format h3`) |
Key Invariants
- CRS: always EPSG:4326 in output. `geo.py` reprojects any input that differs.
- Determinism: identical inputs always produce the same `run_fingerprint`. Cell sampling is seeded from the fingerprint SHA256 so that the same config yields the same cell set across independent runs.
- Cache: if all three required artifacts exist in the run directory, the pipeline returns immediately without re-running.
- Coverage: runs fail if no valid raster cells are found in the ROI; coverage fractions are always reported in `report.json`.
- Atomicity: all artifacts are written with a write-to-temp + rename pattern to prevent partial writes.
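The write-to-temp + rename pattern behind the atomicity invariant can be sketched with the standard library alone. This is an illustration under stated assumptions, not the code in `pipeline.py`: the helper name `atomic_write_text` and the `example-fingerprint` directory are hypothetical.

```python
import json
import os
import tempfile
from pathlib import Path


def atomic_write_text(path: Path, text: str) -> None:
    """Write text to path so that readers never observe a partial file."""
    path.parent.mkdir(parents=True, exist_ok=True)
    # The temp file is created in the destination directory so the final
    # rename stays on one filesystem, which os.replace needs to be atomic.
    fd, tmp_name = tempfile.mkstemp(
        dir=path.parent, prefix=path.name, suffix=".tmp"
    )
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            handle.write(text)
            handle.flush()
            os.fsync(handle.fileno())  # push bytes to disk before renaming
        os.replace(tmp_name, path)  # atomically swap the file into place
    except BaseException:
        # If anything failed before the rename, remove the leftover temp file.
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise


# Hypothetical run directory; a real one would be named by run_fingerprint.
run_dir = Path(tempfile.mkdtemp()) / "runs" / "example-fingerprint"
atomic_write_text(run_dir / "report.json", json.dumps({"status": "ok"}))
```

A crash before `os.replace` leaves at most a stray `.tmp` file, never a truncated `report.json`. This also makes an "artifact exists" check a safe basis for the cache invariant.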
Reproducibility Model
The run_fingerprint is a SHA256 over:
1. Canonicalized (key-sorted) YAML config
2. A stable ROI geometry hash (bbox dict or GeoJSON file hash)
3. SHA256 fingerprints of all input files (raster + CSV)
This makes each run directory immutable. Re-running with identical inputs is a no-op; changing any input or config parameter produces a new directory.
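The three fingerprint inputs listed above, together with fingerprint-seeded sampling, can be sketched as follows. This is a hedged illustration, not the contents of `core/run_identity.py`. It makes several assumptions: the config is already parsed into a `dict`, JSON with sorted keys serves as the canonical form, the config keys (`max_cells`, `roi`) are hypothetical, and the seed comes from the first 16 hex digits of the fingerprint.

```python
import hashlib
import json
import random


def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def canonical_json(obj) -> bytes:
    # Key-sorted, whitespace-free JSON: one byte string per logical content.
    return json.dumps(obj, sort_keys=True, separators=(",", ":")).encode("utf-8")


def run_fingerprint(config: dict, roi_hash: str, input_hashes: dict[str, str]) -> str:
    """SHA256 over canonical config + ROI hash + input file hashes."""
    payload = {"config": config, "roi": roi_hash, "inputs": input_hashes}
    return sha256_hex(canonical_json(payload))


def sample_cells(fingerprint: str, n_cells: int, k: int) -> list[int]:
    """Pick k cell indices with an RNG seeded from the fingerprint."""
    rng = random.Random(int(fingerprint[:16], 16))
    return sorted(rng.sample(range(n_cells), k))


# Same content, different key order: should hash identically.
config_a = {"max_cells": 5, "roi": {"xmin": 0, "ymax": 1}}
config_b = {"roi": {"ymax": 1, "xmin": 0}, "max_cells": 5}
roi_hash = sha256_hex(canonical_json({"xmin": 0, "ymin": 0, "xmax": 1, "ymax": 1}))
inputs = {
    "raster": sha256_hex(b"raster bytes"),  # stand-in for the GeoTIFF file hash
    "climate": sha256_hex(b"csv bytes"),    # stand-in for the CSV file hash
}

fp_a = run_fingerprint(config_a, roi_hash, inputs)
fp_b = run_fingerprint(config_b, roi_hash, inputs)
fp_changed = run_fingerprint({**config_a, "max_cells": 6}, roi_hash, inputs)
cells_a = sample_cells(fp_a, n_cells=100, k=5)
cells_b = sample_cells(fp_b, n_cells=100, k=5)
```

Here `fp_a` and `fp_b` match because key order is normalized away, while `fp_changed` differs and would map to a new run directory. The two sampled cell sets are identical because both RNGs start from the same fingerprint-derived seed.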
Geospatial Correctness
- ROI bounds in any CRS are reprojected to raster CRS before windowing (all four corners, then axis-aligned bounding box to handle non-linear projections).
- Degenerate windows (NaN dimensions after reprojection) raise `ValueError` with diagnostic information.
- NoData cells are masked and excluded from sampling; coverage fractions are reported.
- Climate station coordinates are validated against `[-90, 90]` / `[-180, 180]` ranges at load time.
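The four-corner reprojection and the degenerate-window check described above can be sketched without any geospatial dependency by passing the coordinate transform in as a callable. The names `reproject_bounds`, `validate_station`, and `toy_sinusoidal` are hypothetical, and the toy projection merely stands in for a real CRS transform. The sketch also rejects infinite values, which is slightly broader than the NaN check the text describes.

```python
import math
from typing import Callable

Bounds = tuple[float, float, float, float]  # (xmin, ymin, xmax, ymax)


def reproject_bounds(
    bounds: Bounds,
    transform: Callable[[float, float], tuple[float, float]],
) -> Bounds:
    """Transform all four corners, then take the axis-aligned bounding box."""
    xmin, ymin, xmax, ymax = bounds
    corners = [(xmin, ymin), (xmin, ymax), (xmax, ymin), (xmax, ymax)]
    projected = [transform(x, y) for x, y in corners]
    xs = [p[0] for p in projected]
    ys = [p[1] for p in projected]
    if not all(math.isfinite(v) for v in xs + ys):
        raise ValueError(
            f"degenerate window after reprojection: bounds={bounds}, "
            f"corners={projected}"
        )
    return (min(xs), min(ys), max(xs), max(ys))


def validate_station(lat: float, lon: float) -> None:
    """Reject station coordinates outside the valid lat/lon ranges."""
    if not -90.0 <= lat <= 90.0:
        raise ValueError(f"latitude out of range [-90, 90]: {lat}")
    if not -180.0 <= lon <= 180.0:
        raise ValueError(f"longitude out of range [-180, 180]: {lon}")


def toy_sinusoidal(lon: float, lat: float) -> tuple[float, float]:
    # Non-linear toy projection: x shrinks with latitude, y is unchanged.
    return (lon * math.cos(math.radians(lat)), lat)


window = reproject_bounds((10.0, 0.0, 20.0, 60.0), toy_sinusoidal)
```

With this projection the lower-left and upper-right corners both map to an x value of about 10, so a two-corner approach would yield a window of near-zero width. Using all four corners recovers the full x extent, from about 5 to 20.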
Non-goals
- Remote dataset downloads or cloud-hosted rasters.
- Real-time or streaming data ingestion.
- GUI or web application layer.
- General-purpose raster processing (use `rioxarray` or `rasterstats` instead).