
Reproducibility Guarantees

TerraFlow is designed so that a researcher can cite a specific run_fingerprint and reviewers, collaborators, or auditors can regenerate byte-identical outputs from the same inputs and configuration. This page documents exactly what that guarantee covers, what it does not cover, and how to cite a run.

What the run fingerprint includes

A run_fingerprint is a deterministic SHA-256 hex digest computed from:

  1. The entire YAML configuration, canonicalised as JSON (keys sorted, booleans/floats normalised). This means every field — raster_path, raster_band, max_cells, every model_params weight, the complete climate block (including interpolation_method and variogram_mode), the roi block, and any optional sensitivity / validation / export config — contributes to the fingerprint.
  2. The SHA-256 content hash of the input raster (GeoTIFF bytes).
  3. The SHA-256 content hash of the climate CSV (file bytes).
  4. The optional reference CSV hash for terraflow validate, when present.

The fingerprint is the directory name under <output_dir>/runs/ where all artifacts land. Identical inputs and config ⇒ identical fingerprint ⇒ the cache hit path returns the previously computed features.parquet, manifest.json, and report.json without recomputation.
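The four ingredients above can be sketched as a small function. This is an illustrative reconstruction, not TerraFlow's actual source: the function names and the way the parts are joined before the final digest are assumptions; only the ingredients and the canonicalisation rules come from this page.

```python
import hashlib
import json


def file_sha256(path):
    """Content hash only: mtime and absolute path never enter the digest."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def run_fingerprint(config, raster_path, climate_path, reference_path=None):
    # Canonicalise the YAML-derived config as JSON: sorted keys, no
    # whitespace. json.dumps normalises booleans and floats along the way.
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    parts = [canonical, file_sha256(raster_path), file_sha256(climate_path)]
    if reference_path is not None:  # optional reference CSV for `validate`
        parts.append(file_sha256(reference_path))
    return hashlib.sha256("\n".join(parts).encode()).hexdigest()
```

Because the config is canonicalised before hashing, reordering keys in the YAML does not change the fingerprint, while changing any value does.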

What the run fingerprint excludes (by design)

These are intentionally not part of the fingerprint:

  • File mtimes. Copying a file across machines rewrites the mtime; only the content hash is load-bearing.
  • Absolute paths. Two users with inputs at different paths but the same content get the same fingerprint.
  • Wall-clock time. manifest.json records created_at_utc for audit, but it does not feed back into the hash.
  • Host metadata. Hostname, username, git SHA of the working tree are recorded in manifest.json for provenance but never influence the hash.
  • Installed package versions. See the limits section below — this is a known non-guarantee, not an oversight.

What is strongly reproducible (bit-identical)

Under the same Python version, same numpy / scipy / pykrige wheels, and same OS/architecture:

  • The sampled set of ROI cells (rng.choice is seeded from the fingerprint).
  • The per-cell lat, lon, v_index, mean_temp, total_rain, score, and label values.
  • The per-cell Monte Carlo confidence-interval columns (score_ci_low, score_ci_high) when uncertainty_samples > 0.
  • All Sobol' and Morris indices from terraflow sensitivity.
  • All spatial-block CV, kappa, and Moran's-I values from terraflow validate.

The sampling path is exercised across the max_cells == n_valid_cells and max_cells > n_valid_cells boundaries by the regression tests in tests/test_determinism.py::TestMaxCellsBoundary.
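The seeded-sampling behaviour can be illustrated as follows. This is a sketch, not TerraFlow's code: deriving the seed from the first 16 hex digits of the fingerprint is an assumption; the page only states that rng.choice is seeded from the fingerprint.

```python
import numpy as np


def sample_cells(fingerprint, n_valid_cells, max_cells):
    # Derive a deterministic seed from the fingerprint (assumed slice).
    seed = int(fingerprint[:16], 16)
    rng = np.random.default_rng(seed)
    # Covers both boundaries: max_cells above or equal to n_valid_cells
    # simply selects every valid cell.
    n = min(max_cells, n_valid_cells)
    return np.sort(rng.choice(n_valid_cells, size=n, replace=False))


cells_a = sample_cells("a3f9" * 16, 10_000, 500)
cells_b = sample_cells("a3f9" * 16, 10_000, 500)
# Identical fingerprint, identical cell set, on any machine running the
# same numpy build.
```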

Known sources of non-determinism and their limits

Reproducibility is a floor, not a ceiling. These are the documented departures:

Variogram fitting (pykrige extended mode)

TerraFlow's extended variogram_mode fits nested variogram families using scipy.optimize.curve_fit, which invokes the Levenberg–Marquardt solver. curve_fit is deterministic given identical inputs and initial guesses, but across different scipy versions the solver's convergence path can differ, shifting the fitted variogram parameters (psill, range_, nugget) by a few ULPs. Standard-mode variograms (spherical / exponential / Gaussian) are selected by LOOCV RMSE and are not affected.
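A minimal example of the kind of curve_fit call involved, using the textbook spherical variogram form; TerraFlow's nested families are more elaborate, and the function and parameter names here are illustrative, but the solver sensitivity is the same.

```python
import numpy as np
from scipy.optimize import curve_fit


def spherical(h, psill, range_, nugget):
    """Textbook spherical variogram: rises to nugget + psill at h = range_."""
    h = np.asarray(h, dtype=float)
    inside = nugget + psill * (1.5 * h / range_ - 0.5 * (h / range_) ** 3)
    return np.where(h < range_, inside, nugget + psill)


lags = np.linspace(0.1, 50, 40)
gamma = spherical(lags, psill=2.0, range_=30.0, nugget=0.5)
gamma_obs = gamma + 0.01 * np.sin(lags)  # small deterministic perturbation

# No bounds, so curve_fit uses the Levenberg-Marquardt solver ('lm').
params, _ = curve_fit(spherical, lags, gamma_obs, p0=[1.0, 20.0, 0.1])
# params approximates (psill, range_, nugget); the low-order bits of the fit
# can drift across scipy versions as the LM convergence path changes.
```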

Delaunay triangulation tie-breaking (scipy)

Linear and nearest-neighbour interpolation via scipy.interpolate.griddata both rely on scipy.spatial.Delaunay under the hood. When four or more stations lie on a common circle, the triangulation is ambiguous and the specific triangulation chosen depends on the underlying qhull version. In practice this manifests only with degenerate station layouts; real-world networks rarely trigger it.
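The smallest degenerate layout makes the ambiguity concrete: four stations on the unit circle form a square, and qhull must pick one of the two diagonals to split it into triangles.

```python
import numpy as np
from scipy.spatial import Delaunay

# Four cocircular "stations": a unit square's vertices all lie on one circle.
square = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]])
tri = Delaunay(square)

# A convex quadrilateral always splits into exactly two triangles, but which
# diagonal qhull chooses (0-2 vs 1-3) is the version-dependent tie-break.
# The union of the two triangles, and hence any value interpolated strictly
# inside a triangle, is the same either way; only points near the chosen
# diagonal can see a different linear interpolant.
```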

Floating-point summation order

Across OS / BLAS combinations (Accelerate on macOS, OpenBLAS on Linux, MKL on Intel), numpy's internal summation order for large reductions can differ, so results diverge in the last bit. Score values will match to ~1e-12 across platforms, but are not bit-identical.
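The effect is easy to reproduce on a single machine by changing the reduction order by hand, a stand-in for what different BLAS backends do internally:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(100_000)

pairwise = float(np.sum(x))  # numpy's pairwise (tree-shaped) reduction
sequential = 0.0
for v in x.tolist():         # naive left-to-right accumulation
    sequential += v

# The two results agree to high precision but typically not bit-for-bit;
# reductions routed through a different BLAS drift by a similar amount,
# which is why cross-platform scores match to ~1e-12 rather than exactly.
```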

Cache invalidation on schema bump

Running an older TerraFlow against a newer features.parquet (or vice versa) is handled: the pipeline reads the embedded terraflow_schema_version on cache hit and invalidates the cache, logging a WARNING, when the version does not match the current FEATURES_SCHEMA_VERSION. See tests/test_pipeline.py::TestCacheSchemaVersionInvalidation.
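The invalidation rule can be sketched as below. The constant name FEATURES_SCHEMA_VERSION and the metadata key terraflow_schema_version come from this page; the function name, the version value, and the exact metadata shape are assumptions.

```python
import logging

FEATURES_SCHEMA_VERSION = 3  # assumed current version, for illustration
log = logging.getLogger("terraflow")


def cache_is_valid(cached_metadata):
    """Return True only if the cached parquet's schema version matches."""
    cached = cached_metadata.get("terraflow_schema_version")
    if cached != FEATURES_SCHEMA_VERSION:
        log.warning(
            "features.parquet schema version %s != current %s; "
            "invalidating cache and recomputing",
            cached, FEATURES_SCHEMA_VERSION,
        )
        return False
    return True
```

A missing version key (a parquet written before versioning existed) falls through the same mismatch branch and forces recomputation.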

Station-coordinate deduplication

If the climate CSV contains duplicate lat/lon rows (common with aggregated NOAA summaries), TerraFlow averages them at ClimateInterpolator construction. The merge is deterministic and bit-identical, but it does mean the underlying input CSV and the effective station set used for kriging are different objects. The averaging step is logged at INFO level with the before/after station counts.
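The merge described above amounts to a group-by-coordinates average. A minimal pandas sketch, with column names assumed to match the per-cell outputs listed earlier on this page:

```python
import pandas as pd

# Toy climate CSV with a duplicated station at (40.0, -3.7).
stations = pd.DataFrame({
    "lat":        [40.0, 40.0, 41.5],
    "lon":        [-3.7, -3.7, -2.0],
    "mean_temp":  [14.0, 16.0, 12.0],
    "total_rain": [400.0, 420.0, 600.0],
})

before = len(stations)
# Deterministic, bit-identical merge: exact-duplicate coordinates collapse
# to one effective station whose values are the column means.
merged = stations.groupby(["lat", "lon"], as_index=False).mean()
after = len(merged)
# before=3, after=2; the duplicate pair averages to mean_temp 15.0,
# total_rain 410.0.
```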

How to cite a specific run

For a publication, cite the run the way you would cite a software version:

Results produced by TerraFlow v0.2.2
(https://pypi.org/project/terraflow-agro/0.2.2),
run_fingerprint=<hex>, inputs sha256:<raster>, <climate>.

All three fields — the package version, the fingerprint, and the input hashes — are recorded in the manifest.json of every run, so the raw manifest.json payload can be pasted verbatim into supplementary materials.

Reproducibility check-list for reviewers

  1. Install the exact version: pip install terraflow-agro==0.2.2.
  2. Recompute the input hashes: sha256sum <raster.tif> <climate.csv>.
  3. Point the config at those paths and run terraflow run -c cfg.yml.
  4. Verify that the resulting directory name under runs/ matches the published run_fingerprint.
  5. Open manifest.json and confirm the input_fingerprints.sha256 values match your step-2 hashes.
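Steps 2, 4, and 5 can be automated with a short helper. The field names run_fingerprint and input_fingerprints.sha256 match the ones this page mentions, but treat the exact manifest layout (here assumed to be a list of hex digests) as an assumption and adjust to the real file.

```python
import hashlib
import json


def sha256_of(path):
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def verify_run(manifest_path, published_fingerprint, raster_path, climate_path):
    """True iff the manifest matches the cited fingerprint and input hashes."""
    with open(manifest_path) as f:
        manifest = json.load(f)
    recomputed = {sha256_of(raster_path), sha256_of(climate_path)}
    # Assumed layout: input_fingerprints.sha256 is a list of hex digests.
    recorded = set(manifest["input_fingerprints"]["sha256"])
    return (manifest["run_fingerprint"] == published_fingerprint
            and recomputed == recorded)
```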

If the fingerprint and both input hashes match, the artifacts are byte-identical to the cited run (within the known floating-point limits above).