Run Identity

Every pipeline execution is identified by a deterministic run_fingerprint: a base64url-encoded SHA-256 digest over three components (config, ROI, and input files):

run_fingerprint = base64url( sha256({
    config: sha256( canonical_config_json ),
    roi:    sha256( normalized_geometry_WKB ),
    inputs: [ { sha256, size_bytes } … ]   ← sorted, mtime excluded
}) )
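
The formula above can be sketched in Python as follows. This is a minimal illustration, not the pipeline's actual code: the function names are hypothetical, and three details are assumptions the formula does not spell out (component digests are hex strings, the outer mapping is hashed through its canonical JSON form, and base64 padding is stripped).

```python
import base64
import hashlib
import json


def sha256_hex(data: bytes) -> str:
    """Hex SHA-256 of raw bytes (hex encoding is an assumption)."""
    return hashlib.sha256(data).hexdigest()


def run_fingerprint(config_json: str, roi_wkb: bytes, inputs: list[dict]) -> str:
    """Combine the three components into one base64url fingerprint."""
    # Sort by (sha256, size_bytes) so input order in the config is irrelevant.
    ordered = sorted(inputs, key=lambda i: (i["sha256"], i["size_bytes"]))
    payload = {
        "config": sha256_hex(config_json.encode("utf-8")),
        "roi": sha256_hex(roi_wkb),
        # Only sha256 and size_bytes are kept; mtime never enters the hash.
        "inputs": [
            {"sha256": i["sha256"], "size_bytes": i["size_bytes"]} for i in ordered
        ],
    }
    # Assumption: the outer mapping is hashed via canonical JSON.
    blob = json.dumps(payload, sort_keys=True, separators=(",", ":")).encode("utf-8")
    digest = hashlib.sha256(blob).digest()
    # Assumption: '=' padding is stripped to keep the directory name clean.
    return base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii")
```

Under these assumptions the result is a 43-character string that is safe to use as a directory name.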

Why mtime is excluded

File modification timestamps (mtime) are intentionally excluded from the fingerprint hash. Including them would change the fingerprint after a filesystem copy, CI clone, or archive extraction even when the file contents are identical, defeating the purpose of content-based identity. mtime is recorded in manifest.json for human-readable provenance tracing only.

Components

1. Canonical config

The raw YAML config dict is serialised to JSON with sorted keys and minimal separators (no extra whitespace). This ensures the hash is stable regardless of key insertion order in the source YAML.

json.dumps(config_dict, sort_keys=True, separators=(",", ":"))
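
To illustrate why sorted keys matter, the sketch below hashes two dicts that differ only in key insertion order. The helper name config_hash and the UTF-8 encoding step are assumptions made for the example; the sample values are made up.

```python
import hashlib
import json


def config_hash(config_dict: dict) -> str:
    """SHA-256 of the canonical JSON form of a config dict."""
    canonical = json.dumps(config_dict, sort_keys=True, separators=(",", ":"))
    # Assumption: the canonical string is encoded as UTF-8 before hashing.
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Same content, different key insertion order (values are illustrative).
a = {"max_cells": 500, "roi": {"type": "bbox"}}
b = {"roi": {"type": "bbox"}, "max_cells": 500}

print(config_hash(a) == config_hash(b))  # True
```

Note that sort_keys=True also sorts the keys of nested mappings, so ordering differences at any depth are neutralised.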

2. ROI geometry hash

The region of interest is canonicalised using Shapely:

1. Repair invalid geometries (buffer(0)).
2. Snap coordinates to a 1e-7° grid (set_precision).
3. Normalise vertex order (normalize).
4. Serialise to 2D WKB (coordinates already snapped to 7 decimal places in step 2).
5. SHA-256 the WKB bytes.

This means the same geographic polygon expressed in different vertex orders produces the same hash.
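
The canonicalisation steps can be sketched with the Shapely 2.x API as below. The function name roi_hash is hypothetical, and pinning the WKB byte order with byte_order=1 (little-endian) is an assumption added so the bytes do not depend on the host machine; the source does not say how byte order is handled.

```python
import hashlib

import shapely
from shapely.geometry import Polygon


def roi_hash(geom) -> str:
    """Canonicalise a geometry and return the hex SHA-256 of its WKB."""
    geom = geom.buffer(0)                               # repair invalid geometry
    geom = shapely.set_precision(geom, grid_size=1e-7)  # snap to a 1e-7 degree grid
    geom = geom.normalize()                             # canonical vertex order
    # Assumption: fixed little-endian byte order for cross-machine stability.
    wkb = shapely.to_wkb(geom, output_dimension=2, byte_order=1)
    return hashlib.sha256(wkb).hexdigest()


# The same square, written with a different start vertex and winding.
square = Polygon([(0, 0), (1, 0), (1, 1), (0, 1)])
rotated = Polygon([(1, 1), (0, 1), (0, 0), (1, 0)])
reversed_ring = Polygon([(0, 1), (1, 1), (1, 0), (0, 0)])

print(roi_hash(square) == roi_hash(rotated) == roi_hash(reversed_ring))
```

The three polygons describe the same ring, so their hashes are expected to agree after normalisation.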

Supported ROI forms in the YAML config:

| YAML form | Behaviour |
| --- | --- |
| `roi.type: bbox` with `xmin/ymin/xmax/ymax` | Converts bbox to Shapely `box`, then hashes |
| `roi.path: path/to/roi.geojson` | Loads GeoJSON Polygon / MultiPolygon / Feature / FeatureCollection and hashes |

3. Input file fingerprints

Each input file (raster, climate CSV) is read in 8 MiB chunks, and its SHA-256 digest and size in bytes are recorded. The resulting records are sorted by (sha256, size_bytes) before hashing, so the order of inputs in the config does not affect the fingerprint.
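
A minimal sketch of chunked hashing and order-independent sorting, using only the standard library. The function names fingerprint_file and fingerprint_inputs are hypothetical; the record keys follow the fingerprint formula.

```python
import hashlib
from pathlib import Path

CHUNK_SIZE = 8 * 1024 * 1024  # 8 MiB, as described above


def fingerprint_file(path: Path) -> dict:
    """Stream a file through SHA-256 without loading it fully into memory."""
    digest = hashlib.sha256()
    size = 0
    with path.open("rb") as fh:
        while chunk := fh.read(CHUNK_SIZE):
            digest.update(chunk)
            size += len(chunk)
    # mtime is deliberately not read here: only content and size matter.
    return {"sha256": digest.hexdigest(), "size_bytes": size}


def fingerprint_inputs(paths) -> list[dict]:
    """Fingerprint every input and sort so config order is irrelevant."""
    records = [fingerprint_file(Path(p)) for p in paths]
    return sorted(records, key=lambda r: (r["sha256"], r["size_bytes"]))
```

Because the records contain no path or timestamp, renaming or copying an input leaves its record unchanged.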

Run directory

The run directory is created at:

<output_dir>/runs/<run_fingerprint>/

If all three required artifacts (features.parquet, manifest.json, report.json) already exist in that directory when the pipeline is invoked with the same inputs, the pipeline returns the cached result without re-scoring (no-op rerun).
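
The cache check described above might look like the following sketch. The helper names run_dir_for and is_cached are hypothetical; only the directory layout and the three artifact names come from this page.

```python
from pathlib import Path

REQUIRED_ARTIFACTS = ("features.parquet", "manifest.json", "report.json")


def run_dir_for(output_dir: Path, run_fingerprint: str) -> Path:
    """Build <output_dir>/runs/<run_fingerprint>/."""
    return Path(output_dir) / "runs" / run_fingerprint


def is_cached(run_dir: Path) -> bool:
    """True only when every required artifact already exists as a file."""
    return all((run_dir / name).is_file() for name in REQUIRED_ARTIFACTS)
```

A caller would compute the fingerprint first, then skip scoring when is_cached(run_dir_for(output_dir, fingerprint)) is true. This sketch checks only for existence; whether the pipeline also validates artifact contents is not stated here.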

Reproducibility guarantees

| Claim | Mechanism |
| --- | --- |
| Same config + same ROI + same input files → same fingerprint | Content-based SHA-256, mtime excluded |
| Same fingerprint → same run directory path | Deterministic base64url encoding |
| Same run directory → same `features.parquet` schema | Schema version frozen in Parquet metadata |
| Different input content → different fingerprint | SHA-256 collision probability negligible |
| Filesystem copy / CI clone → same fingerprint | mtime excluded from hash |

Limitations

- The fingerprint covers only the declared input files (raster, climate CSV, ROI GeoJSON). Files that a pipeline step reads but the config does not list are not tracked.
- When random cell sampling applies (max_cells < n_valid_cells), the selected cell set may differ between runs even with the same fingerprint, because random.sample draws from Python's default PRNG, which is not seeded from the fingerprint. Seeding the PRNG via the config, which would make cell selection fully deterministic, is planned for a future release (tracked in ROADMAP).
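
As an illustration of what seeded sampling could look like, the sketch below draws cells from a private random.Random instance. This is not the planned implementation: the function name sample_cells and the seed parameter are hypothetical, and where the seed would come from (a config key, for example) is left open.

```python
import random


def sample_cells(cells: list, max_cells: int, seed: int) -> list:
    """Return at most max_cells cells, chosen reproducibly for a given seed."""
    if max_cells >= len(cells):
        # No sampling needed: every valid cell is kept.
        return list(cells)
    # A private generator avoids touching the global PRNG state.
    rng = random.Random(seed)
    return rng.sample(cells, max_cells)
```

With a fixed seed and the same cell list in the same order, repeated calls return the same selection.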