Architecture Boundaries

TerraFlow splits the pipeline into two primary layers:

flowchart TD
    subgraph Ingest["Ingest Layer"]
        A[File System]
        B[Cloud Storage]
        C[APIs]
        A & B & C --> D[Data Loaders]
    end

    D --> E[Validated Data]

    subgraph Core["Core Layer"]
        E --> F[Configuration Validation]
        F --> G[ROI Clipping]
        G --> H[Climate Aggregation]
        H --> I[Suitability Scoring]
        I --> J[Artifact Generation]
    end

    J --> K[results.csv]
    J --> L[fingerprint.json]
    J --> M[results.html]

    style Ingest fill:#00b0ff,stroke:#0091ea,color:#fff
    style Core fill:#2d8a55,stroke:#1e5c3a,color:#fff
    style E fill:#40a86e,stroke:#2d6a4f,color:#fff

Ingest layer

Ingest-layer details will be finalized later.

Core layer

The core layer performs these steps in order:

1. Validates configuration.
2. Clips raster data to the ROI.
3. Aggregates climate metrics.
4. Computes suitability scores and labels.
5. Writes run artifacts.

The core layer should not contain any file-system discovery or remote-fetch logic; it relies on the ingest layer to provide all data.
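As a minimal sketch of this rule, a core-layer function can accept only in-memory values and never a path or URL. The names `ValidatedData` and `score_cells`, and the toy scoring formula, are hypothetical illustrations and not TerraFlow's actual API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ValidatedData:
    """Hypothetical container handed from ingest to core: values only, no paths or URLs."""
    cell_ids: tuple[str, ...]
    mean_temp_c: tuple[float, ...]


def score_cells(data: ValidatedData, max_temp_c: float) -> dict[str, float]:
    """Toy suitability score (an assumption): 1.0 at 0 degrees C, falling linearly to 0.0 at max_temp_c."""
    scores = {}
    for cell_id, temp in zip(data.cell_ids, data.mean_temp_c):
        fraction = 1.0 - temp / max_temp_c
        # Clamp to [0, 1] so out-of-range temperatures still give a valid score.
        scores[cell_id] = round(min(1.0, max(0.0, fraction)), 3)
    return scores


# The caller (standing in for the ingest layer) builds the data; the core only computes.
data = ValidatedData(cell_ids=("a", "b"), mean_temp_c=(10.0, 40.0))
scores = score_cells(data, max_temp_c=40.0)
```

Because `score_cells` never opens a file or a socket, any I/O failure can only originate on the ingest side of the boundary.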

Why the boundary matters

Keeping ingestion and core computation separate has three benefits:

Deterministic & Testable

Pipeline logic remains deterministic and testable without filesystem dependencies.
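A small sketch of what this makes possible: a core step can be tested with values built in memory, with no temporary files or network mocks. The function `aggregate_mean` is a hypothetical stand-in for a climate aggregation step, not TerraFlow's real implementation.

```python
def aggregate_mean(values):
    """Hypothetical core step: mean of already-loaded climate values."""
    if not values:
        raise ValueError("no values to aggregate")
    return sum(values) / len(values)


def test_aggregate_mean_is_deterministic():
    values = [12.0, 14.0, 16.0]  # built in memory; no files, no network
    first = aggregate_mean(values)
    second = aggregate_mean(values)
    # Same input, same output, on every run and every machine.
    assert first == second == 14.0


test_aggregate_mean_is_deterministic()
```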

Extensible Data Sources

Future data sources (e.g., cloud buckets) can be added without rewriting scoring logic.
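One way to picture this, assuming a hypothetical `DataLoader` interface on the ingest side: each source implements the same `load` method, and the core-side function stays untouched when a new source is added. All class and function names here are illustrative.

```python
from typing import Protocol


class DataLoader(Protocol):
    """Hypothetical ingest interface: every source returns the same in-memory shape."""

    def load(self) -> list[float]: ...


class InMemoryLoader:
    """Simplest possible source: values already in degrees Celsius."""

    def __init__(self, values: list[float]) -> None:
        self._values = values

    def load(self) -> list[float]:
        return list(self._values)


class TenthsLoader:
    """Stand-in for a new source (for example a cloud bucket) storing tenths of a degree."""

    def __init__(self, raw_tenths: list[int]) -> None:
        self._raw = raw_tenths

    def load(self) -> list[float]:
        # Unit conversion lives in ingest, so the core never sees the raw format.
        return [t / 10 for t in self._raw]


def count_suitable(loader: DataLoader, max_temp_c: float) -> int:
    """Core-side logic: unchanged no matter which loader supplies the values."""
    return sum(1 for t in loader.load() if t <= max_temp_c)


from_memory = count_suitable(InMemoryLoader([10.0, 30.0]), max_temp_c=25.0)
from_new_source = count_suitable(TenthsLoader([100, 300]), max_temp_c=25.0)
```

Adding `TenthsLoader` required no change to `count_suitable`, which is the property the boundary is meant to protect.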

Audit-Friendly

Runs stay auditable and reproducible because every input enters through a single boundary, which keeps data provenance clear.
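As an illustration of provenance at the boundary, a run fingerprint could hash the configuration and the ingested input. The pipeline does emit a `fingerprint.json`, but its contents are not described here, so the fields and the `fingerprint` function below are assumptions.

```python
import hashlib
import json


def fingerprint(config: dict, input_bytes: bytes) -> dict:
    """Hypothetical run fingerprint: hashes of the config and of the ingested input."""
    # Canonical JSON (sorted keys, fixed separators) so key order cannot change the hash.
    canonical_config = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return {
        "config_sha256": hashlib.sha256(canonical_config.encode("utf-8")).hexdigest(),
        "input_sha256": hashlib.sha256(input_bytes).hexdigest(),
    }


sample_input = b"cell,temp\na,10\n"
fp_a = fingerprint({"roi": "demo", "max_temp_c": 35}, sample_input)
fp_b = fingerprint({"max_temp_c": 35, "roi": "demo"}, sample_input)
fp_changed = fingerprint({"roi": "demo", "max_temp_c": 35}, b"cell,temp\na,11\n")
```

Two runs with the same configuration and input produce the same fingerprint, while any change to the input produces a different `input_sha256`.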