Mental Model & Design Choices#

This page explains why HydroModPy is split into several layers instead of collapsing everything into one execution object.

Use it when the question is not only “what is this object?” but also “why does this boundary exist?”.

The repository also contains a developer glossary under docs/developers/glossary.md for shorter term-by-term definitions.

For package-by-package reading guidance, see the Code Reading Guide.

At a glance#

The main execution path is:

TOML
-> workflow
-> Project
-> SimulationPlanner
-> SimulationPlan (ProcessRun...)
-> Pipeline
-> SimulationRunner
-> SolverAdapter
-> concrete solver
-> SimulationCatalog
-> Run

The main input-data path is:

TOML
-> DataLoadPlan
-> Variable
-> Manager
-> Source
-> DataCatalogDuckDB

The first path answers: “how does a configuration become a persisted result?”

The second path answers: “where do the input data come from, and how are they cached?”

Overview diagram#

The diagram below is the compact visual version of those two paths.

@startuml
title HydroModPy Overview - TOML to Run Component Diagram
left to right direction
skinparam componentStyle rectangle
skinparam wrapWidth 160
skinparam maxMessageSize 160

package "User entry points" {
  component "TOML configuration" as Toml
  component "CLI workflows\nhmp run / overview / calibration" as Cli
  component "Python API\nProject(...)" as PythonApi
}

package "Facade and planning" {
  component "Project\npublic facade" as Project
  component "SimulationPlanner" as Planner
  component "SimulationPlan\nProcessRun..." as Plan
  component "Pipeline\nstep orchestration" as Pipeline
}

package "Input data loading" {
  component "DataLoadPlan" as DataLoadPlan
  component "Variable / Manager layer" as Managers
  component "Sources\nAPIs / files / custom" as Sources
  database "DataCatalogDuckDB\ninput cache" as DataCatalog
}

package "Solver execution" {
  component "SimulationRunner" as SimulationRunner
  component "SolverAdapter registry" as AdapterRegistry
  component "Concrete solver\nMODFLOW-NWT / MODFLOW 6 /\nBoussinesq" as Solver
}

package "Persistence and reading" {
  database "SimulationCatalog" as SimulationCatalog
  component "Run\nread handle" as Run
}

Toml --> Cli : selects workflow
Toml --> PythonApi : provides config
Cli --> Project : load + dispatch
PythonApi --> Project : instantiate + run

Project --> Planner : build plan
Planner --> Plan
Plan --> Pipeline : logical schedule

Project --> DataLoadPlan : resolve input needs
DataLoadPlan --> Managers
Managers --> Sources
Managers --> DataCatalog : read/write cache
DataCatalog --> Managers : cached inputs
Managers --> Pipeline : normalized runtime inputs

Project --> Pipeline : execute phases
Pipeline --> SimulationRunner : dispatch ProcessRun
SimulationRunner --> AdapterRegistry : resolve adapter
AdapterRegistry --> Solver

Solver --> SimulationCatalog : persist outputs
Pipeline --> SimulationCatalog : ingest metadata\nand derived artifacts
Project --> SimulationCatalog : register run

SimulationCatalog --> Run : reopen persisted result
Project --> Run : return handle
@enduml

What this diagram explains:

  • the user-facing entry points,

  • the separation between planning and execution,

  • the parallel input-data path,

  • the difference between persistence and later result reading.

What it intentionally simplifies:

  • the internal pipeline steps,

  • the detailed process runtime payloads,

  • backend-specific solver internals,

  • post-processing detail after ingestion.

Why so many layers?#

The short answer is: different parts of the system evolve at different speeds.

  • TOML contracts evolve with user workflows and frontend needs.

  • Planners and pipelines evolve with orchestration needs.

  • Solver integrations evolve with backend-specific constraints.

  • Input data sources evolve with APIs, formats, and preprocessing rules.

  • Result reading evolves with post-processing and comparison workflows.

If these concerns all live in one place, small changes propagate too far. The current split tries to reduce that coupling.

Why Project exists#

Project is the user-facing facade.

It exists to provide:

  • one simple Python entry point,

  • one common language across CLI, scripts, and notebooks,

  • one place that composes planning, execution, and persistence.

Without this facade, most users would have to manipulate lower-level objects directly, even though those objects exist for orchestration rather than for ergonomic use.
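The facade idea can be illustrated with a minimal sketch. Every class below (ToyPlanner, ToyExecutor, ToyCatalog, ToyProject) and the process names are hypothetical stand-ins written for this page; they are not HydroModPy's actual classes or signatures. The sketch only shows one object composing planning, execution, and persistence so that the caller touches a single entry point.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class ToyPlan:
    """Hypothetical stand-in for a plan: ordered execution-unit names."""
    units: tuple[str, ...]


class ToyPlanner:
    def build(self, config: dict) -> ToyPlan:
        # One unit per configured process, in declaration order.
        return ToyPlan(units=tuple(config.get("processes", ())))


class ToyExecutor:
    def execute(self, plan: ToyPlan) -> dict[str, str]:
        # Pretend every unit ran and produced one output.
        return {unit: f"output of {unit}" for unit in plan.units}


class ToyCatalog:
    def __init__(self) -> None:
        self._runs: dict[int, dict[str, str]] = {}

    def register(self, outputs: dict[str, str]) -> int:
        run_id = len(self._runs) + 1
        self._runs[run_id] = outputs
        return run_id

    def read(self, run_id: int) -> dict[str, str]:
        return self._runs[run_id]


class ToyProject:
    """Facade: the only object a caller has to touch."""

    def __init__(self, config: dict) -> None:
        self._config = config
        self._planner = ToyPlanner()
        self._executor = ToyExecutor()
        self.catalog = ToyCatalog()

    def run(self) -> int:
        # Compose the three concerns in one place.
        plan = self._planner.build(self._config)
        outputs = self._executor.execute(plan)
        return self.catalog.register(outputs)


# Made-up process names, for illustration only.
project = ToyProject({"processes": ["recharge", "groundwater"]})
run_id = project.run()
stored = project.catalog.read(run_id)
```

The caller never constructs a planner or an executor. Either could be replaced inside ToyProject without changing the calling script, which is the property the facade is meant to protect.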

Three distinctions that matter#

workflow vs SimulationPlan vs Pipeline#

These three concepts answer different questions:

  • workflow: which user-facing mode was requested?

  • SimulationPlan: which execution units must run, and in what logical order?

  • Pipeline: how does the technical execution advance, step by step?

This separation matters because it prevents the CLI contract from being tightly bound to one specific internal implementation.
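A small sketch may help keep the three questions apart. The workflow names "run" and "overview" appear in the overview diagram; the builder table, the function names, and the step names "prepare", "solve", and "ingest" are assumptions invented for illustration and do not describe the real pipeline steps.

```python
from __future__ import annotations

from typing import Callable

# Workflow: the user-facing mode. Only its name crosses the CLI boundary.
# The builder table itself is hypothetical.
PLAN_BUILDERS: dict[str, Callable[[dict], list[str]]] = {
    "run": lambda cfg: list(cfg["processes"]),
    "overview": lambda cfg: [],  # nothing to simulate in this toy version
}


def build_plan(workflow: str, config: dict) -> list[str]:
    """Plan: which units must run, and in which logical order."""
    return PLAN_BUILDERS[workflow](config)


def run_pipeline(plan: list[str]) -> list[str]:
    """Pipeline: how execution advances, step by step, for each unit."""
    log: list[str] = []
    for unit in plan:
        for step in ("prepare", "solve", "ingest"):  # hypothetical steps
            log.append(f"{unit}:{step}")
    return log


plan = build_plan("run", {"processes": ["a", "b"]})
log = run_pipeline(plan)
```

Because the workflow name is only a key into the builder table, the internal step list can change without altering what the user types.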

ProcessRun vs Run#

ProcessRun is a planned execution unit before runtime.

Run is a read handle over a persisted simulation result.

Keeping these separate avoids ambiguity between:

  • something that still has to be executed,

  • something that has already been written and can be queried again.
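The difference can be sketched with two unrelated types. PlannedUnit and ResultHandle below are hypothetical names playing the roles of ProcessRun and Run; their fields, the dictionary store, and the "mean_head" value are assumptions made for this example only.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class PlannedUnit:
    """Plays the role of ProcessRun: describes work that has not run yet."""
    unit_id: str
    solver: str


class ResultHandle:
    """Plays the role of Run: reads outputs that were already written."""

    def __init__(self, store: dict[str, dict[str, float]], key: str) -> None:
        if key not in store:
            raise KeyError(f"no persisted result for {key!r}")
        self._store = store
        self._key = key

    def value(self, name: str) -> float:
        return self._store[self._key][name]


store: dict[str, dict[str, float]] = {}
unit = PlannedUnit(unit_id="u1", solver="toy")

# Before execution there is a plan entry but nothing to read.
try:
    ResultHandle(store, unit.unit_id)
    readable_before = True
except KeyError:
    readable_before = False

# "Execution" writes a result; only then does a read handle make sense.
store[unit.unit_id] = {"mean_head": 12.5}  # made-up value
handle = ResultHandle(store, unit.unit_id)
```

Neither type can stand in for the other: the planned unit has no data to query, and the handle cannot be constructed until something has been persisted.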

SimulationCatalog vs DataCatalogDuckDB#

HydroModPy keeps two persistent stores because they do not have the same lifecycle:

  • DataCatalogDuckDB caches input data that may be reused by many runs.

  • SimulationCatalog tracks outputs that belong to particular runs.

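The two stores are related by provenance, not identity: a run's outputs can be traced back to the cached inputs they were computed from, but a cached input is never itself a run output.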
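As a rough illustration of the two lifecycles, the sketch below uses two plain dictionaries. The key format, the placeholder payload, and the function names are all assumptions; the real catalogs are database-backed and are not modelled here.

```python
from __future__ import annotations

# Input cache: keyed by what the data is, independent of any run.
input_cache: dict[str, list[float]] = {}

# Run records: keyed by run, each listing the input keys it consumed.
run_records: dict[str, dict] = {}


def load_input(key: str) -> list[float]:
    """Return cached data, fetching (here: faking) it on first use."""
    if key not in input_cache:
        input_cache[key] = [1.0, 2.0, 3.0]  # placeholder payload
    return input_cache[key]


def execute_run(run_id: str, input_keys: list[str]) -> None:
    """Compute a toy output and record which inputs were used."""
    total = sum(sum(load_input(k)) for k in input_keys)
    run_records[run_id] = {"inputs": list(input_keys), "output": total}


execute_run("run-1", ["dem:site-a"])
execute_run("run-2", ["dem:site-a"])
```

Two runs produce two run records but only one cache entry. The link between the stores is the list of input keys kept in each run record.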

Another distinction worth preserving#

Variable, Manager, and Source also solve different problems:

  • Variable: what kind of scientific data is needed?

  • Manager: what loading policy should be applied?

  • Source: where does the concrete data come from?

This lets HydroModPy change a provider without renaming the scientific concept, or change the loading policy without changing the source contract.
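The three roles can be sketched as follows. ToyVariable, ToySource, FileSource, ApiSource, and CachingManager are hypothetical names, and the fetch-once policy is just one possible loading policy chosen for the example; none of this reflects HydroModPy's actual interfaces.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Protocol


@dataclass(frozen=True)
class ToyVariable:
    """What is needed: a scientific concept, independent of any provider."""
    name: str
    unit: str


class ToySource(Protocol):
    """Where it comes from: one concrete provider."""

    def fetch(self, variable: ToyVariable) -> list[float]: ...


class FileSource:
    def fetch(self, variable: ToyVariable) -> list[float]:
        return [1.0, 2.0]  # placeholder data


class ApiSource:
    def fetch(self, variable: ToyVariable) -> list[float]:
        return [1.5, 2.5]  # placeholder data


@dataclass
class CachingManager:
    """How it is loaded: here, a fetch-once policy."""
    source: ToySource
    fetch_count: int = 0
    _cache: dict[str, list[float]] = field(default_factory=dict)

    def load(self, variable: ToyVariable) -> list[float]:
        if variable.name not in self._cache:
            self._cache[variable.name] = self.source.fetch(variable)
            self.fetch_count += 1
        return self._cache[variable.name]


precipitation = ToyVariable(name="precipitation", unit="mm/day")
from_file = CachingManager(FileSource())
from_api = CachingManager(ApiSource())
first = from_file.load(precipitation)
second = from_file.load(precipitation)  # served from the manager's cache
```

The same ToyVariable is passed to both managers, so swapping the provider does not rename the scientific concept, and the caching policy lives in one class regardless of the source.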

Hydrographic Network Naming#

The hydrographic-network work is a good example of why HydroModPy keeps separate concepts for:

  • loaded input data,

  • generated geographic products,

  • persisted run features,

  • downstream display and comparison views.

The canonical persisted names are now:

  • hydrographic_network_reference for the network loaded from data.hydrography

  • hydrographic_network_generated for the network derived from geographic.river_network

The feature-store contract keeps only the canonical names. Historical filenames may still exist on disk, but they are not feature aliases:

  • river_network.shp remains the generated-network vector filename.

  • river_network_summary.json remains the generated-network summary filename.

  • streams.shp remains the reference vector filename produced by some hydrography inputs.

  • hydrography_streams is the canonical reference forcing-raster name.

This split is intentional. A manager may still write a historical filename on disk, while the runtime and comparison layers rely on the canonical feature names to avoid ambiguity.
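The naming rule can be written down as a lookup table. The feature names and filenames below are the ones listed above; the table layout, the "vector" and "summary" keys, and the resolve_file helper are hypothetical and only illustrate the "canonical names only, no aliases" contract.

```python
from __future__ import annotations

# Canonical feature name -> historical on-disk filenames, as listed above.
# streams.shp is only produced by some hydrography inputs.
CANONICAL_FILES: dict[str, dict[str, str]] = {
    "hydrographic_network_reference": {"vector": "streams.shp"},
    "hydrographic_network_generated": {
        "vector": "river_network.shp",
        "summary": "river_network_summary.json",
    },
}


def resolve_file(feature: str, kind: str) -> str:
    """Map a canonical feature name to its filename; reject anything else."""
    if feature not in CANONICAL_FILES:
        raise KeyError(f"{feature!r} is not a canonical feature name")
    return CANONICAL_FILES[feature][kind]


generated_vector = resolve_file("hydrographic_network_generated", "vector")

# A historical filename stem is not accepted as a feature alias.
try:
    resolve_file("river_network", "vector")
    alias_accepted = True
except KeyError:
    alias_accepted = False
```

Lookups go from canonical name to filename and never the other way round, so a historical filename cannot be mistaken for a feature name.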

Hydrographic Network Class Structure#

The hydrographic-network stack is intentionally split across several classes because they do not answer the same question.

@startuml
title Hydrographic Network - Class Responsibilities
left to right direction
skinparam classAttributeIconSize 0
skinparam wrapWidth 180

class "LoadResult\nhydrography" as HydrographyLoadResult {
  +fields
}

class "FieldRecord\nhydrography_streams" as HydrographyFieldRecord {
  +data
  +metadata.raster_path
  +metadata.vector_path
}

class RiverNetworkProducts {
  +streams_tif
  +active_streams_tif
  +network_shp
  +summary_json
  +river_mesh_trace
  --
  +hydrographic_network_generated_shp
  +hydrographic_network_generated_summary_json
}

class RiverMeshTrace

class HydrographicNetwork {
  +role
  +vector_path
  +raster_path
  +crs
  +metrics
  +metadata
  +river_mesh_trace
  --
  +from_hydrography_load_result(...)
  +from_river_network_products(...)
}

class HydrographicNetworks {
  +reference
  +generated
  +simulated_active
}

class HydrographicNetworkComparison {
  +reference_gdf
  +candidate_gdf
  +reference_missing_gdf
  +candidate_extra_gdf
  +reference_coverage_ratio
  +candidate_match_ratio
  +length_f1_ratio
  +to_metrics_record(...)
}

class StreamNetworkMetrics {
  +cell_field_network_distance_metrics(...)
}

class GeographicDerivedFeatures {
  +surface_topo
  +boundaries
  +rivers
  +hydrographic_networks
}

class Run {
  +available_hydrographic_network_roles()
  +has_hydrographic_network(role)
  +hydrographic_network(role)
  +hydrographic_network_comparison(...)
  +cell_field_active_mask(...)
  +cell_field_active_metrics(...)
  +cell_field_network_overlap_metrics(...)
  +cell_field_network_distance_metrics(...)
  +release_flux_network_overlap_metrics(...)
  +release_flux_network_distance_metrics(...)
}

HydrographyLoadResult --> HydrographyFieldRecord : contains
HydrographyFieldRecord --> HydrographicNetwork : converted into\nrole="reference"
RiverNetworkProducts --> HydrographicNetwork : converted into\nrole="generated"
RiverNetworkProducts --> RiverMeshTrace : exposes
HydrographicNetwork --> RiverMeshTrace : may carry one
HydrographicNetworks o-- HydrographicNetwork : bundles
GeographicDerivedFeatures o-- RiverNetworkProducts : technical bundle
GeographicDerivedFeatures o-- HydrographicNetworks : canonical bundle
Run --> HydrographicNetworks : reads persisted roles
Run --> HydrographicNetworkComparison : computes
Run --> StreamNetworkMetrics : delegates distance metrics
HydrographicNetworkComparison --> HydrographicNetwork : compares two networks

note bottom of RiverNetworkProducts
Legacy-compatible technical output bundle
from geographic.river_network preprocessing
end note

note bottom of HydrographicNetwork
Canonical cross-layer concept used by
storage, display and comparison
end note

note bottom of Run
User-facing read facade over persisted runs.
Comparison is only available when both
reference and generated roles exist.
end note

note bottom of StreamNetworkMetrics
Module hydromodpy.results.views (lazy distance metrics).
The current distance metric is planar and
cell-based; it is not yet the downslope
DEM-routing criterion.
end note

@enduml

The key distinction is:

  • HydrographicNetwork is the canonical cross-layer concept for one network.

  • HydrographicNetworks is only a bundle of available roles for one site/run.

  • HydrographicNetworkComparison is the result of comparing two networks.

  • RiverNetworkProducts remains the technical output bundle of the geographic.river_network preprocessing step.

Put differently:

  • one loaded reference network becomes HydrographicNetwork(role="reference")

  • one DEM-derived network becomes HydrographicNetwork(role="generated")

  • the preprocessing code may still first emit RiverNetworkProducts

  • the runtime then groups available roles in HydrographicNetworks

  • the Run facade exposes reading and comparison operations over the persisted networks

This is why HydroModPy does not use one “god object” for hydrographic networks. It keeps:

  • one class for the canonical concept,

  • one class for the role bundle,

  • one class for the comparison result,

  • one technical class for the low-level generated artifacts.

The display layer stays separate. Figures consume the canonical networks and comparison payloads, but rendering is not embedded in the data classes themselves.
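The split between concept, role bundle, and comparison result can be sketched with three small dataclasses. All names ending in Sketch are hypothetical, and networks are reduced to sets of segment identifiers; the classes in the diagram carry paths, CRS, GeoDataFrames, and length-based ratios, which this simplification deliberately omits.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class NetworkSketch:
    """One network with its role (stands in for HydrographicNetwork)."""
    role: str
    segment_ids: frozenset[str]


@dataclass(frozen=True)
class NetworkBundleSketch:
    """Available roles for one site/run (stands in for HydrographicNetworks)."""
    reference: Optional[NetworkSketch] = None
    generated: Optional[NetworkSketch] = None
    simulated_active: Optional[NetworkSketch] = None

    def available_roles(self) -> list[str]:
        roles = ("reference", "generated", "simulated_active")
        return [r for r in roles if getattr(self, r) is not None]


@dataclass(frozen=True)
class ComparisonSketch:
    """Result of comparing two networks; holds no rendering logic."""
    missing_from_candidate: frozenset[str]
    extra_in_candidate: frozenset[str]


def compare(bundle: NetworkBundleSketch) -> ComparisonSketch:
    # Mirrors the rule in the diagram note: both roles must exist.
    if bundle.reference is None or bundle.generated is None:
        raise ValueError("comparison needs reference and generated roles")
    ref = bundle.reference.segment_ids
    cand = bundle.generated.segment_ids
    return ComparisonSketch(ref - cand, cand - ref)


bundle = NetworkBundleSketch(
    reference=NetworkSketch("reference", frozenset({"s1", "s2", "s3"})),
    generated=NetworkSketch("generated", frozenset({"s2", "s3", "s4"})),
)
result = compare(bundle)
```

The simulated_active slot exists in the bundle but stays empty here, which matches a contract that reserves a role before anything populates it.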

Simulated-Active Role Status#

The hydrographic-network contract already reserves a third scientific role:

  • simulated_active

This role is different from the loaded reference and the DEM-derived generated network. It would describe the network that emerges from simulated drainage or stream-activity fields such as accumulation_flux and outflow_drain.

The role exists in the class contract, but it is not auto-populated yet. HydroModPy already persists the raw simulated fields and computes useful summaries such as:

  • run.drainage_density()

  • run.persistence(variable="accumulation_flux")

  • run.cell_field_active_mask()

  • run.cell_field_active_metrics()

  • run.cell_field_network_overlap_metrics()

  • run.cell_field_network_distance_metrics()

  • the simulated_active_network figure when the run has accumulation_flux and a plottable mesh

These are lazy result views implemented in hydromodpy.results.views: they read persisted fields, mesh geometry, and hydrographic-network roles from the run without mutating the catalog.

What is still missing is the canonical storage rule that decides which thresholded or aggregated active network should become the persisted hydrographic_network_simulated_active feature.
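A toy example shows why that rule is a real design decision rather than a detail. The functions and per-cell values below are invented for illustration and are unrelated to the implementation behind run.cell_field_active_mask(); they only demonstrate that the threshold choice changes which cells count as active.

```python
from __future__ import annotations


def active_mask(values: list[float], threshold: float) -> list[bool]:
    """Mark a cell active when its field value exceeds the threshold."""
    return [v > threshold for v in values]


def active_fraction(mask: list[bool]) -> float:
    """Share of cells flagged as active."""
    return sum(mask) / len(mask)


# Made-up per-cell values standing in for a simulated field.
cell_values = [0.0, 0.2, 0.5, 1.0, 4.0]

strict = active_mask(cell_values, threshold=0.5)
loose = active_mask(cell_values, threshold=0.0)
```

With these values the strict threshold marks two of five cells as active and the loose one marks four of five. Any persisted simulated_active feature would therefore need to record the rule that produced it.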

For the detailed inventory and next design choices, use:

Diagrams worth adding#

Not every UML diagram is worth the maintenance cost. The highest-value diagrams for this part of HydroModPy would be:

  1. A component diagram for TOML -> Run.

  2. A sequence diagram for one nominal execution.

  3. A facade-object relationship diagram for Workspace, Project, SimulationCatalog, Run, and SimulationGroup.

  4. A data-loading diagram for Variable -> Manager -> Source -> cache.

  5. A simple identifier map for sim_id, simulation.run_id, and ProcessRun.id.

The last one is especially valuable because identifier confusion is hard to fix with prose alone.

Where to go next#