Testbed Workflow Architecture#

[workflow].mode = "testbed" is an orchestration workflow for method robustness studies. It deliberately sits above domain workflows such as mesh generation or flow simulation.

The key design rule is:

testbed owns cases and evidence
child runners own domain execution

This keeps method experiments reproducible without turning the testbed package into another simulation engine.

Fig. 427: Testbed orchestration model. The testbed layer expands cases into generated child configs, delegates execution, then collects evidence artifacts.

Package Boundary#

The implementation is intentionally narrow:

| Module | Responsibility |
| --- | --- |
| `hydromodpy.analysis.testbed.config` | Validate the `[testbed]` contract, supported subject/runner pairs, cases, metrics, and path resolution. |
| `hydromodpy.analysis.testbed.runtime` | Load the base TOML, materialize generated child TOMLs, delegate child execution, extract metrics, and persist evidence files. |
| `hydromodpy.project.dispatch.workflow` | Expose `run_testbed` as the CLI adapter. |
| Child runner packages | Keep ownership of mesh generation, simulation execution, solver persistence, and future transport execution. |

Supported Runtime Pairs#

The current contract accepts only explicit subject/runner pairs:

| Subject | Runner | Generated workflow | Purpose |
| --- | --- | --- | --- |
| `mesh` | `simulation` | `[workflow].mode = "simulation"` with `[[simulation.process]]` of `type = "mesh"` | Resolution ladders, constraint sensitivity, conformity checks. |
| `flow` | `simulation` | `[workflow].mode = "simulation"` | Parameter sensitivity, boundary-condition cases, solver-option robustness. |
| `flow` | `comparison` | `[workflow].mode = "comparison"` | Pairwise comparison campaigns and method-comparison subsets. |
| `transport` | `simulation` | `[workflow].mode = "simulation"` | Transport parameter sensitivity and method robustness. |
| `transport` | `comparison` | `[workflow].mode = "comparison"` | Pairwise transport-method comparison campaigns. |

Generated children never contain [testbed]. They are ordinary child workflow TOMLs that can be opened, inspected, and in many cases run directly.
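The pair check implied by the table above can be sketched as a plain lookup. The constant names `SUPPORTED_SUBJECT_RUNNERS` and `RUNNER_WORKFLOWS` appear elsewhere on this page; their shapes here (a dict of sets and a dict of strings) and the helper `resolve_child_mode` are assumptions made for illustration, not the package's actual code.

```python
# Illustrative shapes only; the real constants may be structured differently.
SUPPORTED_SUBJECT_RUNNERS = {
    "mesh": {"simulation"},
    "flow": {"simulation", "comparison"},
    "transport": {"simulation", "comparison"},
}
RUNNER_WORKFLOWS = {"simulation": "simulation", "comparison": "comparison"}


def resolve_child_mode(subject: str, runner: str) -> str:
    """Return the child [workflow].mode for a subject/runner pair (hypothetical helper)."""
    allowed = SUPPORTED_SUBJECT_RUNNERS.get(subject)
    if allowed is None:
        raise ValueError(f"unsupported subject: {subject!r}")
    if runner not in allowed:
        raise ValueError(
            f"unsupported pair {subject!r}/{runner!r}; allowed runners: {sorted(allowed)}"
        )
    return RUNNER_WORKFLOWS[runner]
```

Rejecting unknown pairs up front, rather than falling back to a default runner, matches the "explicit pairs only" wording of the contract.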

Data Flow#

One testbed run follows this sequence:

1. Load the testbed TOML.
2. Load testbed.base_config when present.
3. Remove [testbed] from the child payload.
4. Resolve path-like values to stable paths.
5. Merge each case overlay.
6. Write one child TOML under <output_root>/_generated_configs/.
7. Persist a dry evidence set.
8. If execute = true, run each child sequentially.
9. Rewrite cases, metrics, manifest, and report after each child outcome.

This means execute = false is not a no-op. It is a planning mode that materializes the experiment and lets a user audit generated children before spending solver time.
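The planning half of the sequence (strip `[testbed]`, merge each case overlay) can be sketched on already-parsed dictionaries. This is a minimal sketch, not the runtime module itself: the case keys `name` and `overlay`, the `child_mode` parameter, and both function names are hypothetical, and path resolution and TOML writing are left out.

```python
import copy
from typing import Optional


def deep_merge(base: dict, overlay: dict) -> dict:
    """Return a new dict where overlay values win; nested dicts merge recursively."""
    merged = copy.deepcopy(base)
    for key, value in overlay.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


def plan_children(
    testbed_doc: dict,
    base_doc: Optional[dict] = None,
    child_mode: str = "simulation",
) -> dict:
    """Build one child payload per case, keyed by case name (assumed layout)."""
    # Start from the base config when one is given, else from the testbed TOML.
    source = base_doc if base_doc is not None else testbed_doc
    payload = copy.deepcopy(source)
    payload.pop("testbed", None)  # children never contain [testbed]
    payload.setdefault("workflow", {})["mode"] = child_mode

    children = {}
    for case in testbed_doc["testbed"].get("case", []):
        children[case["name"]] = deep_merge(payload, case.get("overlay", {}))
    return children
```

Because every child is a deep copy, one case's overlay cannot leak into another case or into the base configuration.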

Evidence Model#

Fig. 428: Testbed output evidence tree. The output directory is designed for auditability: child configs first, status and metrics next, manifest and report last.

The evidence files have stable roles:

testbed_plan.json: planned cases and generated config paths;
testbed_cases.csv: status and artifacts per case;
testbed_metrics.csv: configured metrics, or flattened numeric summaries;
testbed_manifest.json: machine-readable whole-run contract;
testbed_report.md: compact human summary.
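As a rough illustration of the dry evidence set, the sketch below writes a plan file and a per-case status table using only the standard library. The file names come from the list above; the column names, the `planned` status value, the `<case>.toml` naming of child configs, and the function `write_dry_evidence` are assumptions.

```python
import csv
import json
from pathlib import Path


def write_dry_evidence(output_root: Path, case_names: list) -> None:
    """Persist a dry plan and a per-case status table (column names are illustrative)."""
    config_dir = output_root / "_generated_configs"
    config_dir.mkdir(parents=True, exist_ok=True)
    rows = [
        {"case": name, "status": "planned", "config": str(config_dir / f"{name}.toml")}
        for name in case_names
    ]
    # Plan: which cases exist and where their generated configs are expected.
    (output_root / "testbed_plan.json").write_text(
        json.dumps({"cases": rows}, indent=2), encoding="utf-8"
    )
    # Cases: one row per case, rewritten later as child outcomes arrive.
    with open(output_root / "testbed_cases.csv", "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=["case", "status", "config"])
        writer.writeheader()
        writer.writerows(rows)
```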

Flow Metric Extraction#

For subject = "flow", the testbed does not parse solver files directly. It delegates to [workflow].mode = "simulation", then reopens the completed run through the result catalog. The catalog summary is flattened into flow_metrics keys that can be referenced by [[testbed.metric]].

Confirmed metric examples from the repository starter are:

flow_metrics.duration_s;
flow_metrics.n_cells;
flow_metrics.param_K;
flow_metrics.max_abs_mass_balance_percent_error;
flow_metrics.head_range_m;
flow_metrics.budget_chd_total_out for prescribed-head exchanges;
flow_metrics.budget_rcha_total_in for recharge.
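One way such keys could arise is by flattening a nested summary and joining the path segments with underscores. The sketch below assumes a nested dictionary layout and an underscore separator; neither is confirmed by this page, and `flatten_summary` is a hypothetical name.

```python
def flatten_summary(summary: dict, prefix: str = "flow_metrics") -> dict:
    """Flatten nested numeric values into '<prefix>.<path_joined_by_underscores>' keys."""
    flat = {}

    def walk(node: dict, path: list) -> None:
        for key, value in node.items():
            if isinstance(value, dict):
                walk(value, path + [key])
            elif isinstance(value, (int, float)) and not isinstance(value, bool):
                # Only numeric leaves become metrics; strings and booleans are skipped.
                flat[f"{prefix}." + "_".join(path + [key])] = value

    walk(summary, [])
    return flat
```

Keeping only numeric leaves means every flattened key can be written directly as a metric column.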

The full flow_k_sensitivity matrix was executed locally with three MODFLOW 6 children. All three completed successfully with 547 cells and zero mass-balance percent error in the extracted catalog metrics. The observed head_range_m decreased from the low-K case to the high-K case, which is the expected direction for this controlled hydraulic-conductivity sensitivity test.

Comparison children are deliberately thinner: the testbed consumes the summary returned by the comparison runner and can expose those fields through [[testbed.metric]]. The comparison workflow keeps ownership of its HTML, metrics, figures, and child simulation details.

| Case | `param_K` | `n_cells` | `head_range_m` | `budget_chd_total_out` |
| --- | --- | --- | --- | --- |
| `low_k` | `5e-06` | `547` | `61.95` | `0.05158187` |
| `reference_k` | `1e-05` | `547` | `47.55` | `0.05158152` |
| `high_k` | `2e-05` | `547` | `40.91` | `0.05158121` |
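The expected direction can be checked mechanically from the tabulated values. The snippet below is only a worked check on the three rows above, not part of the testbed package; the variable names are illustrative.

```python
# (case, param_K, head_range_m) copied from the table above.
cases = [
    ("low_k", 5e-06, 61.95),
    ("reference_k", 1e-05, 47.55),
    ("high_k", 2e-05, 40.91),
]

# Order by hydraulic conductivity, then test that head range strictly decreases.
ordered = sorted(cases, key=lambda row: row[1])
head_ranges = [row[2] for row in ordered]
is_decreasing = all(a > b for a, b in zip(head_ranges, head_ranges[1:]))

# Relative drop in head range between the lowest-K and highest-K cases.
relative_drop = (head_ranges[0] - head_ranges[-1]) / head_ranges[0]
```

For these rows the head range falls by roughly a third across a fourfold increase in K.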

Extension Point#

Adding a new subject such as transport should follow the existing contract instead of introducing a one-off runner convention.

The minimum changes are:

Add the subject name to SUPPORTED_SUBJECTS.
Add allowed runner pairs to SUPPORTED_SUBJECT_RUNNERS.
Add runner-to-child-workflow mapping in RUNNER_WORKFLOWS.
Add one launcher branch in TestbedLauncher._run_case.
Add metric extraction only through runner summaries or persisted result stores.
Add dry-plan tests, execution tests with fake runners, and one documented example.

The main invariant should stay intact: testbed remains an evidence layer, not a physics layer.
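An execution test with a fake runner might look like the sketch below. It assumes the launcher can be handed a runner callable; the standalone `run_case` function is a simplified stand-in for the `TestbedLauncher._run_case` branch, and the outcome fields are hypothetical.

```python
def run_case(case_name: str, child_config: dict, runner) -> dict:
    """Delegate one child to a runner callable and record its outcome (hypothetical)."""
    try:
        summary = runner(child_config)
    except Exception as exc:  # the evidence layer records failures, it does not hide them
        return {"case": case_name, "status": "failed", "error": str(exc), "summary": {}}
    return {"case": case_name, "status": "completed", "error": "", "summary": summary}


def fake_runner(child_config: dict) -> dict:
    """Stand-in for a domain runner: no solver, just a deterministic summary."""
    if child_config.get("fail"):
        raise RuntimeError("simulated solver failure")
    return {"n_cells": 547}


ok = run_case("reference_k", {}, fake_runner)
bad = run_case("broken", {"fail": True}, fake_runner)
```

Because the fake runner never touches a solver, the test exercises only case bookkeeping, which is the part the testbed is meant to own.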

Failure Semantics#

Metric declarations can set required = true. A missing required metric turns the child outcome into an explicit failure and writes that error into the manifest. With continue_on_error = false, the launcher re-raises after persisting the failed case.

This is useful for robustness studies because silent metric loss is worse than a failed case: a matrix is only comparable if each declared evidence column has the intended meaning across cases.
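The required-metric rule can be sketched as follows. The declaration fields `name` and `key`, the exception class, and both functions are hypothetical; only `required` and `continue_on_error` are taken from the text above. The manifest is passed in and mutated so the failed case is recorded before any re-raise.

```python
class MissingMetricError(Exception):
    """Raised when a metric declared with required = true is absent."""


def extract_metrics(declarations: list, available: dict) -> dict:
    """Pick declared metrics from a child's flattened summary."""
    values = {}
    for decl in declarations:
        if decl["key"] in available:
            values[decl["name"]] = available[decl["key"]]
        elif decl.get("required", False):
            raise MissingMetricError(f"required metric missing: {decl['key']}")
    return values


def run_matrix(cases: list, declarations: list, manifest: list,
               continue_on_error: bool = True) -> None:
    """Append one manifest entry per case; optionally stop at the first failure."""
    for name, available in cases:
        try:
            metrics = extract_metrics(declarations, available)
        except MissingMetricError as exc:
            # Persist the failure first, then decide whether to stop.
            manifest.append({"case": name, "status": "failed", "error": str(exc)})
            if not continue_on_error:
                raise
        else:
            manifest.append({"case": name, "status": "completed", "metrics": metrics})
```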