Test Families and Quality Roles#
HydroModPy splits tests into several families because they answer different questions. This page is the authoritative inventory of those families, the routine commands, and the right family to consult when a failure has to be interpreted.
For the architectural split between reusable benchmark logic
(validation_cases/) and pytest-facing assertions
(tests/validation/), see tests/validation/ vs validation_cases/ below.
Quality ladder#
| Family | Main question | Typical scope | Routine command |
|---|---|---|---|
| unit | Does one local function, class, or schema behave correctly? | Isolated Python logic, local contracts, small fixtures. | |
| integration | Do several subsystems still cooperate? | Cross-module workflows without golden files. | |
| e2e | Can one user-facing scenario complete from start to finish? | CLI / workspace lifecycle, export-import, resume cycles. | |
| regression (fast) | Did a known workflow output drift? | Full workflows compared to committed golden signatures. | |
| regression (extensive) | Same, with heavier fixtures. | Wider workflow coverage; pre-merge / pre-release. | |
| validation (analytical) | Does the numerical result match a trusted reference? | Closed-form, semi-analytical, or stress benchmarks. | |
| validation (MMS) | Does the discrete scheme converge as theory predicts? | Manufactured solutions, refinement studies. | |
| validation (numerical) | Robustness on cases without a clean closed form. | Multi-backend or PETSc-backed stress cases. | |
| validation (calibration twin) | Can the inverse chain recover a known synthetic truth? | Parameter materialization, optimizer orchestration, recovery metrics. | |
| solver_sanity | Is the external solver itself correct? | MODFLOW vs Theis, Hantush, Ogata-Banks; flopy direct models. | |
| validation_cases/ | Can one benchmark be run outside pytest? | Reusable runners, figure-first diagnosis, report refresh. | |
The split keeps four concerns separable: software correctness, workflow stability, numerical consistency, and scientific validity.
Practical notes:
- hmp test currently wraps the unit, regression, and validation suites.
- pytest remains the direct entry point for the integration, e2e, MMS, and marker-based subsets.
- The PETSc subset is Linux-only. On Windows, run it through WSL via install/enter_wsl_dev.sh.
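The Linux-only constraint on the PETSc subset can be expressed as a small platform check. The sketch below is only an illustration: the helper name `petsc_skip_reason` is hypothetical and is not taken from HydroModPy, and the real suite may gate the subset differently.

```python
import sys
from typing import Optional


def petsc_skip_reason(platform: str = sys.platform) -> Optional[str]:
    """Return a skip reason when the PETSc subset cannot run natively, else None.

    Hypothetical helper: inside WSL, sys.platform reports "linux", so the subset runs.
    """
    if platform.startswith("linux"):
        return None
    if platform.startswith("win"):
        return "PETSc subset is Linux-only; run it through WSL via install/enter_wsl_dev.sh"
    return "PETSc subset is Linux-only"


reason = petsc_skip_reason()
print("PETSc subset runnable here" if reason is None else f"skip: {reason}")
```

In a test module, the returned reason could feed `pytest.mark.skipif(reason is not None, reason=reason or "")`, so the subset is skipped with an actionable message instead of failing on import.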
What each family covers#
Unit tests (tests/unit/) protect Pydantic schemas, helpers,
adapters, planners, and small runtime contracts. They do not exercise
launcher behavior or persisted outputs.
Integration tests (tests/integration/) exercise realistic
workflows without relying on golden datasets, catching boundary
mistakes between configuration, orchestration, catalog, and post-run
layers. They do not check long-term output stability.
End-to-end tests (tests/e2e/) verify a full user-visible scenario:
project creation roundtrips, export-import, full simulation or
calibration cycles, restart and resume.
Regression tests (tests/regression/{fast,extensive}/) compare
current outputs to committed signatures under
tests/regression/reference/golden_references/. A failure means
the workflow changed; it does not automatically mean the workflow
became wrong.
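A minimal sketch of what comparing outputs to a committed signature can look like. It assumes a signature is a SHA-256 digest of a canonical JSON rendering of rounded scalar outputs; the names `signature` and `check_against_golden`, the output keys, and the rounding to 9 significant digits are all hypothetical, and the actual HydroModPy signature format may differ.

```python
import hashlib
import json


def signature(outputs: dict[str, float], digits: int = 9) -> str:
    """Hash a canonical JSON rendering of rounded scalar outputs."""
    # Rounding to significant digits (hypothetical choice) absorbs float noise.
    rounded = {key: float(f"{value:.{digits}g}") for key, value in outputs.items()}
    payload = json.dumps(rounded, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def check_against_golden(outputs: dict[str, float], golden: str) -> bool:
    """Return True when the current outputs still match the committed signature."""
    return signature(outputs) == golden


baseline = {"mean_head_m": 12.3456789012, "outflow_m3_s": 0.00123456789}
golden = signature(baseline)  # would normally be read from a committed file

# Noise below the rounding threshold does not count as drift.
assert check_against_golden({**baseline, "mean_head_m": 12.3456789012 + 1e-13}, golden)
# A real change does, whether or not the new value is "more correct".
assert not check_against_golden({**baseline, "mean_head_m": 12.35}, golden)
```

The check only reports that the output changed. Deciding whether to fix the code or refresh the golden reference remains a human judgment.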
Scientific validation tests (tests/validation/) compare
solver-backed or calibration-backed results to trusted references.
Subfamilies:
- analytical: closed-form / semi-analytical comparisons.
- MMS: discrete convergence and order of accuracy.
- numerical: stress and multi-backend cases without a clean closed form (PETSc-backed Boussinesq overflow, headwater cases).
- calibration twin: inverse chain on synthetic truth, optimizer orchestration, recovery metrics.
- solver_sanity: external solver against analytical references; may deliberately not validate the HydroModPy launcher.
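To make the MMS idea concrete, here is a minimal sketch of an observed-order check. It uses a generic 1D Poisson problem with the manufactured solution u(x) = sin(pi x), not a HydroModPy case; the function names and the 0.05 acceptance band are assumptions made for illustration.

```python
import math


def solve_poisson_1d(n: int) -> float:
    """Solve -u'' = pi^2 sin(pi x) on (0, 1), u(0) = u(1) = 0, with n intervals.

    Returns the max-norm error against the manufactured solution u = sin(pi x).
    """
    h = 1.0 / n
    x = [i * h for i in range(1, n)]  # interior nodes
    rhs = [h * h * math.pi**2 * math.sin(math.pi * xi) for xi in x]

    # Thomas algorithm for the tridiagonal system with stencil (-1, 2, -1).
    m = n - 1
    c_prime = [0.0] * m
    d_prime = [0.0] * m
    c_prime[0] = -1.0 / 2.0
    d_prime[0] = rhs[0] / 2.0
    for i in range(1, m):
        denom = 2.0 + c_prime[i - 1]
        c_prime[i] = -1.0 / denom
        d_prime[i] = (rhs[i] + d_prime[i - 1]) / denom
    u = [0.0] * m
    u[-1] = d_prime[-1]
    for i in range(m - 2, -1, -1):
        u[i] = d_prime[i] - c_prime[i] * u[i + 1]

    return max(abs(ui - math.sin(math.pi * xi)) for ui, xi in zip(u, x))


def observed_orders(errors: list[float], ratio: float = 2.0) -> list[float]:
    """Observed order between refinements: p = log(e_coarse / e_fine) / log(ratio)."""
    return [math.log(e0 / e1) / math.log(ratio) for e0, e1 in zip(errors, errors[1:])]


errors = [solve_poisson_1d(n) for n in (16, 32, 64)]
orders = observed_orders(errors)
# Central differences are second order, so each observed order should sit near 2.
assert all(abs(p - 2.0) < 0.05 for p in orders)
```

A test of this kind fails when the scheme stops converging at the expected rate, even if the error on any single grid still looks small.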
Reading a failure#
The family that failed tells you where to look:
- unit: one local contract or narrow behavior changed.
- integration: two or more layers no longer compose.
- e2e: one user-visible scenario broke across several steps.
- regression: one known workflow drifted from its committed signature.
- validation (analytical): one numerical result no longer matches a trusted reference within tolerance.
- validation (MMS): the discrete scheme stopped converging at the expected order.
- validation (calibration twin): the inverse chain no longer recovers controlled truth.
- solver_sanity: the external solver or its bundled binary drifted, not the orchestration.
What to run when#
- During a local refactor: unit first.
- When several layers changed together: integration.
- Before merging workflow-facing changes: regression --fast.
- Before broader release or benchmark-sensitive changes: add regression --extensive.
- Before solver, tolerance, or physics-sensitive changes: the relevant validation subset.
- Before changing one benchmark or one tolerance rationale: python -m validation_cases.run_cases ... for figure-first diagnosis.
tests/validation/ vs validation_cases/#
HydroModPy keeps reusable scientific benchmark logic separate from the pytest files that assert acceptance thresholds.
- validation_cases/ owns benchmark definition, references, metadata (metadata.toml, tolerances.toml), shared runtime helpers under validation_cases/shared/, and direct run_case.py entry points for figure-first diagnosis.
- tests/validation/ owns thin pytest entry points, marker selection (validation, steady, transient, petsc), environment-specific skipping, and explicit assertions on scalar metrics returned by the case logic.
The same case can therefore run in two modes: as an automated pytest benchmark, or as a manual diagnostic with figures and printed metrics when tolerances need rethinking.
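The division of labor can be sketched in a few lines. The example is purely illustrative: the linear-reservoir case, the function names, and the `RMSE_TOL` threshold are hypothetical and do not correspond to an actual HydroModPy benchmark. Both halves are shown in one file for brevity.

```python
import math


# --- would live under validation_cases/ (hypothetical case logic) ---
def run_linear_reservoir_case(n_steps: int = 200) -> dict[str, float]:
    """Integrate dS/dt = -S/k with implicit Euler and compare to S0 * exp(-t/k)."""
    s0, k, t_end = 100.0, 10.0, 30.0
    dt = t_end / n_steps
    storage, squared_error, max_abs_error = s0, 0.0, 0.0
    for step in range(1, n_steps + 1):
        storage = storage / (1.0 + dt / k)  # implicit Euler update
        exact = s0 * math.exp(-step * dt / k)
        squared_error += (storage - exact) ** 2
        max_abs_error = max(max_abs_error, abs(storage - exact))
    return {"rmse": math.sqrt(squared_error / n_steps), "max_abs_error": max_abs_error}


# --- would live under tests/validation/ (thin pytest entry point) ---
RMSE_TOL = 0.5  # hypothetical threshold; a real case would read it from its tolerance file


def test_linear_reservoir_matches_closed_form() -> None:
    metrics = run_linear_reservoir_case()
    assert metrics["rmse"] < RMSE_TOL


# Manual diagnostic use of the same logic: print metrics instead of asserting.
if __name__ == "__main__":
    print(run_linear_reservoir_case())
```

Because the case function returns plain scalars and knows nothing about pytest, the same code path serves both the automated assertion and the manual diagnostic run.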
Edit decision matrix:
| Editing | Where to edit |
|---|---|
| Analytical reference, deterministic setup, new metric or plotting helper | validation_cases/ |
| New marker, runtime gating, thinner assertion surface, CI selection | tests/validation/ |
| A genuinely new benchmark, contract change requiring a new asserted metric | both |
Local READMEs (validation_cases/README.md and
tests/validation/README.md) carry the case-by-case maintainer
contract; this page stays the high-level map.
Tolerances and coverage expectations#
Each numerical tolerance lives in exactly one place and is never duplicated inline:
- tests/TOLERANCES.md is the human rationale for every scientific tolerance. Single documented scalars are consumed in validation/ and regression/ through tests/_helpers/tolerances.py::tol('<slug>'), so the number exists in one place. tests/unit/test_tolerances_single_source.py guards this: every tol() call must resolve to a real row, and every documented scalar that is used inline must be referenced through tol().
- Per-case envelopes stay in validation_cases/**/tolerances*.toml and are consumed at runtime via comparison.tolerances (for example profile_tol['rmse']). These are case-owned and are not duplicated into the TOLERANCES.md consumption path.
- Machine-epsilon and purely structural tolerances in unit/ (for example atol=1e-12 on an exact linear solve) may stay inline with a one-line rationale comment; do not force them through the table.
Coverage is gated by Codecov, not by pyproject.toml:
- The real gate is codecov.yml: patch target 80 % (the diff you add must be >= 80 % covered) and project target auto (overall coverage must not drop). One Codecov flag per tier, carryforward: true, so moving a test between tiers keeps its coverage.
- [tool.coverage] fail_under = 80 in pyproject.toml is not a CI gate (the unit job runs with --cov-fail-under=0); treat it as a local hint only.
- New tests should raise coverage by asserting real behavior or a physical/mathematical invariant. Never add a test purely to move the number.