Test Families and Quality Roles

HydroModPy splits tests into several families because they answer different questions. This page is the authoritative inventory of those families, the routine commands, and the right family to consult when a failure has to be interpreted.

For the architectural split between reusable benchmark logic (validation_cases/) and pytest-facing assertions (tests/validation/), see tests/validation/ vs validation_cases/ below.

Quality ladder

| Family | Main question | Typical scope | Routine command |
| --- | --- | --- | --- |
| `unit` | Does one local function, class, or schema behave correctly? | Isolated Python logic, local contracts, small fixtures. | `hmp test unit` |
| `integration` | Do several subsystems still cooperate? | Cross-module workflows without golden files. | `pytest tests/integration -q` |
| `e2e` | Can one user-facing scenario complete from start to finish? | CLI / workspace lifecycle, export-import, resume cycles. | `pytest tests/e2e -q` |
| `regression` (fast) | Did a known workflow output drift? | Full workflows compared to committed golden signatures. | `hmp test regression --fast` |
| `regression` (extensive) | Same, with heavier fixtures. | Wider workflow coverage; pre-merge / pre-release. | `hmp test regression --extensive` |
| `validation` (analytical) | Does the numerical result match a trusted reference? | Closed-form, semi-analytical, or stress benchmarks. | `hmp test validation --fast` / `--steady` / `--transient` |
| `validation` (MMS) | Does the discrete scheme converge as theory predicts? | Manufactured solutions, refinement studies. | `pytest tests/validation/mms -q` |
| `validation` (numerical) | Robustness on cases without a clean closed form. | Multi-backend or PETSc-backed stress cases. | `pytest tests/validation/numerical -q` / `pytest -m petsc -q` |
| `validation` (calibration twin) | Can the inverse chain recover a known synthetic truth? | Parameter materialization, optimizer orchestration, recovery metrics. | `pytest tests/validation/calibration -q` |
| `solver_sanity` | Is the external solver itself correct? | MODFLOW vs Theis, Hantush, Ogata-Banks; flopy direct models. | `pytest -m solver_sanity -q` |
| `validation_cases` | Can one benchmark be run outside pytest? | Reusable runners, figure-first diagnosis, report refresh. | `python -m validation_cases.run_cases ...` |

The split keeps four concerns separable: software correctness, workflow stability, numerical consistency, and scientific validity.

Practical notes:

hmp test currently wraps the unit, regression, and validation suites.
pytest remains the direct entry point for the integration, e2e, MMS, and marker-based subsets.
The PETSc subset is Linux-only. On Windows, run it through WSL via install/enter_wsl_dev.sh.
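
The routine commands in the table above can be condensed into a small lookup, which makes the split between the `hmp test` wrapper and direct `pytest` calls explicit. This is a minimal sketch: the command strings are copied from the table, but `ROUTINE_COMMANDS`, `routine_argv`, `entry_point`, and the dictionary keys are hypothetical names chosen for illustration and are not part of HydroModPy.

```python
import shlex

# Command strings copied from the quality-ladder table. The dictionary keys and
# helper names are hypothetical; HydroModPy does not ship this module.
# The validation_cases command is left out because its arguments are elided
# in the table.
ROUTINE_COMMANDS = {
    "unit": "hmp test unit",
    "integration": "pytest tests/integration -q",
    "e2e": "pytest tests/e2e -q",
    "regression-fast": "hmp test regression --fast",
    "regression-extensive": "hmp test regression --extensive",
    "validation-fast": "hmp test validation --fast",
    "validation-mms": "pytest tests/validation/mms -q",
    "validation-numerical": "pytest tests/validation/numerical -q",
    "validation-calibration": "pytest tests/validation/calibration -q",
    "solver-sanity": "pytest -m solver_sanity -q",
}


def routine_argv(family: str) -> list[str]:
    """Return the argv list for one family, ready for subprocess.run."""
    try:
        return shlex.split(ROUTINE_COMMANDS[family])
    except KeyError:
        known = ", ".join(sorted(ROUTINE_COMMANDS))
        raise ValueError(f"unknown family {family!r}; expected one of: {known}") from None


def entry_point(family: str) -> str:
    """Tell whether a family goes through the hmp wrapper or plain pytest."""
    return routine_argv(family)[0]
```

For example, `entry_point("unit")` returns `"hmp"`, while `entry_point("e2e")` returns `"pytest"`.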

What each family covers

Unit tests (tests/unit/) protect Pydantic schemas, helpers, adapters, planners, and small runtime contracts. They do not exercise launcher behavior or persisted outputs.

Integration tests (tests/integration/) exercise realistic workflows without relying on golden datasets, catching boundary mistakes between configuration, orchestration, catalog, and post-run layers. They do not check long-term output stability.

End-to-end tests (tests/e2e/) verify a full user-visible scenario: project creation roundtrips, export-import, full simulation or calibration cycles, restart and resume.

Regression tests (tests/regression/{fast,extensive}/) compare current outputs to committed signatures under tests/regression/reference/golden_references/. A failure means the workflow changed; it does not automatically mean the workflow became wrong.
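
The idea of comparing outputs to a committed signature can be sketched as follows. The actual signature format stored under tests/regression/reference/golden_references/ is not described on this page, so the rounding-then-hashing scheme below, and the names `signature` and `has_drifted`, are assumptions made purely for illustration.

```python
import hashlib
import json


def signature(outputs: dict[str, list[float]], ndigits: int = 6) -> str:
    """Hash workflow outputs after rounding, so tiny float noise is not drift.

    Assumption: outputs are named lists of floats. A real workflow would
    likely summarize arrays, tables, or files instead.
    """
    rounded = {
        name: [round(value, ndigits) for value in values]
        for name, values in outputs.items()
    }
    payload = json.dumps(rounded, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def has_drifted(outputs: dict[str, list[float]], golden_signature: str) -> bool:
    """True when the current outputs no longer match the committed signature."""
    return signature(outputs) != golden_signature
```

A `True` result only says that the output changed. Whether the new output is wrong, or the golden reference needs refreshing, is a separate judgment.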

Scientific validation tests (tests/validation/) compare solver-backed or calibration-backed results to trusted references. Subfamilies:

analytical: closed-form / semi-analytical comparisons.
MMS: discrete convergence and order of accuracy.
numerical: stress and multi-backend cases without a clean closed form (PETSc-backed Boussinesq overflow, headwater cases).
calibration twin: inverse chain on synthetic truth, optimizer orchestration, recovery metrics.
solver_sanity: the external solver checked against analytical references; by design, these tests may not validate the HydroModPy launcher at all.
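
To make the MMS subfamily concrete, here is a self-contained sketch of an order-of-accuracy check. It is not taken from HydroModPy: the 1D Poisson problem, the manufactured solution u(x) = sin(pi x), and the function names are all assumptions chosen so the example runs with the standard library alone.

```python
import math


def max_error_poisson_1d(n: int) -> float:
    """Solve -u'' = f on (0, 1) with u(0) = u(1) = 0 on n intervals.

    Manufactured solution: u(x) = sin(pi x), hence f(x) = pi^2 sin(pi x).
    Returns the maximum nodal error of the second-order central scheme.
    """
    h = 1.0 / n
    x = [i * h for i in range(1, n)]  # interior nodes only
    rhs = [h * h * math.pi**2 * math.sin(math.pi * xi) for xi in x]

    # Thomas algorithm for the tridiagonal system with stencil (-1, 2, -1).
    m = n - 1
    c_prime = [0.0] * m
    d_prime = [0.0] * m
    c_prime[0] = -1.0 / 2.0
    d_prime[0] = rhs[0] / 2.0
    for i in range(1, m):
        denom = 2.0 + c_prime[i - 1]  # b_i - a_i * c'_{i-1}, with a_i = -1
        c_prime[i] = -1.0 / denom
        d_prime[i] = (rhs[i] + d_prime[i - 1]) / denom
    u = [0.0] * m
    u[-1] = d_prime[-1]
    for i in range(m - 2, -1, -1):
        u[i] = d_prime[i] - c_prime[i] * u[i + 1]

    return max(abs(ui - math.sin(math.pi * xi)) for ui, xi in zip(u, x))


def observed_order(coarse_error: float, fine_error: float, refinement: float = 2.0) -> float:
    """Observed convergence order between two grids refined by `refinement`."""
    return math.log(coarse_error / fine_error) / math.log(refinement)


if __name__ == "__main__":
    order = observed_order(max_error_poisson_1d(16), max_error_poisson_1d(32))
    print(f"observed order: {order:.3f}")  # theory predicts an order close to 2
```

An MMS-style test would assert that the observed order stays within a small band around the theoretical one. A drop in order is usually a stronger signal than a change in the absolute error.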

Reading a failure

The family that failed tells you where to look:

unit: one local contract or narrow behavior changed.
integration: two or more layers no longer compose.
e2e: one user-visible scenario broke across several steps.
regression: one known workflow drifted from its committed signature.
validation (analytical): one numerical result no longer matches a trusted reference within tolerance.
validation (MMS): the discrete scheme stopped converging at the expected order.
validation (calibration twin): the inverse chain no longer recovers controlled truth.
solver_sanity: the external solver or its bundled binary drifted, not the orchestration.

What to run when

During a local refactor: unit first.
When several layers changed together: integration.
Before merging workflow-facing changes: regression --fast.
Before broader release or benchmark-sensitive changes: add regression --extensive.
Before solver, tolerance, or physics-sensitive changes: the relevant validation subset.
Before changing one benchmark or one tolerance rationale: python -m validation_cases.run_cases ... for figure-first diagnosis.

`tests/validation/` vs `validation_cases/`

HydroModPy keeps reusable scientific benchmark logic separate from the pytest files that assert acceptance thresholds.

validation_cases/ owns benchmark definition, references, metadata (metadata.toml, tolerances.toml), shared runtime helpers under validation_cases/shared/, and direct run_case.py entry points for figure-first diagnosis.
tests/validation/ owns thin pytest entry points, marker selection (validation, steady, transient, petsc), environment-specific skipping, and explicit assertions on scalar metrics returned by the case logic.

The same case can therefore run in two modes: as an automated pytest benchmark, or as a manual diagnostic with figures and printed metrics when tolerances need rethinking.
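
The two-mode split described above can be sketched in one file. In the real layout the two halves would live in separate packages. Everything here is hypothetical: `run_case`, the stand-in "simulated" profile, and the `RMSE_TOL` constant are illustrative and do not reproduce any actual HydroModPy benchmark.

```python
import math


# --- would live under validation_cases/ (hypothetical case logic) ----------
def run_case(n_points: int = 50) -> dict[str, float]:
    """Compare a stand-in 'simulated' profile to its closed-form reference.

    Returns scalar metrics only; plotting and printing stay outside so the
    same function serves both pytest and manual diagnosis.
    """
    xs = [i / (n_points - 1) for i in range(n_points)]
    reference = [1.0 - x for x in xs]  # linear head profile between two fixed heads
    # Stand-in for a solver result: the reference plus a small perturbation.
    simulated = [r + 1e-4 * math.sin(10 * x) for r, x in zip(reference, xs)]
    residuals = [s - r for s, r in zip(simulated, reference)]
    rmse = math.sqrt(sum(e * e for e in residuals) / len(residuals))
    return {"rmse": rmse, "max_abs_error": max(abs(e) for e in residuals)}


# --- would live under tests/validation/ (hypothetical thin entry point) ----
# In the real layout this number would be read from the case's tolerance
# file rather than written here.
RMSE_TOL = 1e-3


def test_linear_profile_matches_reference():
    metrics = run_case()
    assert metrics["rmse"] <= RMSE_TOL
    assert metrics["max_abs_error"] <= 5 * RMSE_TOL
```

The test function stays thin on purpose: it only asserts on the scalars that `run_case()` returns. A manual diagnostic could call the same function and then plot or print the metrics.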

Edit decision matrix:

| Editing | Where to edit |
| --- | --- |
| Analytical reference, deterministic setup, new metric or plotting helper | `validation_cases/` |
| New marker, runtime gating, thinner assertion surface, CI selection | `tests/validation/` |
| A genuinely new benchmark, contract change requiring a new asserted metric | both |

Local READMEs (validation_cases/README.md and tests/validation/README.md) carry the case-by-case maintainer contract; this page stays the high-level map.

Tolerances and coverage expectations

Each numerical tolerance lives in exactly one place and is never duplicated inline:

tests/TOLERANCES.md holds the human rationale for every scientific tolerance. Each documented scalar is consumed in validation/ and regression/ through tests/_helpers/tolerances.py::tol('<slug>'), so the number exists in one place. tests/unit/test_tolerances_single_source.py guards this: every tol() call must resolve to a real row, and every documented scalar that tests use must be referenced through tol() rather than repeated inline.
Per-case envelopes stay in validation_cases/**/tolerances*.toml and are consumed at runtime via comparison.tolerances (for example profile_tol['rmse']). They are case-owned and are not duplicated into the TOLERANCES.md table.
Machine-epsilon and purely structural tolerances in unit/ (for example atol=1e-12 on an exact linear solve) may stay inline with a one-line rationale comment; do not force them through the table.
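
The single-source rule can be illustrated with a small sketch. The real tests/_helpers/tolerances.py and the layout of TOLERANCES.md are not shown on this page, so the in-memory table, the slugs, the values, and the `unresolved_slugs` helper below are all assumptions. Only the `tol('<slug>')` call shape comes from the text above.

```python
import re

# Hypothetical stand-in for the rows documented in tests/TOLERANCES.md.
# Slugs and values are illustrative, not HydroModPy's real entries.
_DOCUMENTED = {
    "example_head_rmse": 1e-3,
    "example_mass_balance": 1e-6,
}

# Matches tol('slug') or tol("slug") in source text.
_TOL_CALL = re.compile(r"""\btol\(\s*['"]([^'"]+)['"]\s*\)""")


def tol(slug: str) -> float:
    """Return the documented scalar for one slug, failing loudly if unknown."""
    try:
        return _DOCUMENTED[slug]
    except KeyError:
        raise KeyError(f"tolerance slug {slug!r} has no documented row") from None


def unresolved_slugs(test_source: str) -> set[str]:
    """Slugs referenced through tol() in source text that have no documented row."""
    return {slug for slug in _TOL_CALL.findall(test_source) if slug not in _DOCUMENTED}
```

A guard test in this spirit would scan the validation and regression sources and assert that `unresolved_slugs(...)` is empty for each file.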

Coverage is gated by Codecov, not by pyproject.toml:

The real gate is codecov.yml: the patch target is 80 % (the diff you add must be >= 80 % covered) and the project target is auto (overall coverage must not drop). Each tier has its own Codecov flag with carryforward: true, so moving a test between tiers keeps its coverage.
[tool.coverage] fail_under = 80 in pyproject.toml is not a CI gate (the unit job runs with --cov-fail-under=0); treat it as a local hint only.
New tests should raise coverage by asserting real behavior or a physical/mathematical invariant. Never add a test purely to move the number.
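
As a rough illustration of what the 80 % patch target means arithmetically, the sketch below computes the covered fraction of newly added lines. It is a simplification under stated assumptions: lines are identified by number within a single file, and Codecov's own accounting (for example of partially covered lines) may differ. The function names are hypothetical.

```python
def patch_coverage(added_lines: set[int], covered_lines: set[int]) -> float:
    """Fraction of added executable lines hit by at least one test."""
    if not added_lines:
        return 1.0  # nothing to cover; treated as passing in this sketch
    return len(added_lines & covered_lines) / len(added_lines)


def passes_patch_gate(
    added_lines: set[int], covered_lines: set[int], target: float = 0.80
) -> bool:
    """True when the covered fraction of the diff meets the target."""
    return patch_coverage(added_lines, covered_lines) >= target
```

With five added lines, covering four of them meets the target exactly, while covering three does not.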