Validation

processing.weighting.validation

Validation sub-package (checksums, weight checks).

checksums: recode null detection and incidence-sum mismatch checks, covering both overcounts and undercounts (pre-balancing).
weight_checks: post-balancing sanity checks comparing weighted totals against PUMS-derived control targets.

processing.weighting.validation.checksums

Incidence-sum checksums for the weighting pipeline.

Two complementary checks share a common reporting backend:

check_recode_nulls: Runs on a recoded DataFrame before aggregation. Detects records where a control column evaluated to null (a gap in the recode mapping). Logs a warning by default; raises ValueError when strict=True.

check_incidence_sums: Runs on the household-level incidence table after aggregation (and after fractional imputation for survey data). For each non-structural control it verifies that member-column sums match the expected structural total (p_total for person controls, h_total for household controls). Mismatches always raise ValueError.

check_recode_nulls

check_recode_nulls(
    df: pl.DataFrame,
    targets: list[str],
    *,
    level: ControlLevel,
    id_col: str,
    source_label: str,
    strict: bool = False
) -> None

Check for records whose control recode evaluated to null.

Parameters:

Name	Type	Description	Default
`df`	`pl.DataFrame`	Recoded table (households or persons).	required
`targets`	`list[str]`	Control registry names to check.	required
`level`	`ControlLevel`	Which level of controls to check (`HOUSEHOLD` or `PERSON`).	required
`id_col`	`str`	Record identifier column (e.g. `"hh_id"`, `"person_id"`).	required
`source_label`	`str`	Human label for log messages (e.g. `"PUMS"` or `"survey"`).	required
`strict`	`bool`	If `True`, raise `ValueError` on null recodes instead of just logging a warning. PUMS recodes should always be strict (every person must land in exactly one category); survey recodes may tolerate nulls until imputation is implemented.	`False`
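The warn-or-raise behaviour described above can be illustrated with a minimal sketch. It is not the package's implementation: it works on plain Python dictionaries instead of a Polars DataFrame, it takes column names directly instead of resolving control registry names by `level`, and the helper names (`find_null_recodes`, `check_nulls_sketch`) and the `age_group` column are hypothetical.

```python
import logging

logger = logging.getLogger("recode_null_sketch")


def find_null_recodes(rows, control_cols, id_col):
    """Return {control column: [ids of records whose recode is None]}."""
    nulls = {}
    for col in control_cols:
        bad_ids = [row[id_col] for row in rows if row.get(col) is None]
        if bad_ids:
            nulls[col] = bad_ids
    return nulls


def check_nulls_sketch(rows, control_cols, *, id_col, source_label, strict=False):
    """Log a warning about null recodes, or raise ValueError when strict."""
    nulls = find_null_recodes(rows, control_cols, id_col)
    if not nulls:
        return
    summary = "; ".join(
        f"{col}: {len(ids)} null record(s), e.g. {ids[:3]}"
        for col, ids in nulls.items()
    )
    message = f"[{source_label}] null recodes found: {summary}"
    if strict:
        raise ValueError(message)
    logger.warning(message)


# Hypothetical recoded person table with one gap in the recode mapping.
persons = [
    {"person_id": 1, "age_group": "18-34"},
    {"person_id": 2, "age_group": None},
    {"person_id": 3, "age_group": "65+"},
]

# Non-strict mode (survey-style): logs a warning and continues.
check_nulls_sketch(
    persons, ["age_group"], id_col="person_id", source_label="survey"
)
```

Calling the same sketch with `strict=True`, as recommended for PUMS recodes, would raise `ValueError` for the record with `person_id` 2 instead of logging.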

check_incidence_sums

check_incidence_sums(
    seed: pl.DataFrame,
    targets: list[str],
    *,
    source_label: str,
    tolerance: float = 0.0
) -> None

Raise if incidence-column sums do not match the structural totals.

For person-level controls, sum({ctrl}__*) must equal p_total. For household-level controls, sum({ctrl}__*) must equal h_total (always 1). Both overcounts and undercounts are checked.

After fractional imputation, incidence values may be non-integer, so exact equality can fail. Set tolerance to a small float (e.g. 0.01) to allow for floating-point drift.

Parameters:

Name	Type	Description	Default
`seed`	`pl.DataFrame`	Household-level incidence table with structural columns (`p_total`, `h_total`) and `{ctrl}__{member}` columns.	required
`targets`	`list[str]`	Control registry names.	required
`source_label`	`str`	Label for log messages.	required
`tolerance`	`float`	Maximum acceptable absolute deviation from the expected sum. Use `0.0` for exact integer match (PUMS), `0.01` for post-imputation survey data.	`0.0`

Raises:

Type	Description
`ValueError`	If any household has an incidence-sum mismatch.
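As an illustration of the rule above, the following sketch sums the `{ctrl}__*` columns per household and compares the result with a structural column under an absolute tolerance. It is an assumption-laden stand-in, not the package's code: rows are plain dictionaries rather than a Polars DataFrame, the expected structural column is passed explicitly instead of being derived from the control's level, and the control name `age`, its member names, and the `hh_id` identifier are hypothetical.

```python
def incidence_mismatches(rows, control, expected_col, *, id_col="hh_id", tolerance=0.0):
    """Return (id, actual, expected) for households whose member sum is off."""
    prefix = f"{control}__"
    mismatches = []
    for row in rows:
        # Sum every member column belonging to this control.
        actual = sum(v for k, v in row.items() if k.startswith(prefix))
        expected = row[expected_col]
        # Absolute deviation catches both overcounts and undercounts.
        if abs(actual - expected) > tolerance:
            mismatches.append((row[id_col], actual, expected))
    return mismatches


def check_sums_sketch(rows, control, expected_col, *, source_label, tolerance=0.0):
    """Raise ValueError if any household has an incidence-sum mismatch."""
    bad = incidence_mismatches(rows, control, expected_col, tolerance=tolerance)
    if bad:
        raise ValueError(
            f"[{source_label}] {control}: {len(bad)} household(s) with "
            f"incidence-sum mismatch, e.g. {bad[:3]}"
        )


# Hypothetical household-level incidence table for a person-level control.
seed = [
    # Consistent: 1 + 2 == p_total.
    {"hh_id": 1, "p_total": 3, "h_total": 1, "age__under18": 1, "age__18plus": 2},
    # Undercount: one person landed in no category.
    {"hh_id": 2, "p_total": 2, "h_total": 1, "age__under18": 0, "age__18plus": 1},
    # Fractional values that deviate slightly from p_total.
    {"hh_id": 3, "p_total": 2, "h_total": 1, "age__under18": 0.5, "age__18plus": 1.495},
]

exact = incidence_mismatches(seed, "age", "p_total")
tolerant = incidence_mismatches(seed, "age", "p_total", tolerance=0.01)
print([hh for hh, _, _ in exact])     # households failing an exact match
print([hh for hh, _, _ in tolerant])  # households failing with tolerance=0.01
```

With `tolerance=0.0` the fractional household is reported alongside the genuine undercount; with `tolerance=0.01` only the undercount remains, which is the distinction the `tolerance` parameter is meant to draw.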

processing.weighting.validation.weight_checks

Post-balancing weight sanity checks.

Compares survey weighted totals against PUMS-derived control targets and verifies weight consistency across the table hierarchy.

weight_sanity_checks

weight_sanity_checks(
    tables: dict[str, pl.DataFrame],
    control_totals: ControlTotals,
    specs: list[ControlSpec],
    *,
    geo_col: str = "ctrl_geoid"
) -> None

Run weight sanity checks and log a summary report.
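The contents of the summary report are not specified here. As a rough sketch of the core comparison only (weighted survey totals versus control targets), the example below uses plain dictionaries and hypothetical names (`weighted_totals`, `compare_to_targets`, `weight`, `age_group`); it does not use `ControlTotals` or `ControlSpec`, and it does not cover the cross-table weight consistency checks.

```python
def weighted_totals(rows, weight_col, category_col):
    """Sum weights per category of a control column."""
    totals = {}
    for row in rows:
        key = row[category_col]
        totals[key] = totals.get(key, 0.0) + row[weight_col]
    return totals


def compare_to_targets(totals, targets):
    """Return {category: (weighted total, target, relative difference)}."""
    report = {}
    for category, target in targets.items():
        weighted = totals.get(category, 0.0)
        # Relative difference is undefined for a zero target.
        rel_diff = (weighted - target) / target if target else float("nan")
        report[category] = (weighted, target, rel_diff)
    return report


# Hypothetical weighted person records and control targets.
persons = [
    {"person_id": 1, "weight": 120.0, "age_group": "under18"},
    {"person_id": 2, "weight": 80.0, "age_group": "18plus"},
    {"person_id": 3, "weight": 100.0, "age_group": "18plus"},
]
targets = {"under18": 100.0, "18plus": 200.0}

report = compare_to_targets(
    weighted_totals(persons, "weight", "age_group"), targets
)
for category, (weighted, target, rel_diff) in report.items():
    print(f"{category}: weighted={weighted:.1f} target={target:.1f} diff={rel_diff:+.1%}")
```

In a real run the comparison would presumably be repeated per geography (the `geo_col` argument) and per control, but that grouping is left out of this sketch.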