
Validation

processing.weighting.validation

Validation sub-package (checksums, weight checks).

  • checksums: recode null detection and incidence-sum mismatch checks, covering both overcounts and undercounts (pre-balancing).
  • weight_checks: post-balancing sanity checks comparing weighted totals against PUMS-derived control targets.

processing.weighting.validation.checksums

Incidence-sum checksums for the weighting pipeline.

Two complementary checks share a common reporting backend:

check_recode_nulls: Runs on a recoded DataFrame before aggregation. Detects records where a control column evaluated to null (a gap in the recode mapping). Logs a warning by default; raises ValueError when strict=True.

check_incidence_sums: Runs on the household-level incidence table after aggregation (and after fractional imputation for survey data). For each non-structural control it verifies that the member-column sums match the expected structural total (p_total for person controls, h_total for household controls). Mismatches always raise ValueError.

check_recode_nulls

check_recode_nulls(
    df: pl.DataFrame,
    targets: list[str],
    *,
    level: ControlLevel,
    id_col: str,
    source_label: str,
    strict: bool = False
) -> None

Check for records whose control recode evaluated to null.

Parameters:

  • df (pl.DataFrame, required): Recoded table (households or persons).
  • targets (list[str], required): Control registry names to check.
  • level (ControlLevel, required): Which level of controls to check (HOUSEHOLD or PERSON).
  • id_col (str, required): Record identifier column (e.g. "hh_id", "person_id").
  • source_label (str, required): Human label for log messages (e.g. "PUMS" or "survey").
  • strict (bool, default False): If True, raise ValueError on null recodes instead of just logging a warning. PUMS recodes should always be strict (every person must land in exactly one category); survey recodes may tolerate nulls until imputation is implemented.
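The warn-or-raise behaviour described above can be illustrated with a minimal, dependency-free sketch. It is not the package's implementation: the real function takes a pl.DataFrame, whereas this version uses a list of dicts so it runs with the standard library alone. The helper names find_null_recodes and report_null_recodes, and the age_group control column, are hypothetical.

```python
import logging

logger = logging.getLogger("validation_sketch")


def find_null_recodes(rows, control_cols, id_col):
    """Return {control column: [ids of records whose recode is None]}."""
    gaps = {}
    for col in control_cols:
        ids = [row[id_col] for row in rows if row.get(col) is None]
        if ids:
            gaps[col] = ids
    return gaps


def report_null_recodes(rows, control_cols, id_col, source_label, strict=False):
    """Log a warning about null recodes, or raise ValueError when strict=True."""
    gaps = find_null_recodes(rows, control_cols, id_col)
    if not gaps:
        return
    message = f"{source_label}: null recodes found: {gaps}"
    if strict:
        raise ValueError(message)
    logger.warning(message)


# Hypothetical person-level table; "age_group" is an illustrative control column.
persons = [
    {"person_id": 1, "age_group": "18_34"},
    {"person_id": 2, "age_group": None},  # gap in the recode mapping
    {"person_id": 3, "age_group": "65_plus"},
]
null_gaps = find_null_recodes(persons, ["age_group"], "person_id")
```

With this data, a non-strict call only logs a warning for person 2, while the same call with strict=True raises ValueError, mirroring the PUMS versus survey distinction in the strict parameter.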

check_incidence_sums

check_incidence_sums(
    seed: pl.DataFrame,
    targets: list[str],
    *,
    source_label: str,
    tolerance: float = 0.0
) -> None

Raise on incidence-column mismatches vs structural totals.

For person-level controls, sum({ctrl}__*) must equal p_total. For household-level controls, sum({ctrl}__*) must equal h_total (always 1). Both overcounts and undercounts are checked.

After fractional imputation, values may be non-integer — set tolerance to a small float (e.g. 0.01) to allow for floating-point drift.

Parameters:

  • seed (pl.DataFrame, required): Household-level incidence table with structural columns (p_total, h_total) and {ctrl}__{member} columns.
  • targets (list[str], required): Control registry names.
  • source_label (str, required): Label for log messages.
  • tolerance (float, default 0.0): Maximum acceptable absolute deviation from the expected sum. Use 0.0 for exact integer match (PUMS), 0.01 for post-imputation survey data.

Raises:

  • ValueError: If any household has an incidence-sum mismatch.
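The per-household comparison and the effect of tolerance can be sketched as follows. This is an illustration under stated assumptions, not the package's code: rows are plain dicts rather than a pl.DataFrame, the helper incidence_mismatches is hypothetical, the age control with child and adult members is invented for the example, and the presence of an hh_id column in the seed table is assumed.

```python
def incidence_mismatches(rows, control, expected_col, tolerance=0.0, id_col="hh_id"):
    """Return (id, member sum, expected total) for each mismatching household."""
    prefix = f"{control}__"
    bad = []
    for row in rows:
        # Sum every {control}__{member} column for this household.
        total = sum(value for key, value in row.items() if key.startswith(prefix))
        # An absolute deviation catches both overcounts and undercounts.
        if abs(total - row[expected_col]) > tolerance:
            bad.append((row[id_col], total, row[expected_col]))
    return bad


# Hypothetical seed table with one person-level control, "age".
seed = [
    {"hh_id": 1, "p_total": 3, "h_total": 1, "age__child": 1, "age__adult": 2},
    {"hh_id": 2, "p_total": 2, "h_total": 1, "age__child": 0, "age__adult": 3},
    {"hh_id": 3, "p_total": 2, "h_total": 1, "age__child": 0.5, "age__adult": 1.495},
]

exact = incidence_mismatches(seed, "age", "p_total")
loose = incidence_mismatches(seed, "age", "p_total", tolerance=0.01)
```

Household 2 is an overcount and fails in both runs. Household 3 carries fractional values that sum to roughly 1.995, so it fails the exact check but passes once tolerance is 0.01.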

processing.weighting.validation.weight_checks

Post-balancing weight sanity checks.

Compares survey weighted totals against PUMS-derived control targets and verifies weight consistency across the table hierarchy.

weight_sanity_checks

weight_sanity_checks(
    tables: dict[str, pl.DataFrame],
    control_totals: ControlTotals,
    specs: list[ControlSpec],
    *,
    geo_col: str = "ctrl_geoid"
) -> None

Run weight sanity checks and log a summary report.
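The comparison of weighted totals against control targets might look like the sketch below. It covers only that half of the description and says nothing about the hierarchy consistency check. Everything here is an assumption made for illustration: the helper weighted_total_gaps, the weight and age_group columns, and the use of a plain dict in place of ControlTotals and ControlSpec, whose structure this page does not document.

```python
def weighted_total_gaps(rows, weight_col, category_col, targets):
    """Return {category: (weighted total, target, relative difference)}."""
    totals = {}
    for row in rows:
        category = row[category_col]
        totals[category] = totals.get(category, 0.0) + row[weight_col]

    report = {}
    for category, target in targets.items():
        weighted = totals.get(category, 0.0)
        # The relative difference is undefined for a zero target, so report None.
        relative = (weighted - target) / target if target else None
        report[category] = (weighted, target, relative)
    return report


# Hypothetical weighted person table and control targets.
weighted_persons = [
    {"person_id": 1, "weight": 100.0, "age_group": "adult"},
    {"person_id": 2, "weight": 150.0, "age_group": "adult"},
    {"person_id": 3, "weight": 50.0, "age_group": "child"},
]
control_targets = {"adult": 250.0, "child": 40.0}

gap_report = weighted_total_gaps(
    weighted_persons, "weight", "age_group", control_targets
)
```

In this example the adult category matches its target, while the child category exceeds its target by 25 percent; a summary report would typically surface the second case.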