Validation
processing.weighting.validation
Validation sub-package (checksums, weight checks).
- checksums: recode null detection and incidence-sum mismatch checks (pre-balancing).
- weight_checks: post-balancing sanity checks comparing weighted totals against PUMS-derived control targets.
processing.weighting.validation.checksums
Incidence-sum checksums for the weighting pipeline.
Two complementary checks share a common reporting backend:
- check_recode_nulls: Runs on a recoded DataFrame before aggregation. Detects records where a control column evaluated to null (a gap in the recode mapping). Logs a warning by default; raises with strict=True.
- check_incidence_sums: Runs on the household-level incidence table after aggregation (and after fractional imputation for survey data). For each non-structural control it verifies that member-column sums match the expected structural total (p_total for person controls, h_total for household controls). Mismatches always raise ValueError.
check_recode_nulls
check_recode_nulls(
    df: pl.DataFrame,
    targets: list[str],
    *,
    level: ControlLevel,
    id_col: str,
    source_label: str,
    strict: bool = False
) -> None
Check for records whose control recode evaluated to null.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| df | pl.DataFrame | Recoded table (households or persons). | required |
| targets | list[str] | Control registry names to check. | required |
| level | ControlLevel | Which level of controls to check. | required |
| id_col | str | Record identifier column. | required |
| source_label | str | Human-readable label for log messages. | required |
| strict | bool | If True, raise instead of logging a warning. | False |
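To make the recode-null check concrete, here is a minimal sketch of the idea in plain Python, using a list of dicts in place of a polars DataFrame. The helper `find_recode_nulls`, the sample column names, and the choice of `ValueError` for strict mode are all assumptions made for illustration; they are not taken from `processing.weighting.validation`. The sketch also takes column names directly rather than resolving control registry names and a `level`.

```python
import logging

logger = logging.getLogger("validation_sketch")


def find_recode_nulls(rows, control_cols, *, id_col, source_label, strict=False):
    """Return {control column: [ids of records whose recode is None]}.

    Hypothetical stand-in for a recode-null check: warn by default,
    raise (here: ValueError, an assumption) when strict=True.
    A missing key is treated the same as an explicit None.
    """
    nulls = {}
    for col in control_cols:
        bad_ids = [row[id_col] for row in rows if row.get(col) is None]
        if bad_ids:
            nulls[col] = bad_ids

    if nulls:
        summary = ", ".join(f"{col}: {len(ids)} null(s)" for col, ids in nulls.items())
        message = f"[{source_label}] recode gaps found: {summary}"
        if strict:
            raise ValueError(message)
        logger.warning(message)
    return nulls


# Illustrative person records; "age_group" has a recode gap for person 3.
persons = [
    {"person_id": 1, "age_group": "0_17", "worker": "yes"},
    {"person_id": 2, "age_group": "18_64", "worker": "no"},
    {"person_id": 3, "age_group": None, "worker": "no"},
]

result = find_recode_nulls(
    persons, ["age_group", "worker"], id_col="person_id", source_label="survey"
)
```

Returning the offending identifiers, rather than only a count, makes it easier to trace a gap back to the recode mapping that produced it.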
check_incidence_sums
check_incidence_sums(
    seed: pl.DataFrame,
    targets: list[str],
    *,
    source_label: str,
    tolerance: float = 0.0
) -> None
Raise on incidence-column mismatches vs structural totals.
For person-level controls, sum({ctrl}__*) must equal p_total.
For household-level controls, sum({ctrl}__*) must equal h_total
(always 1). Both overcounts and undercounts are checked.
After fractional imputation, values may be non-integer — set
tolerance to a small float (e.g. 0.01) to allow for
floating-point drift.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| seed | pl.DataFrame | Household-level incidence table with structural columns. | required |
| targets | list[str] | Control registry names. | required |
| source_label | str | Label for log messages. | required |
| tolerance | float | Maximum acceptable absolute deviation from the expected sum. Use a small value (e.g. 0.01) after fractional imputation. | 0.0 |
Raises:
| Type | Description |
|---|---|
| ValueError | If any household has an incidence-sum mismatch. |
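The incidence-sum rule described above can be sketched in plain Python as follows. This is an illustration of the arithmetic only, not the module's implementation: the helper `incidence_mismatches`, the `hh_id` column, and the `age__*` column names are hypothetical, and the sketch returns the offending household ids where the documented function raises `ValueError`.

```python
def incidence_mismatches(rows, control, *, total_col, id_col="hh_id", tolerance=0.0):
    """Return ids of households whose `{control}__*` columns do not sum to total_col.

    Both overcounts and undercounts are flagged, because the comparison
    uses the absolute deviation from the structural total.
    """
    prefix = f"{control}__"
    bad = []
    for row in rows:
        member_sum = sum(v for k, v in row.items() if k.startswith(prefix))
        if abs(member_sum - row[total_col]) > tolerance:
            bad.append(row[id_col])
    return bad


seed = [
    # Household 1: two persons, both classified, so the sums match.
    {"hh_id": 1, "p_total": 2, "h_total": 1, "age__0_17": 1, "age__18_plus": 1},
    # Household 2: three persons but only two counted (an undercount).
    {"hh_id": 2, "p_total": 3, "h_total": 1, "age__0_17": 0, "age__18_plus": 2},
    # Household 3: fractional imputation leaves a small drift from 1.
    {"hh_id": 3, "p_total": 1, "h_total": 1, "age__0_17": 0.333, "age__18_plus": 0.666},
]

# With the default tolerance of 0.0, the fractional household is flagged too.
exact_bad = incidence_mismatches(seed, "age", total_col="p_total")

# A small tolerance accepts the drift but still catches the real undercount.
tolerant_bad = incidence_mismatches(seed, "age", total_col="p_total", tolerance=0.01)
```

For a household-level control the same helper would be called with `total_col="h_total"`, so each household's member columns must sum to 1.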
processing.weighting.validation.weight_checks
Post-balancing weight sanity checks.
Compares survey weighted totals against PUMS-derived control targets and verifies weight consistency across the table hierarchy.
weight_sanity_checks
weight_sanity_checks(
    tables: dict[str, pl.DataFrame],
    control_totals: ControlTotals,
    specs: list[ControlSpec],
    *,
    geo_col: str = "ctrl_geoid"
) -> None
Run weight sanity checks and log a summary report.
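As a rough picture of what comparing weighted totals against control targets can look like, the sketch below sums household weights per geography and reports the relative gap to a target. It is a simplified assumption, not the behaviour of `weight_sanity_checks`: the structure of `ControlTotals` and `ControlSpec` is not shown in this page, so plain dicts stand in for them, and the helper `compare_weighted_totals` and the `weight` column name are hypothetical. Only the `ctrl_geoid` default comes from the signature above.

```python
def compare_weighted_totals(households, targets, *, weight_col="weight", geo_col="ctrl_geoid"):
    """Sum weights per geography and report the relative gap to each target."""
    weighted = {}
    for hh in households:
        weighted[hh[geo_col]] = weighted.get(hh[geo_col], 0.0) + hh[weight_col]

    report = {}
    for geo, target in targets.items():
        total = weighted.get(geo, 0.0)
        # Relative difference is undefined for a zero target.
        rel_diff = (total - target) / target if target else float("nan")
        report[geo] = {"weighted": total, "target": target, "rel_diff": rel_diff}
    return report


# Illustrative weighted households and per-geography household targets.
households = [
    {"ctrl_geoid": "A", "weight": 120.0},
    {"ctrl_geoid": "A", "weight": 80.0},
    {"ctrl_geoid": "B", "weight": 95.0},
]
household_targets = {"A": 200.0, "B": 100.0}

report = compare_weighted_totals(households, household_targets)
```

A summary report could then log each geography's `rel_diff` and highlight those beyond a chosen threshold; what threshold is appropriate depends on the balancing setup and is not specified here.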