Balancing
processing.weighting.balancing
Balancing sub-package (balancer, base weights, weight propagation).
Orchestrates the core balancing loop:
- Base weights (`base_weights`) -- compute initial expansion factors per zone: `target_hh_pop / n_responses`.
- Importance (`importance`) -- derive MOE-based per-control importance from PUMS replicate weights, with explicit YAML overrides.
- Balancer (`balancer`) -- maximum-entropy list balancing via PopulationSim's `np_balancer_numba`. Runs independently per geography zone.
- Weight propagation (`weight_propagation`) -- carry final household weights down through the canonical table hierarchy.
processing.weighting.balancing.base_weights
Initial (base) expansion weights for the balancer.
Before balancing, each survey household needs a
starting weight that reflects its basic expansion factor — the number of
real-world households it represents. Without meaningful initial weights,
the Newton-Raphson solver starts from 1.0 and must bridge the gap to
expansion factors of hundreds or thousands, quickly slamming into the
expansion-factor constraints.
This module provides two paths:
- Response inversion (default) -- `target_hh_pop / n_responses` per zone. Works whenever PUMS-derived control totals are available (always, in our pipeline).
- Sample plan -- a `SamplePlan` object mapping Census block groups to sampling strata (segments). Block-group populations are sourced from the crosswalk (Census block population summed to BG level). Each household is assigned a block group (via spatial join), mapped to a segment via the plan, and receives a segment-level initial weight: `segment_pop / segment_responses`.
The public entry point is `compute_base_weights`, which adds a `base_weight` column to the seed table.
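The response-inversion path can be illustrated with a minimal plain-Python sketch. This is not the module's Polars implementation; the function name `response_inversion_weights` and its arguments are hypothetical and exist only to show the arithmetic, including the zero-response error case.

```python
from collections import Counter


def response_inversion_weights(
    household_zones: list[str], target_hh_pop: dict[str, float]
) -> list[float]:
    """Base weight per household = zone household target / zone response count."""
    n_responses = Counter(household_zones)
    # A zone with a target but no responding households cannot be expanded.
    empty = sorted(z for z in target_hh_pop if n_responses[z] == 0)
    if empty:
        raise ValueError(f"zones with zero survey responses: {empty}")
    return [target_hh_pop[z] / n_responses[z] for z in household_zones]


# Hypothetical data: four responses in zone A, two in zone B.
zones = ["A", "A", "A", "A", "B", "B"]
targets = {"A": 1000.0, "B": 300.0}
weights = response_inversion_weights(zones, targets)
print(weights)  # [250.0, 250.0, 250.0, 250.0, 150.0, 150.0]
```

Within each zone the weights sum back to the zone's household target, which is what gives the solver a sensible starting point.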
SamplePlan
dataclass
Stratified sampling plan mapping Census block groups to segments.
Each row represents a Census block group. Block groups that share
the same sample_segment are treated as a single stratum for
initial-weight computation:
base_weight = segment_bg_pop / segment_n_responses.
Block-group population totals are sourced from the crosswalk
([PumaCrosswalk.block_group_populations][processing.weighting.data_prep.crosswalk.PumaCrosswalk.block_group_populations]), not from this table.
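As a rough illustration of the segment-level formula, the sketch below pools block-group populations and responses by segment. It uses plain dictionaries rather than the `SamplePlan` dataclass, and the function name, argument names, and sample data are all hypothetical.

```python
from collections import defaultdict


def segment_base_weights(
    hh_block_groups: list[str],
    bg_to_segment: dict[str, str],
    bg_population: dict[str, float],
) -> list[float]:
    """base_weight = segment population / segment responses, per household."""
    # Pool block-group populations into their segment (stratum).
    segment_pop: dict[str, float] = defaultdict(float)
    for bg, segment in bg_to_segment.items():
        segment_pop[segment] += bg_population[bg]
    # Count responding households per segment.
    segment_n: dict[str, int] = defaultdict(int)
    for bg in hh_block_groups:
        segment_n[bg_to_segment[bg]] += 1
    return [
        segment_pop[bg_to_segment[bg]] / segment_n[bg_to_segment[bg]]
        for bg in hh_block_groups
    ]


bg_to_segment = {"bg1": "core", "bg2": "core", "bg3": "rural"}
bg_population = {"bg1": 400.0, "bg2": 600.0, "bg3": 500.0}
hh_block_groups = ["bg1", "bg1", "bg2", "bg2", "bg3"]
base_weights = segment_base_weights(hh_block_groups, bg_to_segment, bg_population)
print(base_weights)  # [250.0, 250.0, 250.0, 250.0, 500.0]
```

Households in `bg1` and `bg2` receive the same weight because both block groups belong to the `core` segment.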
Attributes:
| Name | Type | Description |
|---|---|---|
| `strata` | `pl.DataFrame` | Required columns: |
load_sample_plan
load_sample_plan(path: str | Path) -> SamplePlan
Read a sample-plan CSV into a SamplePlan.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `path` | `str \| Path` | Path to a CSV file. Must contain at minimum | required |
Returns:
| Type | Description |
|---|---|
| `SamplePlan` | |
Raises:
| Type | Description |
|---|---|
| `FileNotFoundError` | If path does not exist. |
| `ValueError` | If required columns are missing (raised by |
compute_base_weights
compute_base_weights(
seed: pl.DataFrame,
control_totals: ControlTotals,
targets: list[str],
geo_col: str = "ctrl_geoid",
*,
sample_plan: str | Path | SamplePlan | None = None,
bg_populations: pl.DataFrame | None = None
) -> pl.DataFrame
Add base_weight column to the seed table.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `seed` | `pl.DataFrame` | Household seed table from | required |
| `control_totals` | `ControlTotals` | PUMS-derived targets from | required |
| `targets` | `list[str]` | Control registry names (used to identify the master HH control). | required |
| `geo_col` | `str` | Geography column on seed. | `'ctrl_geoid'` |
| `sample_plan` | `str \| Path \| SamplePlan \| None` | Optional sample plan. Accepts a file path (loaded via | `None` |
| `bg_populations` | `pl.DataFrame \| None` | Census block-group population totals with columns | `None` |
Returns:
| Type | Description |
|---|---|
| `pl.DataFrame` | seed with an additional `base_weight` column. |
Raises:
| Type | Description |
|---|---|
| `ValueError` | If no household-level control is found in targets, or if a zone has zero survey responses. |
processing.weighting.balancing.importance
MOE-based importance weight calculation.
Uses PUMS successive-difference replicate weights (WGTP1-80 / PWGTP1-80) to estimate the Margin of Error (MOE) for each weighted control total, then converts to importance weights inversely proportional to the coefficient of variation (CV).
Controls with higher sampling uncertainty (larger CV) receive lower importance so the balancer doesn't chase noisy targets. Controls absent from the returned dict (e.g. structural totals whose MOE is meaningless) fall back to the balancer's default importance.
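The sketch below illustrates the idea under stated assumptions. The standard error uses the successive-difference replication formula published for PUMS replicate weights, sqrt(4/80 × Σ(x_r − x)²), and the MOE uses the 90% z-score of 1.645. The mapping from CV to importance (`smallest CV / CV`) is a hypothetical scaling chosen for illustration, not necessarily the one `compute_moe_importance` applies, and the control names are invented.

```python
import math

Z_90 = 1.645  # z-score for the 90% confidence level used for Census MOEs


def replicate_se(estimate: float, replicate_estimates: list[float]) -> float:
    """Successive-difference replication SE: sqrt(4/R * sum((x_r - x)^2))."""
    n_reps = len(replicate_estimates)
    squared_diffs = sum((rep - estimate) ** 2 for rep in replicate_estimates)
    return math.sqrt(4.0 * squared_diffs / n_reps)


def cv_based_importance(
    controls: dict[str, tuple[float, list[float]]],
) -> dict[str, float]:
    """Hypothetical scaling: importance = smallest CV / this control's CV."""
    cvs = {}
    for name, (estimate, reps) in controls.items():
        if estimate <= 0:
            continue  # no PUMS records: omit so a default importance can apply
        cvs[name] = replicate_se(estimate, reps) / estimate
    positive = [cv for cv in cvs.values() if cv > 0]
    if not positive:
        return {}
    min_cv = min(positive)
    return {name: (min_cv / cv if cv > 0 else 1.0) for name, cv in cvs.items()}


# Invented totals: the full-sample estimate plus 80 replicate estimates each.
controls = {
    "hh_size_1": (1000.0, [1010.0] * 40 + [990.0] * 40),
    "hh_income_low": (200.0, [220.0] * 40 + [180.0] * 40),
    "empty_control": (0.0, [0.0] * 80),
}
se = replicate_se(*controls["hh_size_1"])
moe = Z_90 * se
importance = cv_based_importance(controls)
```

Here `hh_income_low` has ten times the CV of `hh_size_1`, so it receives roughly one tenth of the importance, and `empty_control` is left out of the result entirely.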
compute_moe_importance
compute_moe_importance(
hh_df: pl.DataFrame,
person_df: pl.DataFrame,
target_names: list[str],
*,
geo_col: str = "ctrl_geoid"
) -> dict[str, float]
Compute per-control importance weights from PUMS replicate-weight MOE.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `hh_df` | `pl.DataFrame` | Crosswalk-allocated PUMS households with | required |
| `person_df` | `pl.DataFrame` | Crosswalk-allocated PUMS persons with | required |
| `target_names` | `list[str]` | Control registry names to compute importance for. | required |
| `geo_col` | `str` | Geography column (default `"ctrl_geoid"`). | `'ctrl_geoid'` |
Returns:
| Type | Description |
|---|---|
| `dict[str, float]` | Mapping of control name to importance weight for each control that could be computed. Controls with no PUMS records (data sparsity) are omitted; the balancer will apply its default importance. |
Raises:
| Type | Description |
|---|---|
| `ValueError` | If a target name is not in the control registry, or if required replicate weight / control columns are missing from the data. |
processing.weighting.balancing.balancer
Maximum-entropy list balancer.
Thin Polars→numpy bridge around PopulationSim's np_balancer_numba.
Runs independently per geography zone.
Algorithm
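Find the weight vector w closest to the seed weights w₀ (in KL-divergence) subject to marginal constraints:

$$
\min_{w} \; \sum_{i} w_i \ln \frac{w_i}{w_{0,i}} \quad \text{subject to} \quad A^{\top} w = t
$$

where A is the incidence matrix (one row per household, one column per control) and t is the target totals vector.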
Implementation
Calls populationsim.balancing.balancers_numba.np_balancer_numba
directly — a pure @njit function (~120 lines) taking numpy arrays.
No PopulationSim pipeline infrastructure involved. Zones are
independent and parallelisable via ThreadPoolExecutor.
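To make the per-zone structure concrete, here is a simplified stand-in written in plain Python. It is not `np_balancer_numba`: it is a basic raking loop that assumes 0/1 incidence values and ignores importance weights and expansion-factor bounds. The function names and the reading of `convergence_threshold` as a relative error are assumptions. Zones are dispatched through `ThreadPoolExecutor` to mirror the independence described above.

```python
from concurrent.futures import ThreadPoolExecutor


def control_total(weights, incidence, c):
    """Weighted total of control column c."""
    return sum(w * row[c] for w, row in zip(weights, incidence))


def rake_zone(seed_weights, incidence, targets, max_iterations=1000, threshold=0.001):
    """Scale weights control by control until all totals are near their targets."""
    weights = list(seed_weights)
    for _ in range(max_iterations):
        for c, target in enumerate(targets):
            total = control_total(weights, incidence, c)
            if total > 0:
                factor = target / total
                # Only households that belong to control c are rescaled.
                weights = [
                    w * factor if row[c] else w
                    for w, row in zip(weights, incidence)
                ]
        converged = all(
            abs(control_total(weights, incidence, c) - t) <= threshold * t
            for c, t in enumerate(targets)
        )
        if converged:
            return weights, True
    return weights, False


# Hypothetical zones: (seed weights, incidence rows, target totals).
zones = {
    "A": ([10.0, 10.0, 10.0], [[1, 1], [1, 0], [1, 1]], [60.0, 30.0]),
    "B": ([5.0, 15.0], [[1], [1]], [40.0]),
}
with ThreadPoolExecutor() as pool:
    futures = {zone: pool.submit(rake_zone, *args) for zone, args in zones.items()}
    results = {zone: future.result() for zone, future in futures.items()}
```

Each zone's result is a `(weights, converged)` pair, loosely analogous to the weights plus per-zone status that the real balancer reports.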
Configuration (YAML)
max_iterations: 1000
convergence_threshold: 0.001
max_expansion_factor: 10 # upper bound = initial_weight x factor
min_expansion_factor: 0.1 # lower bound = initial_weight x factor
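The bound comments in the configuration translate into simple arithmetic. The sketch below, with a hypothetical helper name, shows why a meaningful initial weight matters: starting from 1.0, the default factors cap the final weight at 10, far below an expansion factor in the hundreds.

```python
def weight_bounds(
    initial_weight: float,
    min_expansion_factor: float = 0.1,
    max_expansion_factor: float = 10.0,
) -> tuple[float, float]:
    """Lower and upper bounds = initial_weight x the configured factors."""
    return (
        initial_weight * min_expansion_factor,
        initial_weight * max_expansion_factor,
    )


required_weight = 250.0  # hypothetical expansion factor the zone needs

# Starting from a base weight near the required value leaves room to adjust.
low, high = weight_bounds(250.0)
reachable_from_base = low <= required_weight <= high

# Starting from 1.0 makes the required weight unreachable.
low_unit, high_unit = weight_bounds(1.0)
reachable_from_unit = low_unit <= required_weight <= high_unit
```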
ZoneStatus
Per-zone convergence diagnostics.
balance_weights
balance_weights(
seed: pl.DataFrame,
control_totals: ControlTotals,
targets: list[str],
balancing: BalancingConfig | None = None,
importance: ImportanceConfig | None = None,
*,
verbose: bool = True
) -> tuple[pl.DataFrame, list[ZoneStatus]]
Balance household weights to match control totals per zone.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `seed` | `pl.DataFrame` | Incidence table with | required |
| `control_totals` | `ControlTotals` | Per-zone targets (with merges already applied). | required |
| `targets` | `list[str]` | Control registry names. | required |
| `balancing` | `BalancingConfig \| None` | Solver bounds, iteration limits, and parallelism (defaults apply). | `None` |
| `importance` | `ImportanceConfig \| None` | Per-control importance weights (defaults apply). | `None` |
| `verbose` | `bool` | Log per-zone convergence (default `True`). | `True` |
Returns:
| Type | Description |
|---|---|
| `tuple[pl.DataFrame, list[ZoneStatus]]` | Tuple of the weighted DataFrame and statuses, where statuses is one `ZoneStatus` entry per zone with convergence info. |
processing.weighting.balancing.weight_propagation
Shared weight hierarchy constants and propagation helpers.
Used by both weighting and existing_weights (pre-computed)
steps to propagate household weights down through the canonical table hierarchy.
Weight derivation
| Table | Weight Column | Derivation |
|---|---|---|
| households | hh_weight | Direct from balancer |
| persons | person_weight | Carry forward hh_weight via hh_id |
| days | day_weight | Carry forward person_weight via person_id |
| unlinked | unlinked_trip_weight | Carry forward day_weight via day_id |
| linked | linked_trip_weight | Mean of constituent unlinked_trip_weight |
| tours | tour_weight | Mean of constituent linked_trip_weight |
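The two derivation patterns in the table, carry-forward via a key and mean of constituents, can be sketched with plain dictionaries. The helper names below are hypothetical and the real module operates on Polars DataFrames; this only shows the logic.

```python
from collections import defaultdict
from statistics import mean


def carry_forward(parent_weights, child_rows, key, weight_col):
    """Copy each parent's weight onto its children via the shared key."""
    return [{**row, weight_col: parent_weights[row[key]]} for row in child_rows]


def aggregate_mean(child_rows, key, child_weight_col):
    """Weight of each aggregate record = mean of its constituents' weights."""
    groups = defaultdict(list)
    for row in child_rows:
        groups[row[key]].append(row[child_weight_col])
    return {group_id: mean(values) for group_id, values in groups.items()}


# households -> persons: carry hh_weight forward via hh_id.
hh_weight = {"h1": 120.0, "h2": 80.0}
persons = [
    {"person_id": "p1", "hh_id": "h1"},
    {"person_id": "p2", "hh_id": "h1"},
    {"person_id": "p3", "hh_id": "h2"},
]
persons = carry_forward(hh_weight, persons, "hh_id", "person_weight")

# unlinked -> linked: mean of the constituent unlinked trip weights.
unlinked = [
    {"linked_trip_id": "lt1", "unlinked_trip_weight": 120.0},
    {"linked_trip_id": "lt1", "unlinked_trip_weight": 120.0},
    {"linked_trip_id": "lt2", "unlinked_trip_weight": 80.0},
]
linked_trip_weight = aggregate_mean(unlinked, "linked_trip_id", "unlinked_trip_weight")
```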
Checksums (logged as warnings if violated)
- sum(person_weight) ≈ sum(hh_weight x persons_per_hh)
- sum(day_weight) ≈ sum(person_weight x complete_travel_days)
- sum(unlinked_trip_weight) ≈ sum(day_weight x trips_per_day)
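A checksum of this kind might look like the following sketch. The helper name, logger name, and relative tolerance are assumptions; the source only states that violations are logged as warnings.

```python
import logging
import math

logger = logging.getLogger("weight_checksums")


def checksum_ok(label: str, actual: float, expected: float, rel_tol: float = 1e-6) -> bool:
    """Return True when the totals agree; otherwise log a warning and return False."""
    ok = math.isclose(actual, expected, rel_tol=rel_tol)
    if not ok:
        logger.warning("%s checksum violated: %.3f != %.3f", label, actual, expected)
    return ok


# Hypothetical tables: two households and their three persons.
households = [
    {"hh_weight": 120.0, "persons_per_hh": 2},
    {"hh_weight": 80.0, "persons_per_hh": 1},
]
person_weights = [120.0, 120.0, 80.0]

expected = sum(hh["hh_weight"] * hh["persons_per_hh"] for hh in households)
persons_ok = checksum_ok("person_weight", sum(person_weights), expected)
```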
Completion flag
If a table has a complete boolean column, records with
complete == False receive a weight of 0 after the carry-forward
join. This ensures that incomplete records never contribute to
downstream aggregations (the aggregation step already excludes zeros).
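The completion rule can be sketched as a post-join step. The helper below is hypothetical and works on lists of dictionaries; rows without a `complete` key are left untouched, matching the "if a table has a complete boolean column" condition.

```python
def zero_incomplete(rows: list[dict], weight_col: str) -> list[dict]:
    """Set the weight to 0 for rows whose `complete` flag is False."""
    return [
        {**row, weight_col: 0.0} if row.get("complete") is False else row
        for row in rows
    ]


# Hypothetical days table after carrying person_weight forward.
days = [
    {"day_id": "d1", "complete": True, "day_weight": 120.0},
    {"day_id": "d2", "complete": False, "day_weight": 120.0},
]
days = zero_incomplete(days, "day_weight")
weighted_days = sum(day["day_weight"] for day in days)
```

Only the complete day contributes to `weighted_days`, so the incomplete record drops out of any weighted aggregation.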
WEIGHT_CONFIG_MAPPING
module-attribute
WEIGHT_CONFIG_MAPPING: dict[str, tuple[str, str, str]] = {
"household_weights": ("households", "hh_id", "hh_weight"),
"person_weights": ("persons", "person_id", "person_weight"),
"day_weights": ("days", "day_id", "day_weight"),
"unlinked_trip_weights": (
"unlinked_trips",
"unlinked_trip_id",
"unlinked_trip_weight",
),
"linked_trip_weights": (
"linked_trips",
"linked_trip_id",
"linked_trip_weight",
),
"joint_trip_weights": (
"joint_trips",
"joint_trip_id",
"joint_trip_weight",
),
"tour_weights": ("tours", "tour_id", "tour_weight"),
}
CARRY_FORWARD
module-attribute
CARRY_FORWARD = [
("households", "persons", "hh_id", "person_weight"),
("persons", "days", "person_id", "day_weight"),
("days", "unlinked_trips", "day_id", "unlinked_trip_weight"),
]
AGGREGATE
module-attribute
AGGREGATE = [
(
"unlinked_trips",
"linked_trips",
"linked_trip_id",
"linked_trip_weight",
),
(
"linked_trips",
"joint_trips",
"joint_trip_id",
"joint_trip_weight",
),
("linked_trips", "tours", "tour_id", "tour_weight"),
]
propagate_weights
propagate_weights(
tables: dict[str, pl.DataFrame | None],
has_weight: dict[str, str],
*,
skip: set[str] | None = None
) -> None
Carry forward and aggregate weights through the hierarchy.
Modifies tables and has_weight in place.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `tables` | `dict[str, pl.DataFrame \| None]` | Mutable dict of table_name → DataFrame (or None). | required |
| `has_weight` | `dict[str, str]` | Mutable dict tracking which tables already have a weight column and the column name. E.g. | required |
| `skip` | `set[str] \| None` | Table names to skip (e.g. tables that already have externally provided weights). | `None` |