Balancing

processing.weighting.balancing

Balancing sub-package (balancer, base weights, weight propagation).

Orchestrates the core balancing loop:

1. Base weights (base_weights) -- compute initial expansion factors per zone: target_hh_pop / n_responses.
2. Importance (importance) -- derive MOE-based per-control importance from PUMS replicate weights, with explicit YAML overrides.
3. Balancer (balancer) -- maximum-entropy list balancing via PopulationSim's np_balancer_numba. Runs independently per geography zone.
4. Weight propagation (weight_propagation) -- carry final household weights down through the canonical table hierarchy.

processing.weighting.balancing.base_weights

Initial (base) expansion weights for the balancer.

Before balancing, each survey household needs a starting weight that reflects its basic expansion factor, that is, the number of real-world households it represents. Without meaningful initial weights, the Newton-Raphson solver starts every household at 1.0 and must scale weights up to expansion factors in the hundreds or thousands. Because the expansion-factor bounds are defined relative to the initial weight, the solver hits them long before the controls are matched.

This module provides two paths:

- Response inversion (default): target_hh_pop / n_responses per zone. Works whenever PUMS-derived control totals are available (always, in our pipeline).
- Sample plan: a SamplePlan object mapping Census block groups to sampling strata (segments). Block-group populations are sourced from the crosswalk (Census block population summed to BG level). Each household is assigned a block group (via spatial join), mapped to a segment via the plan, and receives a segment-level initial weight: segment_pop / segment_responses.

The public entry point is compute_base_weights, which adds a base_weight column to the seed table.
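As an illustration of the default response-inversion path, the sketch below computes target_hh_pop / n_responses per zone. It uses plain Python dicts instead of Polars to stay dependency-free, and the helper name `response_inversion_weights` and the sample data are hypothetical, not part of this package.

```python
from collections import Counter


def response_inversion_weights(seed_rows, target_hh_pop):
    """Return {hh_id: base_weight} with base_weight = target_hh_pop / n_responses.

    seed_rows: list of dicts with "hh_id" and "ctrl_geoid" (illustrative layout).
    target_hh_pop: {zone: target household population} (illustrative layout).
    """
    # Count survey responses per zone.
    n_responses = Counter(row["ctrl_geoid"] for row in seed_rows)
    weights = {}
    for row in seed_rows:
        zone = row["ctrl_geoid"]
        # Every household in a zone gets the same expansion factor.
        weights[row["hh_id"]] = target_hh_pop[zone] / n_responses[zone]
    return weights


seed_rows = [
    {"hh_id": 1, "ctrl_geoid": "A"},
    {"hh_id": 2, "ctrl_geoid": "A"},
    {"hh_id": 3, "ctrl_geoid": "B"},
]
target_hh_pop = {"A": 1000.0, "B": 400.0}
base_weights = response_inversion_weights(seed_rows, target_hh_pop)
```

Zone A has two responses for a target of 1000 households, so each responding household stands in for 500; the single household in zone B stands in for 400.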

SamplePlan `dataclass`

Stratified sampling plan mapping Census block groups to segments.

Each row represents a Census block group. Block groups that share the same sample_segment are treated as a single stratum for initial-weight computation: base_weight = segment_bg_pop / segment_n_responses.
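The segment-level formula can be sketched as follows. This is a minimal plain-Python illustration, not the dataclass's actual implementation; the function name `segment_base_weights`, the block-group IDs, and the populations are made up for the example.

```python
from collections import defaultdict


def segment_base_weights(strata, bg_population, household_bg):
    """Return {hh_id: base_weight} with base_weight = segment_bg_pop / segment_n_responses.

    strata: {bg_geo_id: sample_segment}
    bg_population: {bg_geo_id: population}
    household_bg: {hh_id: bg_geo_id}
    """
    # Sum block-group populations up to the segment level.
    segment_pop = defaultdict(float)
    for bg, segment in strata.items():
        segment_pop[segment] += bg_population[bg]

    # Count responding households per segment.
    segment_n = defaultdict(int)
    for bg in household_bg.values():
        segment_n[strata[bg]] += 1

    return {
        hh: segment_pop[strata[bg]] / segment_n[strata[bg]]
        for hh, bg in household_bg.items()
    }


strata = {
    "060014001001": "urban",
    "060014001002": "urban",
    "060014002001": "rural",
}
bg_population = {"060014001001": 600.0, "060014001002": 400.0, "060014002001": 300.0}
household_bg = {1: "060014001001", 2: "060014001002", 3: "060014002001"}
plan_weights = segment_base_weights(strata, bg_population, household_bg)
```

Households 1 and 2 live in different block groups but share the "urban" segment, so both receive 1000 / 2 = 500.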

Block-group population totals are sourced from the crosswalk ([PumaCrosswalk.block_group_populations][processing.weighting.data_prep.crosswalk.PumaCrosswalk.block_group_populations]), not from this table.

Attributes:

Name	Type	Description
`strata`	`pl.DataFrame`	Required columns: `bg_geo_id` (str) — 12-character Census block-group FIPS code. `sample_segment` (str) — sampling-stratum label. All block groups sharing a segment get the same base weight.

load_sample_plan

load_sample_plan(path: str | Path) -> SamplePlan

Read a sample-plan CSV into a SamplePlan.

Parameters:

Name	Type	Description	Default
`path`	`str \| Path`	Path to a CSV file. Must contain at minimum `bg_geo_id` and `sample_segment`. Block-group population totals are sourced from the crosswalk, not from the CSV.	required

Returns:

Type	Description
`SamplePlan`	The loaded sample plan.

Raises:

Type	Description
`FileNotFoundError`	If path does not exist.
`ValueError`	If required columns are missing (raised by `SamplePlan`).

compute_base_weights

compute_base_weights(
    seed: pl.DataFrame,
    control_totals: ControlTotals,
    targets: list[str],
    geo_col: str = "ctrl_geoid",
    *,
    sample_plan: str | Path | SamplePlan | None = None,
    bg_populations: pl.DataFrame | None = None
) -> pl.DataFrame

Add base_weight column to the seed table.

Parameters:

Name	Type	Description	Default
`seed`	`pl.DataFrame`	Household seed table from `build_seed_table`. Must contain `hh_id` and geo_col. When using a sample plan, must also contain `bg_geo_id` (assigned via [`PumaCrosswalk.assign_block_groups`][processing.weighting.data_prep.crosswalk.PumaCrosswalk.assign_block_groups]).	required
`control_totals`	`ControlTotals`	PUMS-derived targets from `build_control_totals`.	required
`targets`	`list[str]`	Control registry names (used to identify the master HH control).	required
`geo_col`	`str`	Geography column on seed.	`'ctrl_geoid'`
`sample_plan`	`str \| Path \| SamplePlan \| None`	Optional sample plan. Accepts a file path (loaded via `load_sample_plan`) or an already-loaded `SamplePlan`. When `None`, default response inversion is used.	`None`
`bg_populations`	`pl.DataFrame \| None`	Census block-group population totals with columns `[bg_geo_id, bg_population]`. Required when sample_plan is provided; sourced from [`PumaCrosswalk.block_group_populations`][processing.weighting.data_prep.crosswalk.PumaCrosswalk.block_group_populations].	`None`

Returns:

Type	Description
`pl.DataFrame`	seed with an additional `base_weight` column (Float64).

Raises:

Type	Description
`ValueError`	If no household-level control is found in targets, or if a zone has zero survey responses.

processing.weighting.balancing.importance

MOE-based importance weight calculation.

Uses PUMS successive-difference replicate weights (WGTP1-80 / PWGTP1-80) to estimate the Margin of Error (MOE) for each weighted control total, then converts to importance weights inversely proportional to the coefficient of variation (CV).

Controls with higher sampling uncertainty (larger CV) receive lower importance so the balancer doesn't chase noisy targets. Controls absent from the returned dict (e.g. structural totals whose MOE is meaningless) fall back to the balancer's default importance.
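The sketch below shows the successive-difference replication arithmetic for one control total: the standard error is sqrt(4/80 × Σ(replicate − estimate)²), the 90 percent MOE is 1.645 × SE, and CV is SE / estimate. The final mapping from CV to importance (`relative_importance`, smallest CV divided by each CV) is a hypothetical choice made for illustration; the module's actual mapping is not documented here, and the control names are invented.

```python
import math

Z_90 = 1.645  # multiplier for a 90 percent margin of error


def replicate_standard_error(estimate, replicate_estimates):
    """Standard error from 80 successive-difference replicate estimates."""
    if len(replicate_estimates) != 80:
        raise ValueError("expected 80 replicate estimates")
    squared_diffs = sum((rep - estimate) ** 2 for rep in replicate_estimates)
    return math.sqrt(4.0 / 80.0 * squared_diffs)


def coefficient_of_variation(estimate, replicate_estimates):
    """CV = standard error / estimate (estimate must be non-zero)."""
    return replicate_standard_error(estimate, replicate_estimates) / estimate


def relative_importance(cvs):
    """Hypothetical mapping: importance = smallest CV / this CV (CVs must be > 0)."""
    best = min(cvs.values())
    return {name: best / cv for name, cv in cvs.items()}


# Illustrative weighted totals and their 80 replicate totals.
controls = {
    "hh_size_1": (1000.0, [1010.0] * 40 + [990.0] * 40),
    "hh_income_low": (100.0, [110.0] * 40 + [90.0] * 40),
}
moe = {k: Z_90 * replicate_standard_error(est, reps) for k, (est, reps) in controls.items()}
cvs = {k: coefficient_of_variation(est, reps) for k, (est, reps) in controls.items()}
importance = relative_importance(cvs)
```

Both controls have the same standard error (20), but the smaller total has ten times the CV and so ends up with one tenth of the importance under this mapping.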

compute_moe_importance

compute_moe_importance(
    hh_df: pl.DataFrame,
    person_df: pl.DataFrame,
    target_names: list[str],
    *,
    geo_col: str = "ctrl_geoid"
) -> dict[str, float]

Compute per-control importance weights from PUMS replicate-weight MOE.

Parameters:

Name	Type	Description	Default
`hh_df`	`pl.DataFrame`	Crosswalk-allocated PUMS households with `_xw_WGTP` and `_xw_WGTP1` … `_xw_WGTP80` columns (plus recoded controls).	required
`person_df`	`pl.DataFrame`	Crosswalk-allocated PUMS persons with `_xw_PWGTP` and `_xw_PWGTP1` … `_xw_PWGTP80` columns (plus recoded controls).	required
`target_names`	`list[str]`	Control registry names to compute importance for.	required
`geo_col`	`str`	Geography column (default `"ctrl_geoid"`).	`'ctrl_geoid'`

Returns:

Type	Description
`dict[str, float]`	`{control_name: importance}` for controls where MOE could be computed. Controls with no PUMS records (data sparsity) are omitted; the balancer will apply its default importance.

Raises:

Type	Description
`ValueError`	If a target name is not in the control registry, or if required replicate weight / control columns are missing from the data.

processing.weighting.balancing.balancer

Maximum-entropy list balancer.

Thin Polars→numpy bridge around PopulationSim's np_balancer_numba. Runs independently per geography zone.

Algorithm

Find weight vector w closest to seed weights w₀ (KL-divergence) subject to marginal constraints:

\[ \min \sum_i w_i \ln(w_i / w_{0i}) \quad \text{s.t.} \quad Aw = t,\; w_i \ge 0 \]

where A is the incidence matrix and t is the target totals vector.
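To make the optimisation concrete, here is a minimal sketch that solves a tiny instance with classic iterative proportional fitting on a 0/1 incidence matrix. It is not np_balancer_numba: it ignores importance weights and expansion-factor bounds, and the function name `balance_zone_ipf` is hypothetical. The incidence is stored one row per household (the transpose of A), and all targets are assumed positive.

```python
def balance_zone_ipf(incidence, targets, seed_weights,
                     max_iterations=1000, convergence_threshold=0.001):
    """Scale seed weights until every control total matches its target.

    incidence: list of per-household rows of 0/1 flags, one flag per control.
    Returns (weights, iterations_used, converged).
    """
    weights = list(seed_weights)

    def control_total(j):
        return sum(w * row[j] for w, row in zip(weights, incidence))

    for iteration in range(1, max_iterations + 1):
        for j, target in enumerate(targets):
            total = control_total(j)
            if total > 0:
                factor = target / total
                # Rescale only the households that belong to control j.
                weights = [w * factor if row[j] else w
                           for w, row in zip(weights, incidence)]
        # Largest relative gap between achieved and target totals.
        max_gap = max(abs(control_total(j) - t) / t for j, t in enumerate(targets))
        if max_gap < convergence_threshold:
            return weights, iteration, True
    return weights, max_iterations, False


# Control 0: all households; control 1: households 0 and 2 only.
incidence = [[1, 1], [1, 0], [1, 1], [1, 0]]
targets = [50.0, 20.0]
seed_weights = [10.0, 10.0, 10.0, 10.0]
weights, n_iter, converged = balance_zone_ipf(
    incidence, targets, seed_weights, convergence_threshold=1e-9
)
```

Each update multiplies weights by a per-control factor, so the result keeps the form w_i = w₀ᵢ · exp(Σⱼ λⱼ Aⱼᵢ), which is the shape of the minimum-KL solution. In this example the weights settle near 10 for households 0 and 2 and near 15 for households 1 and 3.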

Implementation

Calls populationsim.balancing.balancers_numba.np_balancer_numba directly — a pure @njit function (~120 lines) taking numpy arrays. No PopulationSim pipeline infrastructure involved. Zones are independent and parallelisable via ThreadPoolExecutor.
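Because zones are independent, the per-zone fan-out can be sketched with the standard library alone. The `solve_zone` function below is a stand-in that just rescales weights to a single total; it is a hypothetical placeholder for the real per-zone call, and the zone data are invented.

```python
from concurrent.futures import ThreadPoolExecutor


def solve_zone(zone_id, seed_weights, target_total):
    """Stand-in per-zone solver: rescale weights to hit one total."""
    factor = target_total / sum(seed_weights)
    return zone_id, [w * factor for w in seed_weights]


# zone_id -> (seed weights, target total); illustrative data only.
zones = {
    "A": ([10.0, 30.0], 80.0),
    "B": ([5.0, 5.0], 30.0),
}

with ThreadPoolExecutor(max_workers=2) as pool:
    # Submit one task per zone; no zone reads another zone's data.
    futures = [pool.submit(solve_zone, zone_id, w, t)
               for zone_id, (w, t) in zones.items()]
    results = dict(future.result() for future in futures)
```

Collecting results into a dict keyed by zone keeps the output independent of the order in which threads finish.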

Configuration (YAML)

max_iterations: 1000
convergence_threshold: 0.001
max_expansion_factor: 10    # upper bound = initial_weight × factor
min_expansion_factor: 0.1   # lower bound = initial_weight × factor
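The two expansion factors translate into per-household bounds around the initial weight. The sketch below shows that arithmetic and a simple clip; the helper names are hypothetical, and how the real solver enforces the bounds internally is not described in this page.

```python
def weight_bounds(initial_weights, min_expansion_factor=0.1, max_expansion_factor=10):
    """Return (lower, upper) bound lists: initial_weight × factor."""
    lower = [w * min_expansion_factor for w in initial_weights]
    upper = [w * max_expansion_factor for w in initial_weights]
    return lower, upper


def clip_weights(weights, lower, upper):
    """Force each weight into its [lower, upper] interval."""
    return [min(max(w, lo), hi) for w, lo, hi in zip(weights, lower, upper)]


initial = [500.0, 400.0]
lower, upper = weight_bounds(initial)
# A candidate solution that overshoots one bound and undershoots the other.
clipped = clip_weights([6000.0, 30.0], lower, upper)
```

With the defaults above, a household starting at 500 can end anywhere between 50 and 5000, which is why a starting weight of 1.0 would cap the final weight at 10.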

ZoneStatus

Per-zone convergence diagnostics.

balance_weights

balance_weights(
    seed: pl.DataFrame,
    control_totals: ControlTotals,
    targets: list[str],
    balancing: BalancingConfig | None = None,
    importance: ImportanceConfig | None = None,
    *,
    verbose: bool = True
) -> tuple[pl.DataFrame, list[ZoneStatus]]

Balance household weights to match control totals per zone.

Parameters:

Name	Type	Description	Default
`seed`	`pl.DataFrame`	Incidence table with `hh_id`, `ctrl_geoid`, `base_weight`, and pivoted control columns (`{ctrl}__{member}` or structural). All merges (global and zone-specific) must already be applied.	required
`control_totals`	`ControlTotals`	Per-zone targets (with merges already applied).	required
`targets`	`list[str]`	Control registry names.	required
`balancing`	`BalancingConfig \| None`	Solver bounds, iteration limits, and parallelism (defaults apply).	`None`
`importance`	`ImportanceConfig \| None`	Per-control importance weights (defaults apply).	`None`
`verbose`	`bool`	Log per-zone convergence (default `True`).	`True`

Returns:

Type	Description
`tuple[pl.DataFrame, list[ZoneStatus]]`	Tuple of `(weights, statuses)` where weights is a DataFrame with columns `hh_id`, `hh_weight`, `geo_id`, and statuses is one `ZoneStatus` entry per zone with convergence info.

processing.weighting.balancing.weight_propagation

Shared weight hierarchy constants and propagation helpers.

Used by both weighting and existing_weights (pre-computed) steps to propagate household weights down through the canonical table hierarchy.

Weight derivation

Table	Weight Column	Derivation
households	`hh_weight`	Direct from balancer
persons	`person_weight`	Carry forward `hh_weight` via `hh_id`
days	`day_weight`	Carry forward `person_weight` via `person_id`
unlinked_trips	`unlinked_trip_weight`	Carry forward `day_weight` via `day_id`
linked_trips	`linked_trip_weight`	Mean of constituent `unlinked_trip_weight`
tours	`tour_weight`	Mean of constituent `linked_trip_weight`
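The two derivation rules in the table, carry-forward by key and mean of constituents, can be sketched with plain dicts. This is an illustration of the rules only, not the module's Polars implementation, and the IDs and weights are invented.

```python
from collections import defaultdict
from statistics import mean

# Carry forward: each person inherits the weight of their household.
hh_weight = {1: 500.0, 2: 300.0}  # hh_id -> hh_weight
persons = [
    {"person_id": 11, "hh_id": 1},
    {"person_id": 12, "hh_id": 1},
    {"person_id": 21, "hh_id": 2},
]
person_weight = {p["person_id"]: hh_weight[p["hh_id"]] for p in persons}

# Aggregate: a linked trip takes the mean weight of its unlinked trips.
unlinked_trips = [
    {"unlinked_trip_id": 1, "linked_trip_id": 100, "unlinked_trip_weight": 500.0},
    {"unlinked_trip_id": 2, "linked_trip_id": 100, "unlinked_trip_weight": 500.0},
    {"unlinked_trip_id": 3, "linked_trip_id": 101, "unlinked_trip_weight": 300.0},
    {"unlinked_trip_id": 4, "linked_trip_id": 101, "unlinked_trip_weight": 0.0},
]
weights_by_linked = defaultdict(list)
for trip in unlinked_trips:
    weights_by_linked[trip["linked_trip_id"]].append(trip["unlinked_trip_weight"])
linked_trip_weight = {lt: mean(ws) for lt, ws in weights_by_linked.items()}
```

Linked trip 101 mixes a 300 and a 0 weight, so its mean is 150; a zero-weighted constituent therefore pulls the aggregate down rather than being ignored.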

Checksums (logged as warnings if violated)

sum(person_weight) ≈ sum(hh_weight × persons_per_hh)
sum(day_weight) ≈ sum(person_weight × complete_travel_days)
sum(unlinked_trip_weight) ≈ sum(day_weight × trips_per_day)
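A checksum of this kind might look like the sketch below, which compares two totals and logs a warning instead of raising. The function name `check_total`, the logger name, and the relative tolerance of 1e-6 are assumptions for illustration; the tolerance actually used by the module is not stated here.

```python
import logging
import math

logger = logging.getLogger("weight_checks")  # hypothetical logger name


def check_total(label, actual, expected, rel_tol=1e-6):
    """Return True if the totals agree; log a warning otherwise."""
    ok = math.isclose(actual, expected, rel_tol=rel_tol)
    if not ok:
        logger.warning("%s checksum violated: %.3f vs %.3f", label, actual, expected)
    return ok


households = [
    {"hh_weight": 500.0, "persons_per_hh": 2},
    {"hh_weight": 300.0, "persons_per_hh": 1},
]
person_weights = [500.0, 500.0, 300.0]

expected_persons = sum(h["hh_weight"] * h["persons_per_hh"] for h in households)
persons_ok = check_total("person_weight", sum(person_weights), expected_persons)
# A deliberately mismatched pair to show the warning path.
days_ok = check_total("day_weight", 1200.0, 1300.0)
```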

Completion flag

If a table has a complete boolean column, records with complete == False receive a weight of 0 after the carry-forward join. This ensures that incomplete records never contribute to downstream aggregations (the aggregation step already excludes zeros).
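The completion rule amounts to a conditional overwrite after the join. A minimal sketch, assuming rows are plain dicts and that a missing `complete` key means the record is treated as complete (that default is an assumption, as is the helper name `zero_incomplete`):

```python
def zero_incomplete(rows, weight_col):
    """Return copies of rows with the weight set to 0 where complete is False."""
    return [
        {**row, weight_col: row[weight_col] if row.get("complete", True) else 0.0}
        for row in rows
    ]


days = [
    {"day_id": 1, "day_weight": 500.0, "complete": True},
    {"day_id": 2, "day_weight": 500.0, "complete": False},
    {"day_id": 3, "day_weight": 300.0},  # no flag: left unchanged
]
weighted_days = zero_incomplete(days, "day_weight")
```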

WEIGHT_CONFIG_MAPPING `module-attribute`

WEIGHT_CONFIG_MAPPING: dict[str, tuple[str, str, str]] = {
    "household_weights": ("households", "hh_id", "hh_weight"),
    "person_weights": ("persons", "person_id", "person_weight"),
    "day_weights": ("days", "day_id", "day_weight"),
    "unlinked_trip_weights": (
        "unlinked_trips",
        "unlinked_trip_id",
        "unlinked_trip_weight",
    ),
    "linked_trip_weights": (
        "linked_trips",
        "linked_trip_id",
        "linked_trip_weight",
    ),
    "joint_trip_weights": (
        "joint_trips",
        "joint_trip_id",
        "joint_trip_weight",
    ),
    "tour_weights": ("tours", "tour_id", "tour_weight"),
}

CARRY_FORWARD `module-attribute`

CARRY_FORWARD = [
    ("households", "persons", "hh_id", "person_weight"),
    ("persons", "days", "person_id", "day_weight"),
    ("days", "unlinked_trips", "day_id", "unlinked_trip_weight"),
]

AGGREGATE `module-attribute`

AGGREGATE = [
    (
        "unlinked_trips",
        "linked_trips",
        "linked_trip_id",
        "linked_trip_weight",
    ),
    (
        "linked_trips",
        "joint_trips",
        "joint_trip_id",
        "joint_trip_weight",
    ),
    ("linked_trips", "tours", "tour_id", "tour_weight"),
]

propagate_weights

propagate_weights(
    tables: dict[str, pl.DataFrame | None],
    has_weight: dict[str, str],
    *,
    skip: set[str] | None = None
) -> None

Carry forward and aggregate weights through the hierarchy.

Modifies tables and has_weight in place.

Parameters:

Name	Type	Description	Default
`tables`	`dict[str, pl.DataFrame \| None]`	Mutable dict of table_name → DataFrame (or None).	required
`has_weight`	`dict[str, str]`	Mutable dict tracking which tables already have a weight column and the column name. E.g. `{"households": "hh_weight"}`.	required
`skip`	`set[str] \| None`	Table names to skip (e.g. tables that already have externally provided weights).	`None`
