
Balancing

processing.weighting.balancing

Balancing sub-package (balancer, base weights, weight propagation).

Orchestrates the core balancing loop:

  1. Base weights (base_weights) -- compute initial expansion factors per zone: target_hh_pop / n_responses.
  2. Importance (importance) -- derive MOE-based per-control importance from PUMS replicate weights, with explicit YAML overrides.
  3. Balancer (balancer) -- maximum-entropy list balancing via PopulationSim's np_balancer_numba. Runs independently per geography zone.
  4. Weight propagation (weight_propagation) -- carry final household weights down through the canonical table hierarchy.

processing.weighting.balancing.base_weights

Initial (base) expansion weights for the balancer.

Before balancing, each survey household needs a starting weight that reflects its basic expansion factor, that is, the number of real-world households it represents. If every household instead started at 1.0, the Newton-Raphson solver would have to scale weights by factors of hundreds or thousands and would quickly hit the expansion-factor bounds, which are set relative to the initial weight.

This module provides two paths:

  1. Response inversion (default) — target_hh_pop / n_responses per zone. Works whenever PUMS-derived control totals are available (always, in our pipeline).

  2. Sample plan — a SamplePlan object mapping Census block groups to sampling strata (segments). Block-group populations are sourced from the crosswalk (Census block population summed to BG level). Each household is assigned a block group (via spatial join), mapped to a segment via the plan, and receives a segment-level initial weight: segment_pop / segment_responses.

The public entry point is compute_base_weights, which adds a base_weight column to the seed table.
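As a rough illustration of the response-inversion path, the sketch below computes target_hh_pop / n_responses per zone. It uses plain Python dicts instead of Polars to stay dependency-free; the function name `response_inversion` and the sample data are hypothetical, and this is not the module's actual implementation.

```python
from collections import Counter


def response_inversion(seed_rows, target_hh_pop, geo_col="ctrl_geoid"):
    """Return copies of seed_rows with base_weight = target_hh_pop / n_responses."""
    n_responses = Counter(row[geo_col] for row in seed_rows)
    # A zone with a household target but no respondents cannot be expanded.
    empty = sorted(zone for zone in target_hh_pop if n_responses[zone] == 0)
    if empty:
        raise ValueError(f"Zones with zero survey responses: {empty}")
    return [
        {**row, "base_weight": target_hh_pop[row[geo_col]] / n_responses[row[geo_col]]}
        for row in seed_rows
    ]


# Hypothetical seed: two respondents in zone A, one in zone B.
seed = [
    {"hh_id": 1, "ctrl_geoid": "A"},
    {"hh_id": 2, "ctrl_geoid": "A"},
    {"hh_id": 3, "ctrl_geoid": "B"},
]
target_hh_pop = {"A": 500.0, "B": 120.0}

weighted = response_inversion(seed, target_hh_pop)
print([row["base_weight"] for row in weighted])  # [250.0, 250.0, 120.0]
```

Each zone-A household stands in for 250 real households, so the balancer starts close to the final scale rather than at 1.0.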

SamplePlan dataclass

Stratified sampling plan mapping Census block groups to segments.

Each row represents a Census block group. Block groups that share the same sample_segment are treated as a single stratum for initial-weight computation: base_weight = segment_bg_pop / segment_n_responses.

Block-group population totals are sourced from the crosswalk (PumaCrosswalk.block_group_populations in processing.weighting.data_prep.crosswalk), not from this table.
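The segment-level formula can be sketched as follows. This is a minimal stand-in that uses plain dicts rather than a SamplePlan and a crosswalk; the function name `segment_base_weights`, the segment labels, and the populations are hypothetical.

```python
from collections import Counter, defaultdict


def segment_base_weights(households, bg_to_segment, bg_population):
    """Map hh_id -> segment_bg_pop / segment_n_responses."""
    # Segment population: sum of the populations of its block groups.
    segment_pop = defaultdict(float)
    for bg_geo_id, segment in bg_to_segment.items():
        segment_pop[segment] += bg_population[bg_geo_id]
    # Segment responses: households whose block group falls in the segment.
    segment_n = Counter(bg_to_segment[hh["bg_geo_id"]] for hh in households)
    weights = {}
    for hh in households:
        segment = bg_to_segment[hh["bg_geo_id"]]
        weights[hh["hh_id"]] = segment_pop[segment] / segment_n[segment]
    return weights


# Hypothetical plan: two block groups form "core", one forms "outer".
bg_to_segment = {
    "060014001001": "core",
    "060014001002": "core",
    "060014002001": "outer",
}
bg_population = {"060014001001": 600.0, "060014001002": 300.0, "060014002001": 200.0}
households = [
    {"hh_id": 1, "bg_geo_id": "060014001001"},
    {"hh_id": 2, "bg_geo_id": "060014001001"},
    {"hh_id": 3, "bg_geo_id": "060014001002"},
    {"hh_id": 4, "bg_geo_id": "060014002001"},
]

print(segment_base_weights(households, bg_to_segment, bg_population))
# {1: 300.0, 2: 300.0, 3: 300.0, 4: 200.0}
```

All three "core" households get 900 / 3 = 300 regardless of which block group they live in, because the stratum, not the block group, is the unit of expansion.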

Attributes:

Name Type Description
strata pl.DataFrame

Required columns:

  • bg_geo_id (str) — 12-character Census block-group FIPS code.
  • sample_segment (str) — sampling-stratum label. All block groups sharing a segment get the same base weight.

load_sample_plan

load_sample_plan(path: str | Path) -> SamplePlan

Read a sample-plan CSV into a SamplePlan.

Parameters:

Name Type Description Default
path str | Path

Path to a CSV file. Must contain at minimum bg_geo_id and sample_segment. Block-group population totals are sourced from the crosswalk, not from the CSV.

required

Returns:

Type Description
SamplePlan

Raises:

Type Description
FileNotFoundError

If path does not exist.

ValueError

If required columns are missing (raised by SamplePlan).
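The documented error behaviour can be approximated with the standard library alone. The sketch below is a hypothetical stand-in (`read_plan_rows`) that returns a list of row dicts instead of a SamplePlan; it only mirrors the contract described above (missing file, missing required columns) and is not the real loader.

```python
import csv
import tempfile
from pathlib import Path

REQUIRED_COLUMNS = ("bg_geo_id", "sample_segment")


def read_plan_rows(path):
    """Hypothetical stand-in for load_sample_plan returning plain row dicts."""
    path = Path(path)
    if not path.exists():
        raise FileNotFoundError(f"Sample plan not found: {path}")
    with path.open(newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)
        missing = [c for c in REQUIRED_COLUMNS if c not in (reader.fieldnames or [])]
        if missing:
            raise ValueError(f"Sample plan is missing required columns: {missing}")
        # csv yields strings, so leading zeros in the FIPS code are preserved.
        return [{c: row[c] for c in REQUIRED_COLUMNS} for row in reader]


with tempfile.TemporaryDirectory() as tmp:
    plan_path = Path(tmp) / "sample_plan.csv"
    plan_path.write_text(
        "bg_geo_id,sample_segment,notes\n"
        "060014001001,core,extra column\n"
        "060014002001,outer,\n",
        encoding="utf-8",
    )
    plan_rows = read_plan_rows(plan_path)

print(plan_rows[0])  # {'bg_geo_id': '060014001001', 'sample_segment': 'core'}
```

Reading bg_geo_id as a string matters: a numeric parse would drop the leading zero of state FIPS codes below 10 and break the 12-character key.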

compute_base_weights

compute_base_weights(
    seed: pl.DataFrame,
    control_totals: ControlTotals,
    targets: list[str],
    geo_col: str = "ctrl_geoid",
    *,
    sample_plan: str | Path | SamplePlan | None = None,
    bg_populations: pl.DataFrame | None = None
) -> pl.DataFrame

Add base_weight column to the seed table.

Parameters:

Name Type Description Default
seed pl.DataFrame

Household seed table from build_seed_table. Must contain hh_id and geo_col. When using a sample plan, must also contain bg_geo_id (assigned via [PumaCrosswalk.assign_block_groups][processing.weighting.data_prep.crosswalk.PumaCrosswalk.assign_block_groups]).

required
control_totals ControlTotals

PUMS-derived targets from build_control_totals.

required
targets list[str]

Control registry names (used to identify the master HH control).

required
geo_col str

Geography column on seed.

'ctrl_geoid'
sample_plan str | Path | SamplePlan | None

Optional sample plan. Accepts a file path (loaded via load_sample_plan) or an already-loaded SamplePlan. When None, default response inversion is used.

None
bg_populations pl.DataFrame | None

Census block-group population totals with columns [bg_geo_id, bg_population]. Required when sample_plan is provided; sourced from [PumaCrosswalk.block_group_populations][processing.weighting.data_prep.crosswalk.PumaCrosswalk.block_group_populations].

None

Returns:

Type Description
pl.DataFrame

seed with an additional base_weight column (Float64).

Raises:

Type Description
ValueError

If no household-level control is found in targets, or if a zone has zero survey responses.

processing.weighting.balancing.importance

MOE-based importance weight calculation.

Uses PUMS successive-difference replicate weights (WGTP1-80 / PWGTP1-80) to estimate the Margin of Error (MOE) for each weighted control total, then converts to importance weights inversely proportional to the coefficient of variation (CV).

Controls with higher sampling uncertainty (larger CV) receive lower importance so the balancer doesn't chase noisy targets. Controls absent from the returned dict (e.g. structural totals whose MOE is meaningless) fall back to the balancer's default importance.
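To make the MOE-to-importance chain concrete, the sketch below applies the standard ACS successive-difference formula (variance = 4/80 times the sum of squared deviations of the 80 replicate estimates from the full-sample estimate) and the 90% multiplier of 1.645. The final mapping from CV to importance (`importance_from_cv`, normalised so the most precise control gets 1.0) is an assumption for illustration; the module's actual scaling is not documented here. Control names and replicate values are made up.

```python
import math

N_REPLICATES = 80  # WGTP1-80 / PWGTP1-80


def replicate_se(estimate, replicate_estimates):
    """Standard error from successive-difference replicate estimates."""
    if len(replicate_estimates) != N_REPLICATES:
        raise ValueError(f"expected {N_REPLICATES} replicate estimates")
    squared = sum((rep - estimate) ** 2 for rep in replicate_estimates)
    return math.sqrt(4.0 * squared / N_REPLICATES)


def cv_and_moe(estimate, replicate_estimates):
    """Return (coefficient of variation, 90% margin of error)."""
    se = replicate_se(estimate, replicate_estimates)
    return se / estimate, 1.645 * se


def importance_from_cv(cvs):
    """Assumed mapping: importance = smallest CV / this control's CV."""
    best = min(cvs.values())
    return {name: best / cv for name, cv in cvs.items()}


def fake_replicates(estimate, spread):
    # Synthetic replicate totals alternating above and below the estimate.
    return [estimate + spread if i % 2 == 0 else estimate - spread
            for i in range(N_REPLICATES)]


# Same absolute noise on a large and a small control total.
cvs = {
    "large_control": cv_and_moe(1000.0, fake_replicates(1000.0, 10.0))[0],
    "small_control": cv_and_moe(200.0, fake_replicates(200.0, 10.0))[0],
}
importance = importance_from_cv(cvs)
print(cvs)         # CVs of roughly 0.02 and 0.10
print(importance)  # the noisier control gets roughly one fifth of the importance
```

Each replicate estimate would in practice be the control total recomputed with one replicate weight column; household controls use the WGTP family and person controls the PWGTP family.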

compute_moe_importance

compute_moe_importance(
    hh_df: pl.DataFrame,
    person_df: pl.DataFrame,
    target_names: list[str],
    *,
    geo_col: str = "ctrl_geoid"
) -> dict[str, float]

Compute per-control importance weights from PUMS replicate-weight MOE.

Parameters:

Name Type Description Default
hh_df pl.DataFrame

Crosswalk-allocated PUMS households with _xw_WGTP and _xw_WGTP1 through _xw_WGTP80 columns (plus recoded controls).

required
person_df pl.DataFrame

Crosswalk-allocated PUMS persons with _xw_PWGTP and _xw_PWGTP1 through _xw_PWGTP80 columns (plus recoded controls).

required
target_names list[str]

Control registry names to compute importance for.

required
geo_col str

Geography column (default "ctrl_geoid").

'ctrl_geoid'

Returns:

Type Description
dict[str, float]

{control_name: importance} for controls where MOE could be computed. Controls with no PUMS records (data sparsity) are omitted; the balancer will apply its default importance.

Raises:

Type Description
ValueError

If a target name is not in the control registry, or if required replicate weight / control columns are missing from the data.

processing.weighting.balancing.balancer

Maximum-entropy list balancer.

Thin Polars→numpy bridge around PopulationSim's np_balancer_numba. Runs independently per geography zone.

Algorithm

Find the weight vector w that is closest to the seed weights w₀ in KL divergence, subject to the marginal constraints:

\[ \min \sum_i w_i \ln(w_i / w_{0i}) \quad \text{s.t.} \quad Aw = t,\; w_i \ge 0 \]

where A is the incidence matrix and t is the target totals vector.
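For intuition, the optimisation above can be approximated with classic raking (iterative proportional fitting) when the incidence matrix is binary: each control in turn rescales the weights of its member households so the control is met. This is a deliberately simplified stand-in, not np_balancer_numba, which is a different solver and also handles importance weights and expansion bounds. The function name `rake` and the toy data are hypothetical.

```python
def rake(base_weights, incidence, targets,
         max_iterations=1000, convergence_threshold=0.001):
    """Rake weights to binary-incidence controls; returns (weights, sweeps used)."""
    weights = list(base_weights)
    for iteration in range(1, max_iterations + 1):
        worst_gap = 0.0
        for row, target in zip(incidence, targets):
            current = sum(w for w, a in zip(weights, row) if a)
            if current == 0.0:
                continue  # no seed household contributes to this control
            factor = target / current
            worst_gap = max(worst_gap, abs(factor - 1.0))
            # Multiplicative update keeps every weight non-negative.
            weights = [w * factor if a else w for w, a in zip(weights, row)]
        if worst_gap < convergence_threshold:
            return weights, iteration
    return weights, max_iterations


# Four households, three controls (rows of A): all households,
# one-person households, households with a vehicle.
incidence = [
    [1, 1, 1, 1],
    [1, 1, 0, 0],
    [1, 0, 1, 0],
]
targets = [100.0, 30.0, 60.0]
base = [25.0, 25.0, 25.0, 25.0]

weights, sweeps = rake(base, incidence, targets)
fitted = [sum(w * a for w, a in zip(weights, row)) for row in incidence]
print([round(w, 1) for w in weights], sweeps)
print([round(f, 1) for f in fitted])  # close to the targets
```

Because every update is multiplicative, each weight stays equal to its seed weight times a product of per-control factors, which is the exponential form the KL minimiser takes under linear constraints.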

Implementation

Calls populationsim.balancing.balancers_numba.np_balancer_numba directly — a pure @njit function (~120 lines) taking numpy arrays. No PopulationSim pipeline infrastructure involved. Zones are independent and parallelisable via ThreadPoolExecutor.
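Since zones do not share state, a per-zone fan-out is straightforward. The sketch below shows only the dispatch pattern with ThreadPoolExecutor; `balance_zone` is a hypothetical placeholder that rescales weights to a single total, standing in for the real per-zone solve.

```python
from concurrent.futures import ThreadPoolExecutor


def balance_zone(zone_id, base_weights, target_total):
    """Placeholder per-zone solve: scale weights to match one total."""
    scale = target_total / sum(base_weights)
    return zone_id, [w * scale for w in base_weights]


# Hypothetical zones: (base weights, household target).
zones = {
    "A": ([250.0, 250.0], 520.0),
    "B": ([120.0], 130.0),
}

with ThreadPoolExecutor(max_workers=2) as pool:
    futures = [
        pool.submit(balance_zone, zone_id, weights, target)
        for zone_id, (weights, target) in zones.items()
    ]
    # Collect by zone id so the result does not depend on completion order.
    results = dict(future.result() for future in futures)

print({zone: round(sum(w), 1) for zone, w in results.items()})
```

Returning the zone id alongside the weights keeps results attributable even if futures are consumed out of order.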

Configuration (YAML)

max_iterations: 1000
convergence_threshold: 0.001
max_expansion_factor: 10    # upper bound = initial_weight x factor
min_expansion_factor: 0.1   # lower bound = initial_weight x factor
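Because both bounds scale with the initial weight, the choice of base weight decides which final weights are reachable at all. A small sketch of that arithmetic, using the factor values shown above (the helper names are hypothetical):

```python
def weight_bounds(base_weight, min_expansion_factor=0.1, max_expansion_factor=10.0):
    """Return (lower, upper) bounds: initial_weight x factor."""
    return base_weight * min_expansion_factor, base_weight * max_expansion_factor


def reachable(base_weight, needed_weight):
    lower, upper = weight_bounds(base_weight)
    return lower <= needed_weight <= upper


# A household that must end up representing about 300 real households:
print(reachable(1.0, 300.0))    # False: the upper bound is only 10
print(reachable(250.0, 300.0))  # True: bounds are 25 to 2500
```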

ZoneStatus

Per-zone convergence diagnostics.

balance_weights

balance_weights(
    seed: pl.DataFrame,
    control_totals: ControlTotals,
    targets: list[str],
    balancing: BalancingConfig | None = None,
    importance: ImportanceConfig | None = None,
    *,
    verbose: bool = True
) -> tuple[pl.DataFrame, list[ZoneStatus]]

Balance household weights to match control totals per zone.

Parameters:

Name Type Description Default
seed pl.DataFrame

Incidence table with hh_id, ctrl_geoid, base_weight, and pivoted control columns ({ctrl}__{member} or structural). All merges (global and zone-specific) must already be applied.

required
control_totals ControlTotals

Per-zone targets (with merges already applied).

required
targets list[str]

Control registry names.

required
balancing BalancingConfig | None

Solver bounds, iteration limits, and parallelism (defaults apply).

None
importance ImportanceConfig | None

Per-control importance weights (defaults apply).

None
verbose bool

Log per-zone convergence (default True).

True

Returns:

Type Description
tuple[pl.DataFrame, list[ZoneStatus]]

Tuple of (weights, statuses), where weights is a DataFrame with columns hh_id, hh_weight, and geo_id, and statuses holds one ZoneStatus entry per zone with convergence info.

processing.weighting.balancing.weight_propagation

Shared weight hierarchy constants and propagation helpers.

Used by both weighting and existing_weights (pre-computed) steps to propagate household weights down through the canonical table hierarchy.

Weight derivation

| Table | Weight column | Derivation |
|---|---|---|
| households | hh_weight | Direct from balancer |
| persons | person_weight | Carry forward hh_weight via hh_id |
| days | day_weight | Carry forward person_weight via person_id |
| unlinked_trips | unlinked_trip_weight | Carry forward day_weight via day_id |
| linked_trips | linked_trip_weight | Mean of constituent unlinked_trip_weight |
| tours | tour_weight | Mean of constituent linked_trip_weight |
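The "mean of constituent" rule is the less obvious half of this hierarchy. A minimal sketch follows, assuming each unlinked trip row carries the linked_trip_id it belongs to (inferred from the AGGREGATE constant, not confirmed here); the helper name and data are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def mean_weight_by_group(child_rows, group_key, child_weight_col):
    """Map each parent id to the mean weight of its constituent rows."""
    grouped = defaultdict(list)
    for row in child_rows:
        grouped[row[group_key]].append(row[child_weight_col])
    return {group: mean(values) for group, values in grouped.items()}


unlinked_trips = [
    {"unlinked_trip_id": 1, "linked_trip_id": 10, "unlinked_trip_weight": 100.0},
    {"unlinked_trip_id": 2, "linked_trip_id": 10, "unlinked_trip_weight": 80.0},
    {"unlinked_trip_id": 3, "linked_trip_id": 11, "unlinked_trip_weight": 120.0},
]

linked_trip_weight = mean_weight_by_group(
    unlinked_trips, "linked_trip_id", "unlinked_trip_weight"
)
print(linked_trip_weight)  # {10: 90.0, 11: 120.0}
```

The same helper would apply one level up, averaging linked_trip_weight within each tour_id.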

Checksums (logged as warnings if violated)

  • sum(person_weight) ≈ sum(hh_weight × persons_per_hh)
  • sum(day_weight) ≈ sum(person_weight × complete_travel_days)
  • sum(unlinked_trip_weight) ≈ sum(day_weight × trips_per_day)
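The first checksum can be sketched as below. The tolerance (`rel_tol`), the logger name, and the helper `check_total` are assumptions for illustration; only the identity being checked and the warn-instead-of-raise behaviour come from the description above.

```python
import logging
import math

logger = logging.getLogger("weight_checksums")


def check_total(label, actual, expected, rel_tol=1e-6):
    """Log a warning (rather than raise) when the two totals disagree."""
    ok = math.isclose(actual, expected, rel_tol=rel_tol)
    if not ok:
        logger.warning("%s checksum mismatch: %.3f vs expected %.3f",
                       label, actual, expected)
    return ok


households = [
    {"hh_id": 1, "hh_weight": 100.0, "persons_per_hh": 2},
    {"hh_id": 2, "hh_weight": 50.0, "persons_per_hh": 1},
]
persons = [
    {"person_id": 11, "hh_id": 1, "person_weight": 100.0},
    {"person_id": 12, "hh_id": 1, "person_weight": 100.0},
    {"person_id": 21, "hh_id": 2, "person_weight": 50.0},
]

expected = sum(h["hh_weight"] * h["persons_per_hh"] for h in households)
actual = sum(p["person_weight"] for p in persons)
persons_ok = check_total("person_weight", actual, expected)

# Losing one person record breaks the identity and triggers the warning.
truncated_ok = check_total("person_weight", actual - 100.0, expected)
print(persons_ok, truncated_ok)  # True False
```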

Completion flag

If a table has a complete boolean column, records with complete == False receive a weight of 0 after the carry-forward join. This ensures that incomplete records never contribute to downstream aggregations (the aggregation step already excludes zeros).
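A minimal sketch of the zeroing rule, using plain row dicts; the helper name `zero_incomplete` is hypothetical. Rows from tables without a complete column pass through unchanged.

```python
def zero_incomplete(rows, weight_col):
    """Return rows with weight_col set to 0.0 wherever complete is False."""
    return [
        {**row, weight_col: 0.0} if row.get("complete") is False else row
        for row in rows
    ]


days = [
    {"day_id": 1, "person_id": 11, "day_weight": 100.0, "complete": True},
    {"day_id": 2, "person_id": 11, "day_weight": 100.0, "complete": False},
]
days = zero_incomplete(days, "day_weight")
print([d["day_weight"] for d in days])  # [100.0, 0.0]

# No "complete" column: weights are left as they are.
persons = zero_incomplete([{"person_id": 11, "person_weight": 100.0}], "person_weight")
print(persons[0]["person_weight"])  # 100.0
```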

WEIGHT_CONFIG_MAPPING module-attribute

WEIGHT_CONFIG_MAPPING: dict[str, tuple[str, str, str]] = {
    "household_weights": ("households", "hh_id", "hh_weight"),
    "person_weights": ("persons", "person_id", "person_weight"),
    "day_weights": ("days", "day_id", "day_weight"),
    "unlinked_trip_weights": (
        "unlinked_trips",
        "unlinked_trip_id",
        "unlinked_trip_weight",
    ),
    "linked_trip_weights": (
        "linked_trips",
        "linked_trip_id",
        "linked_trip_weight",
    ),
    "joint_trip_weights": (
        "joint_trips",
        "joint_trip_id",
        "joint_trip_weight",
    ),
    "tour_weights": ("tours", "tour_id", "tour_weight"),
}

CARRY_FORWARD module-attribute

CARRY_FORWARD = [
    ("households", "persons", "hh_id", "person_weight"),
    ("persons", "days", "person_id", "day_weight"),
    ("days", "unlinked_trips", "day_id", "unlinked_trip_weight"),
]

AGGREGATE module-attribute

AGGREGATE = [
    (
        "unlinked_trips",
        "linked_trips",
        "linked_trip_id",
        "linked_trip_weight",
    ),
    (
        "linked_trips",
        "joint_trips",
        "joint_trip_id",
        "joint_trip_weight",
    ),
    ("linked_trips", "tours", "tour_id", "tour_weight"),
]

propagate_weights

propagate_weights(
    tables: dict[str, pl.DataFrame | None],
    has_weight: dict[str, str],
    *,
    skip: set[str] | None = None
) -> None

Carry forward and aggregate weights through the hierarchy.

Modifies tables and has_weight in place.

Parameters:

Name Type Description Default
tables dict[str, pl.DataFrame | None]

Mutable dict of table_name → DataFrame (or None).

required
has_weight dict[str, str]

Mutable dict tracking which tables already have a weight column and the column name. E.g. {"households": "hh_weight"}.

required
skip set[str] | None

Table names to skip (e.g. tables that already have externally provided weights).

None
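The in-place contract can be illustrated with a reduced carry-forward loop. This sketch assumes each CARRY_FORWARD tuple means (parent table, child table, join key, child weight column) and that skip leaves a table untouched; both readings are inferred from the constants and parameter descriptions rather than confirmed. It uses lists of dicts in place of DataFrames and omits the AGGREGATE step.

```python
CARRY_FORWARD = [
    ("households", "persons", "hh_id", "person_weight"),
    ("persons", "days", "person_id", "day_weight"),
]


def propagate(tables, has_weight, *, skip=None):
    """Mutate tables and has_weight in place; returns None."""
    skip = skip or set()
    for parent, child, key, child_col in CARRY_FORWARD:
        if child in skip or tables.get(child) is None or parent not in has_weight:
            continue
        parent_col = has_weight[parent]
        lookup = {row[key]: row[parent_col] for row in tables[parent]}
        tables[child] = [{**row, child_col: lookup[row[key]]} for row in tables[child]]
        has_weight[child] = child_col  # downstream steps can now build on this table


tables = {
    "households": [{"hh_id": 1, "hh_weight": 250.0}],
    "persons": [{"person_id": 11, "hh_id": 1}, {"person_id": 12, "hh_id": 1}],
    "days": None,  # table not present in this run
}
has_weight = {"households": "hh_weight"}

propagate(tables, has_weight)
print(has_weight)  # {'households': 'hh_weight', 'persons': 'person_weight'}
print([p["person_weight"] for p in tables["persons"]])  # [250.0, 250.0]
```

Tracking the weight column name in has_weight is what lets each step find its parent's weight without hard-coding column names.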