Compute Weights

This path computes new expansion weights from PUMS controls and survey seed data.

It implements the full weighting workflow, including:

  • data preparation and PUMS fetching
  • control recoding and shared incidence construction
  • survey null imputation
  • zone assignment and control merges
  • control aggregation and importance resolution
  • max-entropy balancing
  • weight propagation, diagnostics, and validation

Both PUMS (Census microdata) and Survey (travel-diary seed) are conformed to the same incidence schema: one row per household, with shared 1-D controls and cross-tab targets expanded into a common set of incidence columns. That shared representation is what allows the same geography and merge logic to operate on both datasets.

%%{init: {'theme': 'base', 'themeVariables': {
  'background': '#fcfaf6',
  'primaryTextColor': '#22313f',
  'primaryBorderColor': '#9f8b6d',
  'lineColor': '#6b7280',
  'fontFamily': 'IBM Plex Sans, Segoe UI, sans-serif'
}}}%%
flowchart TD
  A("Setup\n• Register controls\n• Resolve 1-D targets\n• Resolve cross-tab targets\n• Build crosswalk\n• Fetch PUMS")

  subgraph inputs ["**Normalize to shared incidence schema**"]
    direction LR
    P0(["PUMS HH + Person"]) --> P1("Recode + pivot") --> PI[["PUMS incidence"]]
    S0(["Survey HH + Person"]) --> S1("Recode + pivot") --> SI[["Survey incidence"]]
  end

  NI("Null imputation on survey only\nRF trained on PUMS incidence")

  subgraph shared_stage ["**Shared incidence transforms**"]
    direction TB
    G1("Zone assignment\n• assign survey HHs\n• allocate PUMAs to control zones")
    G2("Create cross-tab targets")
    G3("Apply merges\ncollapse merged categories")
    G1 --> G2
    G2 --> G3
  end

  subgraph outputs ["**Divergent downstream roles**"]
    direction LR
    PZ[["PUMS incidence\nwith geography + merges"]] --> CT[["Control totals\naggregate PUMS by zone"]]
    PZ --> IM[["Importance\nMOE from PUMS or explicit config"]]
    SZ[["Seed incidence\nwith geography + merges"]]
  end

  B("Max-entropy balancing\nseed incidence × control totals × importance")
  O(["Weight propagation\ndiagnostics\nvalidation"])

  A --> P0
  A --> S0
  PI -. "Train RF Model" .-> NI
  SI --> NI
  PI --> G1
  NI --> G1
  G3 --> PZ
  G3 --> SZ
  SZ --> B
  CT --> B
  IM --> B
  B --> O

  classDef setup fill:#dce7f5,stroke:#557aa3,color:#1d2c3c,stroke-width:2px;
  classDef source fill:#f8f1e5,stroke:#c7a977,color:#3a3126,stroke-width:1.5px;
  classDef action fill:#efe7d8,stroke:#bca07a,color:#2f2a22,stroke-width:1.5px;
  classDef pums fill:#d9ecf7,stroke:#5b8fb9,color:#17324d,stroke-width:1.5px;
  classDef survey fill:#f8dfc9,stroke:#cf8c52,color:#4a2b15,stroke-width:1.5px;
  classDef shared fill:#e3efe2,stroke:#7ca16f,color:#233821,stroke-width:1.5px;
  classDef balance fill:#dcefe6,stroke:#4f7b6c,color:#183229,stroke-width:2px;
  classDef output fill:#f3dca2,stroke:#b6903b,color:#47360f,stroke-width:1.5px;

  class A setup;
  class P0,S0 source;
  class P1,S1,NI action;
  class PI,PZ,CT,IM pums;
  class SI,SZ survey;
  class G1,G2,G3 shared;
  class B balance;
  class O output;

  style inputs fill:#f9f4ec,stroke:#ccb99d,stroke-width:2px,color:#3a3126;
  style shared_stage fill:#edf5ec,stroke:#9db798,stroke-width:2px,color:#233821;
  style outputs fill:#f9f4ec,stroke:#ccb99d,stroke-width:2px,color:#3a3126;

The important split is not that PUMS and survey run as fully parallel pipelines. It is that they are both transformed into the same incidence format. From there, PUMS incidence is used to produce control totals and MOE-based importance, while the survey incidence becomes the seed to be reweighted. Null imputation is applied only to the survey side; PUMS is used as the training source for that step.

processing.weighting.compute_weights

Entry point for the weighting pipeline step.

Orchestrates the full weighting pipeline in the following stages:

A. Geography Crosswalk

  1. Setup -- register controls from YAML config, build the geographic crosswalk (translating between Census PUMAs and the project's custom weighting geography using block-group population as the intermediary), and prepare the sample plan.

B. Control Data Preparation

  1. PUMS recoding -- load PUMS 1-year microdata and recode into the YAML-configured variable bins used by the controls.
  2. Merges -- apply any user-specified category merges (global or zone-specific) to both controls and the survey incidence table.
  3. Control aggregation -- apply the crosswalk to PUMS and aggregate into marginal control totals per zone.

C. Survey Seed Preparation

  1. Survey recoding -- recode canonical survey variables into the same bin / group categories as the PUMS controls.
  2. Null imputation -- fill null-induced zeros in the survey incidence table with RF-predicted fractional class probabilities.
  3. Zone assignment -- assign survey households to weighting zones via the geographic crosswalk.
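
The null-imputation idea in step 2 can be sketched with scikit-learn: train a classifier on complete PUMS records, then use `predict_proba` to spread each null survey household across the target's categories as fractional probabilities rather than a hard class. Everything here (features, target, model settings) is illustrative, not the pipeline's actual RF configuration:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Hypothetical PUMS training data: predictor controls and a target control
# category derived from them.
X_pums = rng.integers(0, 3, size=(200, 2))
y_pums = (X_pums.sum(axis=1) > 2).astype(int)

rf = RandomForestClassifier(n_estimators=50, random_state=0)
rf.fit(X_pums, y_pums)

# Survey households whose target control is null: instead of a hard class,
# each row gets fractional class probabilities that fill its incidence
# columns (rows sum to 1, preserving totals).
X_null = rng.integers(0, 3, size=(5, 2))
proba = rf.predict_proba(X_null)
```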

D. Maximum-Entropy Balancing

  1. Importance resolution -- compute per-control importance weights from replicate-weight MOE or explicit YAML overrides.
  2. Balancing -- build per-household incidence vectors and fit weights using PopulationSim's numba-accelerated max-entropy balancer, independently per zone.
  3. Weight propagation -- propagate final hh_weight to all canonical tables (persons, days, trips, tours).

E. Diagnostics & Validation

  1. Diagnostics -- generate a self-contained interactive HTML report with convergence, fit, and weight-quality diagnostics.
  2. Validation -- run sanity checks on the final weights and control totals. Results are logged as warnings but are not currently included in the HTML report — check the pipeline log to review them.

compute_weights

compute_weights(
    state_fips: str,
    pums_year: int,
    controls: list[dict],
    geography: dict,
    *,
    pums_households: str | None = None,
    pums_persons: str | None = None,
    sample_plan: str | None = None,
    pipeline_cache: PipelineCache | None = None,
    moe_based_importance: bool = False,
    default_importance: float = 100.0,
    max_expansion_factor: float = 10.0,
    min_expansion_factor: float = 0.1,
    min_weight: float | None = 1,
    max_weight: float | None = None,
    max_iterations: int = 10000,
    n_workers: int = 1,
    expansion_factor_grid: list[float] | None = None,
    diagnostics: dict | None = None,
    strict_survey_nulls: bool = False,
    households: pl.DataFrame | None = None,
    persons: pl.DataFrame | None = None,
    days: pl.DataFrame | None = None,
    unlinked_trips: pl.DataFrame | None = None,
    linked_trips: pl.DataFrame | None = None,
    joint_trips: pl.DataFrame | None = None,
    tours: pl.DataFrame | None = None
) -> dict[str, pl.DataFrame]

Compute expansion weights from PUMS controls and propagate to all tables.

Flat-parameter entry point required by the @step() decorator (YAML → keyword args). Constructs and drives a WeightingPipeline; full documentation of the algorithm, configuration, and diagnostics is included in that class.

WeightingPipeline

WeightingPipeline is the orchestration class that compute_weights constructs and drives. It holds all intermediate state (crosswalk, incidence, control totals, weights, diagnostics) and exposes each stage as an explicit method.

processing.weighting.weighting_pipeline

Top-level weighting pipeline step.

Orchestrates the full weighting pipeline via WeightingPipeline:

  1. Setup — parse YAML config → specs, target names, merges, importance. Cross-tab controls are registered with pre-merged dimensions so the enum reflects the effective cell count.
  2. Data fetching — load PUMS (API or files); receive survey tables.
  3. Conformance — recode both PUMS and survey through identical control expressions → same control-column schema.
  4. Incidence pivot — unified pivoter produces identical {ctrl}__{member} column layout for both datasets.
  5. Zone assignment — crosswalk assigns study_geoid and ctrl_geoid to survey HHs (point-in-polygon) and allocates PUMS weights to target zones. Zone groups (if configured) are applied inside the crosswalk so ctrl_geoid is ready for balancing.
  6. 1-D merges — global merges collapse incidence columns symmetrically on both tables (originals dropped). Zone-specific merges add merged columns (originals kept) and modify control totals for the specified zones after aggregation.
  7. Control totals — aggregate PUMS incidence into target totals per zone.
  8. Balancer — base weights → max-entropy balancing → weight propagation.

Design decisions

  • PopulationSim dependency — uses PopulationSim's core numba balancer (np_balancer_numba) directly — a pure @njit function (~120 lines) taking numpy arrays. No PopulationSim pipeline infrastructure involved.
  • Geography columns — three distinct levels: PUMA (raw Census PUMA), study_geoid (crosswalk target zones from the user's polygon file), and ctrl_geoid (balancing geography, equal to study_geoid unless zone groups are configured). Downstream balancing always uses ctrl_geoid; diagnostics/maps use study_geoid for spatial detail.
  • Symmetric incidence — both the survey sample and the PUMS universe are first recoded and pivoted into incidence tables with identical column layouts. Geography, merges, and crosstabs are applied after incidence construction, keeping the recode/pivot logic independent of geography.

Algorithm

Find weight vector w closest to seed weights w₀ (KL-divergence) subject to marginal constraints:

min Σᵢ wᵢ ln(wᵢ / w₀ᵢ) s.t. A w = t, wᵢ ≥ 0

where A is the incidence matrix and t is the target totals vector. Runs independently per control geography zone (zones are parallelisable).
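
For intuition, the KL objective above is solved by multiplicative (raking-style) updates: cycling through the controls and rescaling the households each control touches drives A w toward t while keeping w close to w₀ in KL divergence. This is a toy re-implementation for a 0/1 incidence matrix, not PopulationSim's numba balancer:

```python
import numpy as np

def balance(A, t, w0, n_iter=200):
    """Toy entropy balancer: multiplicative updates, 0/1 incidence assumed."""
    w = w0.astype(float).copy()
    for _ in range(n_iter):
        for j in range(A.shape[0]):            # one control margin at a time
            cur = A[j] @ w                     # current weighted total
            if cur > 0:
                # Scale only the households this control touches.
                w *= np.where(A[j] > 0, t[j] / cur, 1.0)
    return w

A = np.array([[1, 1, 0, 0],                    # control 1 hits HHs 0,1
              [0, 1, 1, 1]])                   # control 2 hits HHs 1,2,3
t = np.array([30.0, 45.0])                     # zone targets
w0 = np.ones(4)                                # seed weights

w = balance(A, t, w0)                          # A @ w ≈ t at convergence
```

Each zone has its own A, t, and seed, which is why the zones can be balanced independently and in parallel.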

WeightingPipeline

Stateful weighting pipeline.

Separates configuration (frozen at __init__) from intermediate state (built up phase-by-phase). Phase methods are designed to be called in sequence from the @step() compute_weights entry-point; each stores results as instance attributes.

Usage

pipeline = WeightingPipeline(controls=..., config=..., data=...)

pipeline.setup()
pipeline.fetch_pums()

pipeline.recode_and_pivot()
pipeline.assign_zones()
pipeline.apply_merges()

pipeline.aggregate_totals()
pipeline.resolve_importance()

pipeline.balance()
pipeline.propagate()

pipeline.generate_diagnostics()