Data Preparation

processing.weighting.data_prep

Data-preparation toolbox for the weighting workflow.

This sub-package does not orchestrate the weighting pipeline itself. Instead, it provides the reusable building blocks that the weighting pipeline uses to prepare geography, PUMS inputs, control totals, and survey seed data.

Modules

  1. Census geography (census_geo) -- download and cache TIGER PUMA and block shapefiles via pygris, including block-level population inputs.
  2. Geography crosswalk (crosswalk) -- construct a population-weighted PUMA-to-target-zone allocation table and assign/allocate records across project geographies.
  3. PUMS data (pums_data) -- fetch ACS 1-year PUMS microdata from the Census API or load local extracts, with chunking and caching helpers.
  4. Control data (control_data) -- recode PUMS variables into weighting control categories and aggregate them into zone-level control totals.
  5. Seed data (seed_data) -- recode canonical survey variables into the same control categories used by the PUMS-based controls.

Together these modules provide the shared data-preparation utilities used by WeightingPipeline before balancing begins.

processing.weighting.data_prep.crosswalk

PUMA-specific crosswalk wrapper.

PumaCrosswalk fetches Census PUMA and block geographies and delegates the heavy lifting to build_crosswalk.

See docs/pipeline_steps/weighting/crosswalk.md for the detailed crosswalk explanation, diagrams, and worked example.

  1. Load target zone polygons; auto-discover overlapping PUMAs.
  2. Download/cache TIGER PUMA and block shapefiles (via census_geo).
  3. Rasterize block population into a density grid.
  4. Rasterize PUMA IDs into a categorical label grid.
  5. exactextract: compute sum(population) per target zone, grouped by PUMA label.
  6. Normalise: allocation_weight = pop(puma, target) / pop(puma) per PUMA.
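The normalisation in step 6 can be illustrated with a minimal sketch. The population figures, PUMA codes, and zone names below are made up, and `allocation_weights` is a hypothetical helper rather than a function of this module. It assumes pop(puma) is the sum over the zones listed; if the denominator is instead the full PUMA population, the weights of a PUMA only partly inside the study area would sum to less than 1.

```python
from collections import defaultdict

# Hypothetical population sums per (puma_id, study_geoid) pair, as step 5
# might produce them. All numbers are invented for illustration.
overlap_pop = {
    ("07501", "zone_a"): 6000.0,
    ("07501", "zone_b"): 2000.0,
    ("07502", "zone_b"): 5000.0,
}


def allocation_weights(overlap_pop):
    """Divide each (PUMA, zone) population by that PUMA's summed population."""
    puma_pop = defaultdict(float)
    for (puma_id, _zone), pop in overlap_pop.items():
        puma_pop[puma_id] += pop
    # Skip PUMAs with zero population to avoid dividing by zero.
    return {
        key: pop / puma_pop[key[0]]
        for key, pop in overlap_pop.items()
        if puma_pop[key[0]] > 0
    }


weights = allocation_weights(overlap_pop)
```

Under that assumption the weights of each PUMA sum to 1, so multiplying a PUMS record weight by them redistributes the record across zones without changing its total.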

Resolution affects only how finely population is distributed within each block. Boundary accuracy is exact at any resolution because exactextract computes sub-cell coverage analytically.

Outputs:

  • crosswalk_df: puma_id, study_geoid, population, allocation_weight.
  • puma_ids: list of PUMAs overlapping the study area.
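
Example

from processing.weighting.data_prep.crosswalk import PumaCrosswalk

# geo_cfg: geography config (assumed here to be a GeographyConfig instance)
xw = PumaCrosswalk(geo_cfg, state_fips="06", pums_year=2023)
households = xw.assign_households(households)
pums_hh, pums_per = xw.allocate_pums_weights(pums_hh, pums_per)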

PumaCrosswalk

Population-weighted PUMA-to-target-zone crosswalk.

Loads target zones and Census geographies on construction, rasterizes block population, and cross-tabulates to produce allocation weights. Exposes crosswalk_df, puma_ids, and target_gdf for downstream use. The crosswalk produces both study_geoid (raw target zone from the shapefile) and ctrl_geoid (balancing geography — equal to study_geoid unless zone groups are configured).

TargetZoneConfig pydantic-model

Target zone polygon source.

Fields:

  • file (str)
  • id_field (str | None)

GeographyConfig pydantic-model

Geography block of the weighting YAML config.

Attributes:

  • target_zones (TargetZoneConfig): Polygon file and optional zone-ID column.
  • resolution (int): Raster cell size in metres (default 250).

Fields:

Validators:

  • _positive_resolution → resolution
  • _valid_min_allocation → min_allocation

processing.weighting.data_prep.census_geo

Census TIGER geography loading.

PUMA and block geometries share the same decennial vintage (2010 for pre-2022 PUMS, 2020 for 2022+) so that block GEOIDs match the sample plan. Block population is fetched from the Census decennial API (PL 94-171 for 2020, SF1 for 2010) and joined by GEOID — the TIGER shapefiles themselves do not always carry population columns.

Processed GeoDataFrames are cached as GeoParquet under <pipeline-cache-dir>/weighting/census_geo/ for fast repeat loads.

puma_vintage_for_pums_year

puma_vintage_for_pums_year(pums_year: int) -> int

Return the PUMA geography vintage for a given PUMS year.

ACS PUMS 2022+ uses 2020 PUMAs; 2012-2021 uses 2010 PUMAs.
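The rule above amounts to a simple year threshold. The following is a minimal sketch, not the module's actual implementation; the name `sketch_puma_vintage` is hypothetical, and raising for years before 2012 is an assumption, since the source does not say how such years are handled.

```python
def sketch_puma_vintage(pums_year: int) -> int:
    """Map an ACS PUMS year to the decennial PUMA vintage described above."""
    if pums_year >= 2022:
        return 2020
    if pums_year >= 2012:
        return 2010
    # Assumption: years before 2012 are out of scope, so fail loudly.
    raise ValueError(f"No PUMA vintage mapped for PUMS year {pums_year}")
```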

get_puma_gdf

get_puma_gdf(
    state_fips: str,
    pums_year: int,
    *,
    cache_dir: Path | None = None
) -> gpd.GeoDataFrame

Download PUMA polygons for state_fips via pygris.

Returns a GeoDataFrame with puma_id (str) and geometry.

get_block_gdf

get_block_gdf(
    state_fips: str,
    pums_year: int,
    *,
    cache_dir: Path | None = None
) -> gpd.GeoDataFrame

Download block geometries and join decennial population.

Uses the same vintage as PUMAs so block GEOIDs match the sample plan. Population is fetched from the Census decennial API and joined by GEOID.

The population is used for dasymetric, block-level weighting of PUMA-to-custom-zone allocations.

Returns a GeoDataFrame with block_id (str), block_pop (int), and geometry.

processing.weighting.data_prep.pums_data

PUMS microdata I/O.

Downloads ACS PUMS 1-year microdata directly from the Census Bureau API or loads from local CSV / Parquet files. Handles type-casting of Census API string responses to proper numeric dtypes.

API behaviour

  • All PUMAs batched in a single API request.
  • Column chunking when a request needs more than 48 columns (the API limit is about 50); chunks are fetched in parallel via ThreadPoolExecutor.
  • JSON → Polars directly (no pandas intermediate).
  • Streaming download with tqdm progress bars.
  • Parquet caching at <cache_dir>/pums/{state}_{year}_{hh|person}.parquet.
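
The column chunking described above might look roughly like the sketch below. `chunk_columns` is a hypothetical helper, and repeating a key column (here SERIALNO) in every chunk so the pieces can be joined back together is an assumption; the source does not say how chunks are recombined, and person records would likely need an additional key.

```python
def chunk_columns(columns, key_columns=("SERIALNO",), max_columns=48):
    """Split columns into chunks of at most max_columns, keys repeated in each."""
    keys = list(key_columns)
    rest = [c for c in columns if c not in keys]
    size = max_columns - len(keys)
    if size <= 0:
        raise ValueError("max_columns must exceed the number of key columns")
    if not rest:
        return [keys]
    # Every chunk starts with the key columns, followed by a slice of the rest.
    return [keys + rest[i:i + size] for i in range(0, len(rest), size)]


# 100 hypothetical variables plus the key column yield three requests.
example_columns = ["SERIALNO"] + [f"V{i}" for i in range(100)]
example_chunks = chunk_columns(example_columns)
```

Each chunk could then be submitted as its own request, for example through `concurrent.futures.ThreadPoolExecutor`, and the results joined on the key columns.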

Transformation (recoding, aggregation) lives in control_data.

PUMSSource dataclass

Configuration for PUMS data source.

Attributes:

  • state_fips (str): Two-digit FIPS code for the state (e.g. "06" for California).
  • pums_year (int): ACS 1-year PUMS vintage (e.g. 2022).
  • puma_ids (list[str] | None): Optional list of PUMA codes to fetch. If None, fetches all PUMAs in the state (can be large).

fetch_pums_data

fetch_pums_data(
    source: PUMSSource,
    extra_hh_vars: set[str] | None = None,
    extra_person_vars: set[str] | None = None,
    load_replicate_weights: bool = False,
    cache_dir: Path | None = None,
) -> tuple[pl.DataFrame, pl.DataFrame]

Download PUMS household and person microdata from the Census API.

Parameters:

  • source (PUMSSource, required): State, year, and optional PUMA filter.
  • extra_hh_vars (set[str] | None, default None): Additional household PUMS variable names to fetch beyond the defaults.
  • extra_person_vars (set[str] | None, default None): Additional person PUMS variable names to fetch beyond the defaults.
  • load_replicate_weights (bool, default False): If True, also fetch the 80 replicate weight columns per table (WGTP1 to WGTP80 and PWGTP1 to PWGTP80). Required for MOE-based importance calculation.
  • cache_dir (Path | None, default None): If set, raw PUMS data is cached as Parquet files under cache_dir/pums/. Subsequent calls with the same state/year load from cache instead of hitting the API.

Returns:

  • tuple[pl.DataFrame, pl.DataFrame]: Tuple of (households, persons) Polars DataFrames with PUMS data cast to appropriate dtypes.
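
To show why the replicate weights matter for MOE-based importance, here is a minimal sketch of the standard ACS replicate-weight calculation: the standard error is the square root of 4/80 times the sum of squared differences between each replicate estimate and the full-sample estimate, and the 90% margin of error is 1.645 times that. The helper names are hypothetical, and how this package turns an MOE into an importance value is not covered by the source.

```python
import math

N_REPLICATES = 80  # ACS PUMS ships 80 replicate weights per record


def replicate_weight_columns(prefix: str) -> list[str]:
    """Column names such as WGTP1..WGTP80 (prefix 'WGTP') or PWGTP1..PWGTP80."""
    return [f"{prefix}{i}" for i in range(1, N_REPLICATES + 1)]


def margin_of_error(estimate: float, replicate_estimates: list[float]) -> float:
    """90% margin of error from 80 replicate estimates of the same statistic."""
    if len(replicate_estimates) != N_REPLICATES:
        raise ValueError("expected one estimate per replicate weight")
    squared_diffs = sum((rep - estimate) ** 2 for rep in replicate_estimates)
    standard_error = math.sqrt(4.0 / N_REPLICATES * squared_diffs)
    return 1.645 * standard_error
```

Each replicate estimate is the same weighted total recomputed with one replicate weight column (for example WGTP17) in place of the main weight.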

load_pums_from_files

load_pums_from_files(
    hh_path: str,
    person_path: str,
    state_fips: str | None = None,
    puma_ids: list[str] | None = None,
    load_replicate_weights: bool = False,
) -> tuple[pl.DataFrame, pl.DataFrame]

Load PUMS data from local CSV/Parquet files.

Parameters:

  • hh_path (str, required): Path to household PUMS file (CSV or Parquet).
  • person_path (str, required): Path to person PUMS file (CSV or Parquet).
  • state_fips (str | None, default None): Optional filter to a specific state FIPS code.
  • puma_ids (list[str] | None, default None): Optional filter to specific PUMAs.
  • load_replicate_weights (bool, default False): If True, retain WGTP1 to WGTP80 and PWGTP1 to PWGTP80 replicate weight columns.

Returns:

  • tuple[pl.DataFrame, pl.DataFrame]: Tuple of (households, persons) Polars DataFrames.

processing.weighting.data_prep.control_data

PUMS control-data transformation.

Recodes raw PUMS variables into the shared control categories defined in controls.py, then aggregates weighted totals by geography (PUMA by default).

All mapping logic lives in controls.py. This module only orchestrates recoding (via ctrl.pums_expr()) and aggregation.

Approach:

  1. Load PUMS data; join persons to households to carry PUMA geography.
  2. Join crosswalk; multiply PUMS weight by allocation_weight to distribute into custom zones.
  3. For each control: apply filter, recode variable into bins/groups, aggregate weighted sum by (ctrl_geoid, category).
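
Steps 2 and 3 can be illustrated with plain Python. This is a sketch under stated assumptions: the rows, PUMA codes, and zone names are invented, and `control_totals` is a hypothetical helper, whereas the module itself works on Polars DataFrames.

```python
from collections import defaultdict

# Hypothetical recoded PUMS household rows: (PUMA, category, household weight).
pums_rows = [
    ("07501", "1", 40.0),
    ("07501", "2", 60.0),
    ("07502", "1", 30.0),
]
# Hypothetical crosswalk rows: (PUMA, ctrl_geoid, allocation_weight).
crosswalk = [
    ("07501", "zone_a", 0.75),
    ("07501", "zone_b", 0.25),
    ("07502", "zone_b", 1.0),
]


def control_totals(pums_rows, crosswalk):
    """Spread each record's weight across zones, then sum by (zone, category)."""
    zones_by_puma = defaultdict(list)
    for puma, zone, alloc in crosswalk:
        zones_by_puma[puma].append((zone, alloc))
    totals = defaultdict(float)
    for puma, category, weight in pums_rows:
        for zone, alloc in zones_by_puma.get(puma, []):
            totals[(zone, category)] += weight * alloc
    return dict(totals)


totals = control_totals(pums_rows, crosswalk)
```

Because the allocation weights of each PUMA sum to 1 in this example, the zone totals add up to the original sum of PUMS weights.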

Control Variable YAML Configuration Example:

controls:
  # Simple marginal — household size
  - name: h_size
    table: households
    variable: NP
    bins:
      "1":  [1, 1]
      "2":  [2, 2]
      "3":  [3, 3]
      "4+": [4, 99]

  # Grouped marginal — commute mode
  - name: commute_mode
    table: persons
    variable: JWTRNS
    groups:
      drove_alone: [1]
      carpool:     [2, 3]
      transit:     [4, 5, 6, 7, 8, 9]
      other:       [10, 11, 12]
    filter: "ESR in [1,2,4,5]"   # employed persons only
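
To make the bins and groups semantics concrete, the sketch below recodes single values against the configuration above. The helpers `recode_bins` and `recode_groups` are hypothetical, and treating each bin as an inclusive [low, high] range is an assumption suggested by entries such as [1, 1]. Returning None for unmatched values is also an assumption.

```python
# Mirrors the YAML example above as plain Python dictionaries.
H_SIZE_BINS = {"1": [1, 1], "2": [2, 2], "3": [3, 3], "4+": [4, 99]}
COMMUTE_GROUPS = {
    "drove_alone": [1],
    "carpool": [2, 3],
    "transit": [4, 5, 6, 7, 8, 9],
    "other": [10, 11, 12],
}


def recode_bins(value, bins):
    """Return the label whose inclusive [low, high] range contains value."""
    for label, (low, high) in bins.items():
        if low <= value <= high:
            return label
    return None  # value falls outside every bin


def recode_groups(value, groups):
    """Return the label whose code list contains value."""
    for label, codes in groups.items():
        if value in codes:
            return label
    return None  # code not assigned to any group
```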

ControlSpec dataclass

Specification for a single weighting control.

Attributes:

  • name (str): Registry name (must exist in CONTROLS).
  • importance (float | None): Explicit importance weight for the balancer. None means use the default (100 for normal controls, 1000 for structural) or the MOE-derived value when moe_based_importance is enabled.
  • dimensions (list[str] | None): For cross-tab controls only: list of dimension control names (e.g. ["h_size", "h_income"]). None for standard controls.
  • merges (dict | None): For cross-tab controls only: per-dimension merge specs applied at registration time. None for standard controls.

ControlTotals dataclass

Result of PUMS control-total aggregation.

Attributes:

  • totals (pl.DataFrame): Tidy frame with columns: [geo_id, control_name, category, target_total]
  • pums_hh_count (int): Total PUMS housing unit records (before weighting).
  • pums_person_count (int): Total PUMS person records.
  • geo_ids (list[str]): Unique geography IDs in the totals.

recode_pums_households

recode_pums_households(
    hh_df: pl.DataFrame,
    person_df: pl.DataFrame,
    targets: list[str] | None = None,
) -> pl.DataFrame

Recode PUMS household records into control categories.

Derives person-level aggregates (workers, children) then loops over household-level controls calling ctrl.from_pums_row.

Parameters:

  • hh_df (pl.DataFrame, required): PUMS household microdata.
  • person_df (pl.DataFrame, required): PUMS person microdata (used to derive hh-level aggregates).
  • targets (list[str] | None, default None): Registry keys to recode. None → all household controls.

recode_pums_persons

recode_pums_persons(
    person_df: pl.DataFrame, targets: list[str] | None = None
) -> pl.DataFrame

Recode PUMS person records into control categories.

Parameters:

  • person_df (pl.DataFrame, required): PUMS person microdata.
  • targets (list[str] | None, default None): Registry keys to recode. None → all person controls.

build_control_totals

build_control_totals(
    hh_df: pl.DataFrame,
    person_df: pl.DataFrame,
    controls: list[ControlSpec],
    geo_col: str = "PUMA",
) -> ControlTotals

Build weighted control totals from recoded PUMS data.

Parameters:

  • hh_df (pl.DataFrame, required): Recoded PUMS household data (from recode_pums_households).
  • person_df (pl.DataFrame, required): Recoded PUMS person data (from recode_pums_persons).
  • controls (list[ControlSpec], required): Control specifications (which variables to include).
  • geo_col (str, default "PUMA"): Column name for the geography identifier.

Returns:

  • ControlTotals: Tidy totals frame and metadata.

apply_zone_groups

apply_zone_groups(
    control_totals: ControlTotals,
    seed: pl.DataFrame,
    zone_groups: dict[str, list[str]],
    geo_col: str = "ctrl_geoid",
) -> tuple[ControlTotals, pl.DataFrame]

Merge zones for balancing while preserving crosswalk granularity.

Zones listed under a group name are remapped to that group in both the control totals and the seed table. Unmapped zones pass through unchanged.
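
The remapping rule can be sketched as follows. The helper names and zone IDs are hypothetical, and rejecting a zone that appears in more than one group is an assumption not stated in the source.

```python
from collections import defaultdict


def build_zone_lookup(zone_groups):
    """Invert {group: [zones]} into {zone: group}."""
    lookup = {}
    for group, zones in zone_groups.items():
        for zone in zones:
            if zone in lookup:
                # Assumption: a zone may belong to at most one group.
                raise ValueError(f"zone {zone!r} is listed in more than one group")
            lookup[zone] = group
    return lookup


def remap_totals(totals, lookup):
    """Sum {(zone, category): total} after remapping zones to their groups."""
    merged = defaultdict(float)
    for (zone, category), value in totals.items():
        # Unmapped zones pass through unchanged.
        merged[(lookup.get(zone, zone), category)] += value
    return dict(merged)


zone_lookup = build_zone_lookup({"north": ["z1", "z2"]})
merged_totals = remap_totals(
    {("z1", "1"): 10.0, ("z2", "1"): 5.0, ("z3", "1"): 7.0}, zone_lookup
)
```

The same lookup would be applied to the geography column of the seed table so that seed records and control totals share group-level IDs.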

processing.weighting.data_prep.seed_data

Survey seed-data preparation for weighting.

Uses ControlTarget.survey_expr() to recode canonical survey data into control-category ints via native Polars expressions (vectorised, no map_elements). Every control is handled uniformly.

Survey field mapping is hardcoded in each ControlTarget subclass — the survey_fields class attribute declares which canonical survey columns the control reads, and survey_expr() returns the Polars expression that maps those values to control-category ints. For example:

  • HHIncomeControl.survey_fields = ("income_bin",) — reads the already-binned income_bin column via identity_expr.
  • GenderControl.survey_fields = ("gender",) — maps Gender enum values to GenderCategory ints via replace_strict.
  • HHSizeControl.survey_fields = ("_n_persons",) — clips the person-count aggregate column to [1, 10].

No YAML config is involved; the mapping lives entirely in the control class definitions in processing.weighting.controls.

recode_survey_households

recode_survey_households(
    households: pl.DataFrame,
    persons: pl.DataFrame,
    targets: list[str],
    strict_nulls: bool = False,
) -> pl.DataFrame

Recode canonical survey households into control categories.

Parameters:

  • households (pl.DataFrame, required): Canonical households table (must have hh_id).
  • persons (pl.DataFrame, required): Canonical persons table (must have hh_id).
  • targets (list[str], required): Registry keys to recode. Person-level keys are ignored.
  • strict_nulls (bool, default False): If True, raise on any null recode output.

Raises:

  • ValueError: Unknown target.
  • KeyError: Missing required column.

recode_survey_persons

recode_survey_persons(
    persons: pl.DataFrame,
    targets: list[str],
    strict_nulls: bool = False,
) -> pl.DataFrame

Recode canonical survey persons into control categories.

Parameters:

  • persons (pl.DataFrame, required): Canonical persons table.
  • targets (list[str], required): Registry keys to recode. Household-level keys are ignored.
  • strict_nulls (bool, default False): If True, raise on any null recode output.

Raises:

  • ValueError: Unknown target.
  • KeyError: Missing required column.