Data Preparation

processing.weighting.data_prep

Data-preparation toolbox for the weighting workflow.

This sub-package does not orchestrate the weighting pipeline itself. Instead, it provides the reusable building blocks that the weighting pipeline uses to prepare geography, PUMS inputs, control totals, and survey seed data.

Modules

Census geography (census_geo) -- download and cache TIGER PUMA and block shapefiles via pygris, including block-level population inputs.
Geography crosswalk (crosswalk) -- construct a population-weighted PUMA-to-target-zone allocation table and assign/allocate records across project geographies.
PUMS data (pums_data) -- fetch ACS 1-year PUMS microdata from the Census API or load local extracts, with chunking and caching helpers.
Control data (control_data) -- recode PUMS variables into weighting control categories and aggregate them into zone-level control totals.
Seed data (seed_data) -- recode canonical survey variables into the same control categories used by the PUMS-based controls.

Together these modules provide the shared data-preparation utilities used by WeightingPipeline before balancing begins.

processing.weighting.data_prep.crosswalk

PUMA-specific crosswalk wrapper.

PumaCrosswalk fetches Census PUMA and block geographies and delegates the heavy lifting to build_crosswalk.

See docs/pipeline_steps/weighting/crosswalk.md for the detailed crosswalk explanation, diagrams, and worked example.

Approach:

Load target zone polygons; auto-discover overlapping PUMAs.
Download/cache TIGER PUMA and block shapefiles (via census_geo).
Rasterize block population into a density grid.
Rasterize PUMA IDs into a categorical label grid.
exactextract: compute sum(population) per target zone, grouped by PUMA label.
Normalise: allocation_weight = pop(puma, target) / pop(puma) per PUMA.
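
The normalisation step can be illustrated with a minimal sketch in plain Python (no Polars, to keep it dependency-free). The helper name `allocation_weights` and the sample populations are hypothetical; the real logic lives in build_crosswalk and may differ in detail.

```python
def allocation_weights(overlap_pop, puma_pop):
    """Compute allocation_weight = pop(puma, target) / pop(puma).

    overlap_pop: {(puma_id, study_geoid): population inside the overlap}
    puma_pop:    {puma_id: total population of the PUMA}
    """
    rows = []
    for (puma_id, study_geoid), pop in sorted(overlap_pop.items()):
        total = puma_pop[puma_id]
        # Guard against empty PUMAs to avoid dividing by zero.
        weight = pop / total if total > 0 else 0.0
        rows.append(
            {
                "puma_id": puma_id,
                "study_geoid": study_geoid,
                "population": pop,
                "allocation_weight": weight,
            }
        )
    return rows


# Hypothetical populations, for illustration only.
overlap_pop = {("00101", "A"): 600.0, ("00101", "B"): 400.0, ("00102", "B"): 250.0}
puma_pop = {"00101": 1000.0, "00102": 500.0}
crosswalk_rows = allocation_weights(overlap_pop, puma_pop)
```

In this sketch PUMA 00101 lies entirely inside zones A and B, so its weights sum to 1. Only half of PUMA 00102 falls in the study area, so its single weight is 0.5.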

Resolution affects only how finely population is distributed within each block. Boundary accuracy is exact at any resolution because exactextract computes sub-cell coverage analytically.

Outputs:

crosswalk_df: puma_id, study_geoid, population, allocation_weight.
puma_ids: list of PUMAs overlapping the study area.

Example

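from processing.weighting.data_prep.crosswalk import PumaCrosswalk

# geo_cfg is assumed to be a GeographyConfig built from the weighting YAML config.
xw = PumaCrosswalk(geo_cfg, state_fips="06", pums_year=2023)
households = xw.assign_households(households)
pums_hh, pums_per = xw.allocate_pums_weights(pums_hh, pums_per)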

PumaCrosswalk

Population-weighted PUMA-to-target-zone crosswalk.

Loads target zones and Census geographies on construction, rasterizes block population, and cross-tabulates to produce allocation weights. Exposes crosswalk_df, puma_ids, and target_gdf for downstream use. The crosswalk produces both study_geoid (raw target zone from the shapefile) and ctrl_geoid (balancing geography — equal to study_geoid unless zone groups are configured).

TargetZoneConfig `pydantic-model`

Target zone polygon source.

Fields:

file (str)
id_field (str | None)

GeographyConfig `pydantic-model`

Geography block of the weighting YAML config.

Attributes:

Name	Type	Description
`target_zones`	`TargetZoneConfig`	Polygon file and optional zone-ID column.
`resolution`	`int`	Raster cell size in metres (default 250).

Fields:

target_zones (TargetZoneConfig)
resolution (int)
min_allocation (float)

Validators:

_positive_resolution → resolution
_valid_min_allocation → min_allocation

processing.weighting.data_prep.census_geo

Census TIGER geography loading.

PUMA and block geometries share the same decennial vintage (2010 for pre-2022 PUMS, 2020 for 2022+) so that block GEOIDs match the sample plan. Block population is fetched from the Census decennial API (PL 94-171 for 2020, SF1 for 2010) and joined by GEOID — the TIGER shapefiles themselves do not always carry population columns.

Processed GeoDataFrames are cached as GeoParquet under <pipeline-cache-dir>/weighting/census_geo/ for fast repeat loads.

puma_vintage_for_pums_year

puma_vintage_for_pums_year(pums_year: int) -> int

Return the PUMA geography vintage for a given PUMS year.

ACS PUMS 2022+ uses 2020 PUMAs; 2012-2021 uses 2010 PUMAs.
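
The vintage rule is simple enough to sketch directly. The function below is a hypothetical stand-in, not the library's source. It treats every year before 2022 as 2010-vintage, which is an assumption for years earlier than 2012; the real function may reject those years instead.

```python
def puma_vintage_sketch(pums_year: int) -> int:
    """Return 2020 for PUMS years 2022 and later, otherwise 2010.

    Assumption: years before 2012 are not special-cased here.
    """
    return 2020 if pums_year >= 2022 else 2010


vintages = {year: puma_vintage_sketch(year) for year in (2019, 2021, 2022, 2023)}
```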

get_puma_gdf

get_puma_gdf(
    state_fips: str,
    pums_year: int,
    *,
    cache_dir: Path | None = None
) -> gpd.GeoDataFrame

Download PUMA polygons for state_fips via pygris.

Returns a GeoDataFrame with puma_id (str) and geometry.

get_block_gdf

get_block_gdf(
    state_fips: str,
    pums_year: int,
    *,
    cache_dir: Path | None = None
) -> gpd.GeoDataFrame

Download block geometries and join decennial population.

Uses the same vintage as PUMAs so block GEOIDs match the sample plan. Population is fetched from the Census decennial API and joined by GEOID.

The population is used for dasymetric, block-level weighting of PUMA-to-custom-zone allocations.

Returns a GeoDataFrame with block_id (str), block_pop (int), and geometry.

processing.weighting.data_prep.pums_data

PUMS microdata I/O.

Downloads ACS PUMS 1-year microdata directly from the Census Bureau API or loads from local CSV / Parquet files. Handles type-casting of Census API string responses to proper numeric dtypes.

API behaviour

All PUMAs batched in a single API request.
Column chunking when >48 columns (API limit ~50), parallel via ThreadPoolExecutor.
JSON → Polars directly (no pandas intermediate).
Streaming download with tqdm progress bars.
Parquet caching at <cache_dir>/pums/{state}_{year}_{hh|person}.parquet.
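
As an illustration of the column-chunking behaviour listed above, here is a minimal sketch. `fake_fetch` stands in for a single API request, and repeating a key column (`SERIALNO`) in every chunk so the pieces can be re-joined is an assumption about how reassembly might work, not a description of the actual implementation.

```python
from concurrent.futures import ThreadPoolExecutor

MAX_COLS_PER_REQUEST = 48  # stays under the API limit of roughly 50 variables


def chunk_columns(columns, key_cols, size=MAX_COLS_PER_REQUEST):
    """Split columns into chunks of at most `size`, each led by key_cols."""
    step = size - len(key_cols)
    if step <= 0:
        raise ValueError("size must exceed the number of key columns")
    payload = [c for c in columns if c not in key_cols]
    return [list(key_cols) + payload[i:i + step] for i in range(0, len(payload), step)]


def fake_fetch(cols):
    """Hypothetical stand-in for one API request; returns one record."""
    return {c: f"value_of_{c}" for c in cols}


def fetch_all(columns, key_cols=("SERIALNO",)):
    chunks = chunk_columns(columns, key_cols)
    with ThreadPoolExecutor(max_workers=4) as pool:
        parts = list(pool.map(fake_fetch, chunks))  # map preserves chunk order
    merged = {}
    for part in parts:
        merged.update(part)  # key columns overlap; payload columns are disjoint
    return chunks, merged


columns = ["SERIALNO"] + [f"VAR{i}" for i in range(100)]
chunks, merged = fetch_all(columns)
```

With 100 payload columns and one key column, this produces three requests of 48, 48, and 7 columns.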

Transformation (recoding, aggregation) lives in control_data.

PUMSSource `dataclass`

Configuration for PUMS data source.

Attributes:

Name	Type	Description
`state_fips`	`str`	Two-digit FIPS code for the state (e.g. "06" for California).
`pums_year`	`int`	ACS 1-year PUMS vintage (e.g. 2022).
`puma_ids`	`list[str] \| None`	Optional list of PUMA codes to fetch. If None, fetches all PUMAs in the state (can be large).

fetch_pums_data

fetch_pums_data(
    source: PUMSSource,
    extra_hh_vars: set[str] | None = None,
    extra_person_vars: set[str] | None = None,
    load_replicate_weights: bool = False,
    cache_dir: Path | None = None,
) -> tuple[pl.DataFrame, pl.DataFrame]

Download PUMS household and person microdata from the Census API.

Parameters:

Name	Type	Description	Default
`source`	`PUMSSource`	State, year, and optional PUMA filter.	required
`extra_hh_vars`	`set[str] \| None`	Additional household PUMS variable names to fetch beyond the defaults.	`None`
`extra_person_vars`	`set[str] \| None`	Additional person PUMS variable names to fetch beyond the defaults.	`None`
`load_replicate_weights`	`bool`	If `True`, also fetch the 80 replicate weight columns per table (`WGTP1` to `WGTP80` and `PWGTP1` to `PWGTP80`). Required for MOE-based importance calculation.	`False`
`cache_dir`	`Path \| None`	If set, raw PUMS data is cached as Parquet files under `cache_dir/pums/`. Subsequent calls with the same state/year load from cache instead of hitting the API.	`None`

Returns:

Type	Description
`tuple[pl.DataFrame, pl.DataFrame]`	Tuple of `(households, persons)` Polars DataFrames with PUMS data typed to appropriate dtypes.

load_pums_from_files

load_pums_from_files(
    hh_path: str,
    person_path: str,
    state_fips: str | None = None,
    puma_ids: list[str] | None = None,
    load_replicate_weights: bool = False,
) -> tuple[pl.DataFrame, pl.DataFrame]

Load PUMS data from local CSV/Parquet files.

Parameters:

Name	Type	Description	Default
`hh_path`	`str`	Path to household PUMS file (CSV or Parquet).	required
`person_path`	`str`	Path to person PUMS file (CSV or Parquet).	required
`state_fips`	`str \| None`	Optional filter to a specific state FIPS code.	`None`
`puma_ids`	`list[str] \| None`	Optional filter to specific PUMAs.	`None`
`load_replicate_weights`	`bool`	If `True`, retain `WGTP1` to `WGTP80` and `PWGTP1` to `PWGTP80` replicate weight columns.	`False`

Returns:

Type	Description
`tuple[pl.DataFrame, pl.DataFrame]`	Tuple of `(households, persons)` Polars DataFrames.

processing.weighting.data_prep.control_data

PUMS control-data transformation.

Recodes raw PUMS variables into the shared control categories defined in controls.py, then aggregates weighted totals by geography (PUMA by default).

All mapping logic lives in controls.py. This module only orchestrates recoding (via ctrl.pums_expr()) and aggregation.

Approach:

Load PUMS data; join persons to households to carry PUMA geography.
Join crosswalk; multiply PUMS weight by allocation_weight to distribute into custom zones.
For each control: apply filter, recode variable into bins/groups, aggregate weighted sum by (ctrl_geoid, category).
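
The three steps above can be sketched in plain Python. This is a simplified illustration, not the module's Polars implementation: the helper `aggregate_control`, the record layout, and the sample weights are all hypothetical.

```python
from collections import defaultdict


def aggregate_control(records, crosswalk, weight_col, category_of, keep=lambda r: True):
    """Sum weight * allocation_weight by (ctrl_geoid, category)."""
    totals = defaultdict(float)
    for rec in records:
        if not keep(rec):  # per-control filter
            continue
        category = category_of(rec)  # recode into a control category
        for xw in crosswalk.get(rec["PUMA"], []):
            # Distribute the PUMS weight across the zones the PUMA overlaps.
            allocated = rec[weight_col] * xw["allocation_weight"]
            totals[(xw["ctrl_geoid"], category)] += allocated
    return dict(totals)


# Hypothetical inputs, for illustration only.
households = [
    {"PUMA": "00101", "WGTP": 10, "NP": 1},
    {"PUMA": "00101", "WGTP": 20, "NP": 5},
    {"PUMA": "00102", "WGTP": 30, "NP": 2},
]
crosswalk = {
    "00101": [
        {"ctrl_geoid": "A", "allocation_weight": 0.75},
        {"ctrl_geoid": "B", "allocation_weight": 0.25},
    ],
    "00102": [{"ctrl_geoid": "B", "allocation_weight": 1.0}],
}


def size_category(rec):
    return "4+" if rec["NP"] >= 4 else str(rec["NP"])


h_size_totals = aggregate_control(households, crosswalk, "WGTP", size_category)
```

Because each PUMA's allocation weights here sum to 1, the zone totals add up to the original PUMS weight total of 60.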

Control variable YAML configuration example:

controls:
  # Simple marginal — household size
  - name: h_size
    table: households
    variable: NP
    bins:
      "1":  [1, 1]
      "2":  [2, 2]
      "3":  [3, 3]
      "4+": [4, 99]

  # Grouped marginal — commute mode
  - name: commute_mode
    table: persons
    variable: JWTRNS
    groups:
      drove_alone: [1]
      carpool:     [2, 3]
      transit:     [4, 5, 6, 7, 8, 9]
      other:       [10, 11, 12]
    filter: "ESR in [1,2,4,5]"   # employed persons only
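
To make the `bins` and `groups` semantics concrete, here is a small sketch of how such a spec could be applied to raw values. It assumes bin bounds are inclusive `[low, high]` pairs and that unmatched values map to `None`; both are assumptions, and the helper names are hypothetical.

```python
def recode_bins(value, bins):
    """Return the label of the first inclusive [low, high] range containing value."""
    for label, (low, high) in bins.items():
        if low <= value <= high:
            return label
    return None  # assumption: unmatched values are left uncategorised


def recode_groups(value, groups):
    """Return the label of the group whose code list contains value."""
    for label, codes in groups.items():
        if value in codes:
            return label
    return None


# Same categories as the YAML example above.
h_size_bins = {"1": [1, 1], "2": [2, 2], "3": [3, 3], "4+": [4, 99]}
commute_groups = {
    "drove_alone": [1],
    "carpool": [2, 3],
    "transit": [4, 5, 6, 7, 8, 9],
    "other": [10, 11, 12],
}

size_labels = [recode_bins(n, h_size_bins) for n in (1, 3, 7)]
mode_labels = [recode_groups(code, commute_groups) for code in (1, 3, 8, 12)]
```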

ControlSpec `dataclass`

Specification for a single weighting control.

Attributes:

Name	Type	Description
`name`	`str`	Registry name (must exist in `CONTROLS`).
`importance`	`float \| None`	Explicit importance weight for the balancer. `None` means use the default (100 for normal controls, 1000 for structural) or the MOE-derived value when `moe_based_importance` is enabled.
`dimensions`	`list[str] \| None`	For cross-tab controls only: list of dimension control names (e.g. `["h_size", "h_income"]`). `None` for standard controls.
`merges`	`dict \| None`	For cross-tab controls only: per-dimension merge specs applied at registration time. `None` for standard controls.

ControlTotals `dataclass`

Result of PUMS control-total aggregation.

Attributes:

Name	Type	Description
`totals`	`pl.DataFrame`	Tidy frame with columns: [geo_id, control_name, category, target_total]
`pums_hh_count`	`int`	Total PUMS housing unit records (before weighting).
`pums_person_count`	`int`	Total PUMS person records.
`geo_ids`	`list[str]`	Unique geography IDs in the totals.

recode_pums_households

recode_pums_households(
    hh_df: pl.DataFrame,
    person_df: pl.DataFrame,
    targets: list[str] | None = None,
) -> pl.DataFrame

Recode PUMS household records into control categories.

Derives person-level aggregates (workers, children) then loops over household-level controls calling ctrl.from_pums_row.

Parameters:

Name	Type	Description	Default
`hh_df`	`pl.DataFrame`	PUMS household microdata.	required
`person_df`	`pl.DataFrame`	PUMS person microdata (used to derive hh-level aggregates).	required
`targets`	`list[str] \| None`	Registry keys to recode. `None` → all household controls.	`None`

recode_pums_persons

recode_pums_persons(
    person_df: pl.DataFrame, targets: list[str] | None = None
) -> pl.DataFrame

Recode PUMS person records into control categories.

Parameters:

Name	Type	Description	Default
`person_df`	`pl.DataFrame`	PUMS person microdata.	required
`targets`	`list[str] \| None`	Registry keys to recode. `None` → all person controls.	`None`

build_control_totals

build_control_totals(
    hh_df: pl.DataFrame,
    person_df: pl.DataFrame,
    controls: list[ControlSpec],
    geo_col: str = "PUMA",
) -> ControlTotals

Build weighted control totals from recoded PUMS data.

Parameters:

Name	Type	Description	Default
`hh_df`	`pl.DataFrame`	Recoded PUMS household data (from `recode_pums_households`).	required
`person_df`	`pl.DataFrame`	Recoded PUMS person data (from `recode_pums_persons`).	required
`controls`	`list[ControlSpec]`	Control specifications (which variables to include).	required
`geo_col`	`str`	Column name for the geography identifier. Defaults to `"PUMA"`.	`'PUMA'`

Returns:

Type	Description
`ControlTotals`	Tidy totals frame and metadata.

apply_zone_groups

apply_zone_groups(
    control_totals: ControlTotals,
    seed: pl.DataFrame,
    zone_groups: dict[str, list[str]],
    geo_col: str = "ctrl_geoid",
) -> tuple[ControlTotals, pl.DataFrame]

Merge zones for balancing while preserving crosswalk granularity.

Zones listed under a group name are remapped to that group in both the control totals and the seed table. Unmapped zones pass through unchanged.
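
A minimal sketch of the remapping described above, in plain Python. The helper names are hypothetical, and summing the totals of zones that land in the same group is an assumption about how merged controls are combined.

```python
def invert_groups(zone_groups):
    """Turn {group: [zones]} into {zone: group}."""
    return {zone: group for group, zones in zone_groups.items() for zone in zones}


def remap_zones(zone_ids, zone_groups):
    """Replace grouped zones with their group name; others pass through."""
    zone_to_group = invert_groups(zone_groups)
    return [zone_to_group.get(zone, zone) for zone in zone_ids]


def merge_totals(totals, zone_groups):
    """totals: iterable of (geo_id, control_name, category, target_total)."""
    zone_to_group = invert_groups(zone_groups)
    merged = {}
    for geo_id, control, category, total in totals:
        key = (zone_to_group.get(geo_id, geo_id), control, category)
        merged[key] = merged.get(key, 0.0) + total  # assumption: totals are summed
    return merged


# Hypothetical zones and totals, for illustration only.
zone_groups = {"north": ["z1", "z2"]}
totals = [
    ("z1", "h_size", "1", 10.0),
    ("z2", "h_size", "1", 5.0),
    ("z3", "h_size", "1", 7.0),
]
merged_totals = merge_totals(totals, zone_groups)
seed_zones = remap_zones(["z1", "z3", "z2"], zone_groups)
```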

processing.weighting.data_prep.seed_data

Survey seed-data preparation for weighting.

Uses ControlTarget.survey_expr() to recode canonical survey data into control-category ints via native Polars expressions (vectorised, no map_elements). Every control is handled uniformly.

Survey field mapping is hardcoded in each ControlTarget subclass — the survey_fields class attribute declares which canonical survey columns the control reads, and survey_expr() returns the Polars expression that maps those values to control-category ints. For example:

HHIncomeControl.survey_fields = ("income_bin",) — reads the already-binned income_bin column via identity_expr.
GenderControl.survey_fields = ("gender",) — maps Gender enum values to GenderCategory ints via replace_strict.
HHSizeControl.survey_fields = ("_n_persons",) — clips the person-count aggregate column to [1, 10].

No YAML config is involved; the mapping lives entirely in the control class definitions in processing.weighting.controls.
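
The two recoding patterns mentioned above, clipping a count and strictly mapping enum values, can be illustrated without Polars. Everything in this sketch is hypothetical: the category codes, the sample persons, and the helper names. It only mirrors the idea of the Polars expressions the control classes return.

```python
def clip(value, low, high):
    """Clamp value into the inclusive range [low, high]."""
    return max(low, min(high, value))


def replace_strict(value, mapping):
    """Map value through mapping, raising if it has no entry."""
    if value not in mapping:
        raise ValueError(f"unmapped value: {value!r}")
    return mapping[value]


GENDER_TO_CATEGORY = {"female": 1, "male": 2}  # hypothetical codes

persons = [
    {"hh_id": 1, "gender": "female"},
    {"hh_id": 1, "gender": "male"},
    {"hh_id": 2, "gender": "male"},
]

# Person-count aggregate per household, then clipped to [1, 10].
n_persons = {}
for person in persons:
    n_persons[person["hh_id"]] = n_persons.get(person["hh_id"], 0) + 1
hh_size_category = {hh_id: clip(n, 1, 10) for hh_id, n in n_persons.items()}

gender_category = [replace_strict(p["gender"], GENDER_TO_CATEGORY) for p in persons]
```

Raising on unmapped values, rather than silently producing nulls, surfaces unexpected survey codes early; the `strict_nulls` flag on the recode functions serves a related purpose for null outputs.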

recode_survey_households

recode_survey_households(
    households: pl.DataFrame,
    persons: pl.DataFrame,
    targets: list[str],
    strict_nulls: bool = False,
) -> pl.DataFrame

Recode canonical survey households into control categories.

Parameters:

Name	Type	Description	Default
`households`	`pl.DataFrame`	Canonical households table (must have `hh_id`).	required
`persons`	`pl.DataFrame`	Canonical persons table (must have `hh_id`).	required
`targets`	`list[str]`	Registry keys to recode. Person-level keys are ignored.	required
`strict_nulls`	`bool`	If `True`, raise on any null recode output.	`False`

Raises:

Type	Description
`ValueError`	Unknown target.
`KeyError`	Missing required column.

recode_survey_persons

recode_survey_persons(
    persons: pl.DataFrame,
    targets: list[str],
    strict_nulls: bool = False,
) -> pl.DataFrame

Recode canonical survey persons into control categories.

Parameters:

Name	Type	Description	Default
`persons`	`pl.DataFrame`	Canonical persons table.	required
`targets`	`list[str]`	Registry keys to recode. Household-level keys are ignored.	required
`strict_nulls`	`bool`	If `True`, raise on any null recode output.	`False`

Raises:

Type	Description
`ValueError`	Unknown target.
`KeyError`	Missing required column.
