Data Preparation
processing.weighting.data_prep
Data-preparation toolbox for the weighting workflow.
This sub-package does not orchestrate the weighting pipeline itself. Instead, it provides the reusable building blocks that the weighting pipeline uses to prepare geography, PUMS inputs, control totals, and survey seed data.
Modules
- Census geography (census_geo) -- download and cache TIGER PUMA and block shapefiles via pygris, including block-level population inputs.
- Geography crosswalk (crosswalk) -- construct a population-weighted PUMA-to-target-zone allocation table and assign/allocate records across project geographies.
- PUMS data (pums_data) -- fetch ACS 1-year PUMS microdata from the Census API or load local extracts, with chunking and caching helpers.
- Control data (control_data) -- recode PUMS variables into weighting control categories and aggregate them into zone-level control totals.
- Seed data (seed_data) -- recode canonical survey variables into the same control categories used by the PUMS-based controls.
Together these modules provide the shared data-preparation utilities used by WeightingPipeline before balancing begins.
processing.weighting.data_prep.crosswalk
PUMA-specific crosswalk wrapper.
PumaCrosswalk fetches Census PUMA and block geographies and
delegates the heavy lifting to build_crosswalk.
See docs/pipeline_steps/weighting/crosswalk.md for the detailed
crosswalk explanation, diagrams, and worked example.
- Load target zone polygons; auto-discover overlapping PUMAs.
- Download/cache TIGER PUMA and block shapefiles (via census_geo).
- Rasterize block population into a density grid.
- Rasterize PUMA IDs into a categorical label grid.
- exactextract: compute sum(population) per target zone, grouped by PUMA label.
- Normalise: allocation_weight = pop(puma, target) / pop(puma) per PUMA.
Resolution only affects within-block population distribution granularity —
boundary accuracy is exact at any resolution due to exactextract's
analytical sub-cell coverage.
Outputs:
- crosswalk_df: puma_id, study_geoid, population, allocation_weight.
- puma_ids: list of PUMAs overlapping the study area.
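The normalisation step can be illustrated with a small plain-Python sketch. It is not the package's implementation: the function name `allocation_weights` and the input shapes are hypothetical, and because the text does not say whether pop(puma) includes population outside the study area, the sketch takes PUMA totals as an explicit input.

```python
def allocation_weights(pop_by_pair, puma_pop):
    """Compute allocation_weight = pop(puma, target) / pop(puma).

    pop_by_pair: {(puma_id, study_geoid): population in the overlap}
    puma_pop:    {puma_id: total population of the PUMA} (assumed input)
    """
    rows = []
    for (puma_id, study_geoid), population in sorted(pop_by_pair.items()):
        total = puma_pop[puma_id]
        # Guard against empty PUMAs rather than dividing by zero.
        weight = population / total if total > 0 else 0.0
        rows.append(
            {
                "puma_id": puma_id,
                "study_geoid": study_geoid,
                "population": population,
                "allocation_weight": weight,
            }
        )
    return rows


# Illustrative numbers only, not real Census data.
pop_by_pair = {
    ("00101", "A"): 750.0,
    ("00101", "B"): 250.0,
    ("00102", "B"): 400.0,
}
puma_pop = {"00101": 1000.0, "00102": 800.0}
crosswalk_rows = allocation_weights(pop_by_pair, puma_pop)
```

In this example PUMA 00102 lies only half inside the study area, so its single weight is 0.5 and the weights for that PUMA do not sum to 1.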
Example
xw = PumaCrosswalk(geo_cfg, state_fips="06", pums_year=2023)
households = xw.assign_households(households)
pums_hh, pums_per = xw.allocate_pums_weights(pums_hh, pums_per)
PumaCrosswalk
Population-weighted PUMA-to-target-zone crosswalk.
Loads target zones and Census geographies on construction, rasterizes
block population, and cross-tabulates to produce allocation weights.
Exposes crosswalk_df, puma_ids, and target_gdf for
downstream use. The crosswalk produces both study_geoid (raw
target zone from the shapefile) and ctrl_geoid (balancing
geography — equal to study_geoid unless zone groups are
configured).
TargetZoneConfig
pydantic-model
Target zone polygon source.
Fields:
- file (str)
- id_field (str | None)
GeographyConfig
pydantic-model
Geography block of the weighting YAML config.
Attributes:
| Name | Type | Description |
|---|---|---|
| target_zones | TargetZoneConfig | Polygon file and optional zone-ID column. |
| resolution | int | Raster cell size in metres (default 250). |
Fields:
- target_zones (TargetZoneConfig)
- resolution (int)
- min_allocation (float)
Validators:
- _positive_resolution → resolution
- _valid_min_allocation → min_allocation
processing.weighting.data_prep.census_geo
Census TIGER geography loading.
PUMA and block geometries share the same decennial vintage (2010 for pre-2022 PUMS, 2020 for 2022+) so that block GEOIDs match the sample plan. Block population is fetched from the Census decennial API (PL 94-171 for 2020, SF1 for 2010) and joined by GEOID — the TIGER shapefiles themselves do not always carry population columns.
Processed GeoDataFrames are cached as GeoParquet under
<pipeline-cache-dir>/weighting/census_geo/ for fast repeat loads.
puma_vintage_for_pums_year
puma_vintage_for_pums_year(pums_year: int) -> int
Return the PUMA geography vintage for a given PUMS year.
ACS PUMS 2022+ uses 2020 PUMAs; 2012-2021 uses 2010 PUMAs.
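The vintage rule reduces to a simple year comparison. The sketch below uses a hypothetical name, `puma_vintage_sketch`, and raising for years before 2012 is an assumption, since the text does not describe that case.

```python
def puma_vintage_sketch(pums_year: int) -> int:
    """Map an ACS PUMS year to the decennial PUMA vintage it uses."""
    if pums_year >= 2022:
        return 2020
    if pums_year >= 2012:
        return 2010
    # Assumption: earlier years are out of scope for this sketch.
    raise ValueError(f"No PUMA vintage defined here for PUMS year {pums_year}")
```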
get_puma_gdf
get_puma_gdf(
state_fips: str,
pums_year: int,
*,
cache_dir: Path | None = None
) -> gpd.GeoDataFrame
Download PUMA polygons for state_fips via pygris.
Returns a GeoDataFrame with puma_id (str) and geometry.
get_block_gdf
get_block_gdf(
state_fips: str,
pums_year: int,
*,
cache_dir: Path | None = None
) -> gpd.GeoDataFrame
Download block geometries and join decennial population.
Uses the same vintage as PUMAs so block GEOIDs match the sample plan. Population is fetched from the Census decennial API and joined by GEOID.
The population is used for dasymetric, block-level weighting of PUMA-to-custom-zone allocations.
Returns a GeoDataFrame with block_id (str), block_pop (int),
and geometry.
processing.weighting.data_prep.pums_data
PUMS microdata I/O.
Downloads ACS PUMS 1-year microdata directly from the Census Bureau API or loads from local CSV / Parquet files. Handles type-casting of Census API string responses to proper numeric dtypes.
API behaviour
- All PUMAs batched in a single API request.
- Column chunking when >48 columns (API limit ~50), parallel via ThreadPoolExecutor.
- JSON → Polars directly (no pandas intermediate).
- Streaming download with tqdm progress bars.
- Parquet caching at <cache_dir>/pums/{state}_{year}_{hh|person}.parquet.
Transformation (recoding, aggregation) lives in control_data.
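The column-chunking behaviour can be sketched as follows. This is a minimal illustration, not the module's code: `chunk_columns`, `fetch_chunk`, and `fetch_all` are hypothetical names, the API call is replaced by a stand-in that fabricates values, and repeating a key column (here SERIALNO) in every chunk so the pieces can be rejoined is an assumption.

```python
from concurrent.futures import ThreadPoolExecutor

MAX_COLS = 48  # chunk threshold taken from the description above


def chunk_columns(columns, key_cols, max_cols=MAX_COLS):
    """Split columns into chunks of at most max_cols, each led by key_cols."""
    step = max_cols - len(key_cols)
    if step <= 0:
        raise ValueError("max_cols must exceed the number of key columns")
    others = [c for c in columns if c not in key_cols]
    return [list(key_cols) + others[i:i + step] for i in range(0, len(others), step)]


def fetch_chunk(cols):
    """Stand-in for one API request: returns {serial: {column: value}}."""
    return {serial: {c: f"{c}:{serial}" for c in cols} for serial in ("h1", "h2")}


def fetch_all(columns, key_cols=("SERIALNO",)):
    """Fetch every chunk in parallel and merge the pieces by record key."""
    chunks = chunk_columns(columns, key_cols)
    with ThreadPoolExecutor(max_workers=4) as pool:
        parts = list(pool.map(fetch_chunk, chunks))
    merged = {}
    for part in parts:
        for serial, row in part.items():
            merged.setdefault(serial, {}).update(row)
    return merged


columns = ["SERIALNO"] + [f"VAR{i}" for i in range(100)]
chunks = chunk_columns(columns, ("SERIALNO",))
table = fetch_all(columns)
```

With 100 variables plus the key, the request splits into three chunks, and each merged record ends up with all 101 columns.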
PUMSSource
dataclass
Configuration for PUMS data source.
Attributes:
| Name | Type | Description |
|---|---|---|
| state_fips | str | Two-digit FIPS code for the state (e.g. "06" for California). |
| pums_year | int | ACS 1-year PUMS vintage (e.g. 2022). |
| puma_ids | list[str] \| None | Optional list of PUMA codes to fetch. If None, fetches all PUMAs in the state (can be large). |
fetch_pums_data
fetch_pums_data(
source: PUMSSource,
extra_hh_vars: set[str] | None = None,
extra_person_vars: set[str] | None = None,
load_replicate_weights: bool = False,
cache_dir: Path | None = None,
) -> tuple[pl.DataFrame, pl.DataFrame]
Download PUMS household and person microdata from the Census API.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| source | PUMSSource | State, year, and optional PUMA filter. | required |
| extra_hh_vars | set[str] \| None | Additional household PUMS variable names to fetch beyond the defaults. | None |
| extra_person_vars | set[str] \| None | Additional person PUMS variable names to fetch beyond the defaults. | None |
| load_replicate_weights | bool | If | False |
| cache_dir | Path \| None | If set, raw PUMS data is cached as Parquet files under | None |
Returns:
| Type | Description |
|---|---|
| pl.DataFrame | Tuple of |
| pl.DataFrame | data typed to appropriate dtypes. |
load_pums_from_files
load_pums_from_files(
hh_path: str,
person_path: str,
state_fips: str | None = None,
puma_ids: list[str] | None = None,
load_replicate_weights: bool = False,
) -> tuple[pl.DataFrame, pl.DataFrame]
Load PUMS data from local CSV/Parquet files.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| hh_path | str | Path to household PUMS file (CSV or Parquet). | required |
| person_path | str | Path to person PUMS file (CSV or Parquet). | required |
| state_fips | str \| None | Optional filter to a specific state FIPS code. | None |
| puma_ids | list[str] \| None | Optional filter to specific PUMAs. | None |
| load_replicate_weights | bool | If | False |
Returns:
| Type | Description |
|---|---|
| tuple[pl.DataFrame, pl.DataFrame] | Tuple of |
processing.weighting.data_prep.control_data
PUMS control-data transformation.
Recodes raw PUMS variables into the shared control categories defined in
controls.py, then aggregates weighted totals by geography (PUMA by
default).
All mapping logic lives in controls.py. This module only orchestrates
recoding (via ctrl.pums_expr()) and aggregation.
Approach:
- Load PUMS data; join persons to households to carry PUMA geography.
- Join crosswalk; multiply PUMS weight by allocation_weight to distribute into custom zones.
- For each control: apply filter, recode variable into bins/groups, aggregate weighted sum by (ctrl_geoid, category).
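The allocation and aggregation steps in this approach can be sketched in plain Python. This is an illustration under stated assumptions, not the module's Polars code: `allocate_and_total` is a hypothetical name, the records are plain dictionaries, and WGTP (the PUMS household weight) is used as the example weight column.

```python
from collections import defaultdict


def allocate_and_total(records, crosswalk, weight_col="WGTP", category_col="h_size"):
    """Distribute PUMS weights into zones and sum by (ctrl_geoid, category)."""
    # Index the crosswalk by PUMA so each record finds its target zones.
    by_puma = defaultdict(list)
    for row in crosswalk:
        by_puma[row["puma_id"]].append((row["ctrl_geoid"], row["allocation_weight"]))

    totals = defaultdict(float)
    for rec in records:
        for zone, alloc in by_puma.get(rec["PUMA"], []):
            # Each record contributes weight * allocation_weight to each zone.
            totals[(zone, rec[category_col])] += rec[weight_col] * alloc
    return dict(totals)


# Illustrative values only.
crosswalk = [
    {"puma_id": "00101", "ctrl_geoid": "A", "allocation_weight": 0.75},
    {"puma_id": "00101", "ctrl_geoid": "B", "allocation_weight": 0.25},
]
records = [
    {"PUMA": "00101", "WGTP": 100, "h_size": "1"},
    {"PUMA": "00101", "WGTP": 40, "h_size": "2"},
]
totals = allocate_and_total(records, crosswalk)
```

Records whose PUMA has no crosswalk entry contribute nothing in this sketch; how the real pipeline treats them is not stated in the text.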
Control Variable YAML Configuration Example:

    controls:
      # Simple marginal: household size
      - name: h_size
        table: households
        variable: NP
        bins:
          "1": [1, 1]
          "2": [2, 2]
          "3": [3, 3]
          "4+": [4, 99]
      # Grouped marginal: commute mode
      - name: commute_mode
        table: persons
        variable: JWTRNS
        groups:
          drove_alone: [1]
          carpool: [2, 3]
          transit: [4, 5, 6, 7, 8, 9]
          other: [10, 11, 12]
        filter: "ESR in [1,2,4,5]"  # employed persons only
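To make the bins and groups semantics concrete, here is a small plain-Python sketch of how such a config could be applied to a single value. The helper names `recode_bins` and `recode_groups` are hypothetical, and treating bin bounds as inclusive on both ends is an assumption suggested by entries such as [1, 1].

```python
# Mappings copied from the example config above.
H_SIZE_BINS = {"1": (1, 1), "2": (2, 2), "3": (3, 3), "4+": (4, 99)}
COMMUTE_GROUPS = {
    "drove_alone": [1],
    "carpool": [2, 3],
    "transit": [4, 5, 6, 7, 8, 9],
    "other": [10, 11, 12],
}


def recode_bins(value, bins):
    """Return the label of the first bin whose inclusive range holds value."""
    for label, (low, high) in bins.items():
        if low <= value <= high:
            return label
    return None  # value falls outside every bin


def recode_groups(value, groups):
    """Return the label of the group whose code list contains value."""
    for label, codes in groups.items():
        if value in codes:
            return label
    return None  # code is not listed in any group
```

For example, a five-person household maps to "4+" and a JWTRNS code of 3 maps to "carpool".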
ControlSpec
dataclass
Specification for a single weighting control.
Attributes:
| Name | Type | Description |
|---|---|---|
| name | str | Registry name (must exist in |
| importance | float \| None | Explicit importance weight for the balancer. |
| dimensions | list[str] \| None | For cross-tab controls only: list of dimension control names (e.g. |
| merges | dict \| None | For cross-tab controls only: per-dimension merge specs applied at registration time. |
ControlTotals
dataclass
Result of PUMS control-total aggregation.
Attributes:
| Name | Type | Description |
|---|---|---|
| totals | pl.DataFrame | Tidy frame with columns: [geo_id, control_name, category, target_total] |
| pums_hh_count | int | Total PUMS housing unit records (before weighting). |
| pums_person_count | int | Total PUMS person records. |
| geo_ids | list[str] | Unique geography IDs in the totals. |
recode_pums_households
recode_pums_households(
hh_df: pl.DataFrame,
person_df: pl.DataFrame,
targets: list[str] | None = None,
) -> pl.DataFrame
Recode PUMS household records into control categories.
Derives person-level aggregates (workers, children) then loops over
household-level controls calling ctrl.from_pums_row.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| hh_df | pl.DataFrame | PUMS household microdata. | required |
| person_df | pl.DataFrame | PUMS person microdata (used to derive hh-level aggregates). | required |
| targets | list[str] \| None | Registry keys to recode. | None |
recode_pums_persons
recode_pums_persons(
person_df: pl.DataFrame, targets: list[str] | None = None
) -> pl.DataFrame
Recode PUMS person records into control categories.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| person_df | pl.DataFrame | PUMS person microdata. | required |
| targets | list[str] \| None | Registry keys to recode. | None |
build_control_totals
build_control_totals(
hh_df: pl.DataFrame,
person_df: pl.DataFrame,
controls: list[ControlSpec],
geo_col: str = "PUMA",
) -> ControlTotals
Build weighted control totals from recoded PUMS data.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| hh_df | pl.DataFrame | Recoded PUMS household data (from | required |
| person_df | pl.DataFrame | Recoded PUMS person data (from | required |
| controls | list[ControlSpec] | Control specifications (which variables to include). | required |
| geo_col | str | Column name for the geography identifier. Defaults to | 'PUMA' |
Returns:
| Type | Description |
|---|---|
| ControlTotals | Tidy totals frame and metadata. |
apply_zone_groups
apply_zone_groups(
control_totals: ControlTotals,
seed: pl.DataFrame,
zone_groups: dict[str, list[str]],
geo_col: str = "ctrl_geoid",
) -> tuple[ControlTotals, pl.DataFrame]
Merge zones for balancing while preserving crosswalk granularity.
Zones listed under a group name are remapped to that group in both the control totals and the seed table. Unmapped zones pass through unchanged.
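The remapping rule can be sketched in plain Python. The names `remap_zone`, `merge_totals`, and `remap_seed` are hypothetical, the data are dictionaries rather than Polars frames, and summing the targets of merged zones is an assumption about how grouped totals are combined.

```python
from collections import defaultdict


def remap_zone(zone, zone_groups):
    """Return the group name for a grouped zone, else the zone unchanged."""
    for group, members in zone_groups.items():
        if zone in members:
            return group
    return zone


def merge_totals(totals, zone_groups):
    """Re-key {(zone, control, category): total}, summing merged zones."""
    merged = defaultdict(float)
    for (zone, control, category), value in totals.items():
        merged[(remap_zone(zone, zone_groups), control, category)] += value
    return dict(merged)


def remap_seed(seed_rows, zone_groups, geo_col="ctrl_geoid"):
    """Apply the same zone remapping to each seed record."""
    return [{**row, geo_col: remap_zone(row[geo_col], zone_groups)} for row in seed_rows]


# Illustrative values only.
zone_groups = {"north": ["A", "B"]}
totals = {
    ("A", "h_size", "1"): 75.0,
    ("B", "h_size", "1"): 25.0,
    ("C", "h_size", "1"): 10.0,
}
merged_totals = merge_totals(totals, zone_groups)
merged_seed = remap_seed([{"hh_id": 1, "ctrl_geoid": "B"}, {"hh_id": 2, "ctrl_geoid": "C"}], zone_groups)
```

Zones A and B collapse into "north" with a combined target of 100, while zone C passes through unchanged.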
processing.weighting.data_prep.seed_data
Survey seed-data preparation for weighting.
Uses ControlTarget.survey_expr() to recode canonical survey data into
control-category ints via native Polars expressions (vectorised, no
map_elements). Every control is handled uniformly.
Survey field mapping is hardcoded in each ControlTarget subclass: the survey_fields class attribute declares which canonical survey columns the control reads, and survey_expr() returns the Polars expression that maps those values to control-category ints. For example:
- HHIncomeControl.survey_fields = ("income_bin",) -- reads the already-binned income_bin column via identity_expr.
- GenderControl.survey_fields = ("gender",) -- maps Gender enum values to GenderCategory ints via replace_strict.
- HHSizeControl.survey_fields = ("_n_persons",) -- clips the person-count aggregate column to [1, 10].
No YAML config is involved; the mapping lives entirely in the control
class definitions in processing.weighting.controls.
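The pattern of declaring fields on the class and recoding from them can be sketched without Polars. Everything below is a plain-Python stand-in with hypothetical names (`HHSizeSketch`, `GenderSketch`, `recode_rows`); the real classes return Polars expressions from survey_expr(), and the gender codes shown are invented for illustration.

```python
class HHSizeSketch:
    """Stand-in for a household-size control: clip the count to [1, 10]."""

    survey_fields = ("_n_persons",)

    def recode(self, row):
        return min(max(row["_n_persons"], 1), 10)


class GenderSketch:
    """Stand-in for a strict value mapping; unmapped values raise KeyError."""

    survey_fields = ("gender",)
    mapping = {"female": 1, "male": 2}  # hypothetical category codes

    def recode(self, row):
        return self.mapping[row["gender"]]


def recode_rows(rows, controls):
    """Apply every control to every row after checking declared fields."""
    out = []
    for row in rows:
        for name, ctrl in controls.items():
            missing = [f for f in ctrl.survey_fields if f not in row]
            if missing:
                raise KeyError(f"{name} needs missing column(s): {missing}")
        out.append({name: ctrl.recode(row) for name, ctrl in controls.items()})
    return out


hh_out = recode_rows(
    [{"_n_persons": 14}, {"_n_persons": 0}, {"_n_persons": 3}],
    {"h_size": HHSizeSketch()},
)
per_out = recode_rows([{"gender": "female"}], {"gender": GenderSketch()})
```

Because each control names its own input columns, the recoding loop can treat all controls uniformly and report a missing column before any recoding runs.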
recode_survey_households
recode_survey_households(
households: pl.DataFrame,
persons: pl.DataFrame,
targets: list[str],
strict_nulls: bool = False,
) -> pl.DataFrame
Recode canonical survey households into control categories.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| households | pl.DataFrame | Canonical households table (must have | required |
| persons | pl.DataFrame | Canonical persons table (must have | required |
| targets | list[str] | Registry keys to recode. Person-level keys are ignored. | required |
| strict_nulls | bool | If | False |
Raises:
| Type | Description |
|---|---|
| ValueError | Unknown target. |
| KeyError | Missing required column. |
recode_survey_persons
recode_survey_persons(
persons: pl.DataFrame,
targets: list[str],
strict_nulls: bool = False,
) -> pl.DataFrame
Recode canonical survey persons into control categories.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| persons | pl.DataFrame | Canonical persons table. | required |
| targets | list[str] | Registry keys to recode. Household-level keys are ignored. | required |
| strict_nulls | bool | If | False |
Raises:
| Type | Description |
|---|---|
| ValueError | Unknown target. |
| KeyError | Missing required column. |