Population-Weighted Geography Crosswalk

Overview

The weighting pipeline uses a population-weighted crosswalk to map PUMAs onto user-defined target zones. Because the source and target boundaries rarely align, Census blocks are rasterized into a population grid, PUMAs are rasterized into a label grid, and exactextract performs zonal cross-tabulation against the target polygons.

The implementation lives in src/processing/weighting/data_prep/crosswalk.py.

flowchart LR
    B["Census Blocks"] --> B1["Rasterize population"] --> B2["Population grid"]
    P["PUMAs"] --> P1["Rasterize IDs"] --> P2["Label grid"]
    Z["Target Zones\n(polygons)"]

    B2 --> X["exactextract\n(zonal stats)"]
    P2 --> X
    Z --> X

    X --> C["Crosswalk"]

Calculation Details

1. Rasterization

Two arrays are burned onto a common grid (CRS: EPSG:5070, NAD83 CONUS Albers) at a configurable cell size (default 250 m):

Array	Dtype	Cell value
Weight (`pop`)	float32	Block population distributed evenly across the cells each block covers: `cell_pop = block_pop / n_cells`
Source label	int32	Integer ID of the source zone covering that cell (0 = no data)
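
The even-spreading rule `cell_pop = block_pop / n_cells` can be illustrated without any raster library. The sketch below is not the pipeline's code: it assumes a tiny grid whose cells already carry a block ID (`None` where no block covers the cell), and the names `spread_population`, `block_grid`, and `block_pop` are hypothetical.

```python
from collections import Counter


def spread_population(block_grid, block_pop):
    """Give every cell an equal share of its block's population."""
    # Count how many cells each block covers.
    n_cells = Counter(b for row in block_grid for b in row if b is not None)
    # Each covered cell receives block_pop / n_cells; uncovered cells get 0.
    return [
        [block_pop[b] / n_cells[b] if b is not None else 0.0 for b in row]
        for row in block_grid
    ]


# Hypothetical 2 x 3 grid: block "b1" covers four cells, "b2" covers one.
block_grid = [
    ["b1", "b1", "b2"],
    ["b1", "b1", None],
]
block_pop = {"b1": 100, "b2": 30}

pop_grid = spread_population(block_grid, block_pop)
```

Because every block's population is divided among exactly the cells it covers, the grid total equals the sum of the block populations.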

2. Zonal Cross-Tabulation (`exactextract`)

Both rasters are written to temporary GeoTIFFs and passed to exactextract together with the target-zone polygon layer:

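from exactextract import exact_extract

# Extract per-cell values and coverage fractions from both rasters for each target polygon.
exact_extract(
    [weight_path, source_path],  # population raster, source-label raster
    target_gdf,                  # target-zone polygons
    ["values", "coverage"],      # per-cell raster values and coverage fractions
    include_cols=["target_id"],  # carry the target ID through to the output
)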

For each target zone polygon z, exactextract returns per-cell arrays of:

v_i_weight: weight value of cell i
v_i_source: integer source-zone label of cell i
c_i: coverage fraction of cell i by polygon z

In plain terms, each cell contributes weight value * polygon coverage fraction.

3. Population by Source × Target Zone

Within each target zone z, cells are grouped by source label s, and the coverage-weighted population is summed across all matching cells.
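
As a minimal sketch of this grouping step, assume the per-cell arrays for one target zone have already been extracted into three parallel lists. The function name `pop_by_source` and the sample values are hypothetical; label 0 is skipped because the label grid uses it for "no data".

```python
from collections import defaultdict


def pop_by_source(weights, labels, coverage):
    """Sum weight * coverage per source label for a single target zone."""
    totals = defaultdict(float)
    for w, s, c in zip(weights, labels, coverage):
        if s == 0:  # 0 marks cells with no source zone
            continue
        totals[s] += w * c
    return dict(totals)


# Hypothetical cells intersecting one target polygon.
weights = [10.0, 10.0, 10.0, 10.0]   # v_i_weight
labels = [1, 1, 2, 0]                # v_i_source
coverage = [1.0, 0.5, 0.25, 1.0]     # c_i

result = pop_by_source(weights, labels, coverage)
```

Here source 1 contributes 10 × 1.0 + 10 × 0.5 and source 2 contributes 10 × 0.25; the last cell is ignored because it has no source label.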

4. Allocation Weights

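Each source-by-target population is then divided by the total population of its source zone:

allocation_weight(s, z) = Pop(s, z) / total population of source zone s

The denominator is the source zone's full rasterized population, including any part that falls outside every target zone. The allocation weights for each source zone therefore sum to at most 1.0 across target zones, and to exactly 1.0 only when the target zones capture all of that source zone's population.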
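
Steps 3 and 4 can be written compactly in the notation of step 2. This is a sketch rather than the formulation in CROSSWALK.md: the symbols Pop, P_s, and w are introduced here for illustration, c_i is the coverage of cell i by the polygon z being summed over, and the denominator is assumed to be the source zone's full rasterized population, as the Returns description of build_crosswalk implies.

```latex
\begin{aligned}
\mathrm{Pop}(s, z) &= \sum_{i \in z,\; v_i^{\text{source}} = s} v_i^{\text{weight}} \, c_i \\
P_s &= \sum_{i:\; v_i^{\text{source}} = s} v_i^{\text{weight}} \\
w(s, z) &= \frac{\mathrm{Pop}(s, z)}{P_s}
\end{aligned}
```

If the target polygons do not overlap one another, each cell's coverage fractions add up to at most 1 across zones, so the weights w(s, z) add up to at most 1 for each source zone s.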

Worked Example

Figure: zone overlap on a 4×4 raster grid.

Assume a 4 x 4 grid of equal-sized cells, with each populated cell contributing 10 people.

Source A occupies the left 8 cells, so its total population is 80.
Source B occupies the right 8 cells, so its total population is 80.
Zone L captures 75% of Source A.
Zone M captures the remaining 25% of Source A and 25% of Source B.
Zone R captures the remaining 75% of Source B.

That gives the following source-by-target populations:

Source	Target zone	Population contribution
A	L	60
A	M	20
B	M	20
B	R	60

Now normalize within each source zone:

For Source A:

allocation_weight(A, L) = 60 / 80 = 0.75
allocation_weight(A, M) = 20 / 80 = 0.25

For Source B:

allocation_weight(B, M) = 20 / 80 = 0.25
allocation_weight(B, R) = 60 / 80 = 0.75

So the final crosswalk rows would look like this:

source_id	target_id	allocation_weight
A	L	0.75
A	M	0.25
B	M	0.25
B	R	0.75

This is the key idea: each source geography is split across target zones in proportion to the coverage-weighted population captured in each target polygon.
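
The numbers in this example can be reproduced with a few lines of plain Python. The zone extents below are an assumption chosen to match the stated percentages (the text does not give the geometry): each zone is a vertical strip spanning all four rows, so a cell's coverage fraction depends only on its column.

```python
from collections import defaultdict

CELL_POP = 10.0
N_ROWS, N_COLS = 4, 4

# Columns 0-1 belong to Source A, columns 2-3 to Source B.
source_of_col = {0: "A", 1: "A", 2: "B", 3: "B"}

# Hypothetical zone extents along x, in cell units.
zones = {"L": (0.0, 1.5), "M": (1.5, 2.5), "R": (2.5, 4.0)}


def column_coverage(col, x_min, x_max):
    """Fraction of the unit-wide column [col, col + 1] inside [x_min, x_max]."""
    return max(0.0, min(col + 1, x_max) - max(col, x_min))


# Step 3: coverage-weighted population by (source, target) pair.
pop = defaultdict(float)
for zone, (x_min, x_max) in zones.items():
    for col in range(N_COLS):
        cov = column_coverage(col, x_min, x_max)
        if cov > 0:
            pop[(source_of_col[col], zone)] += N_ROWS * CELL_POP * cov

# Total population of each source zone over the whole grid.
source_total = defaultdict(float)
for col in range(N_COLS):
    source_total[source_of_col[col]] += N_ROWS * CELL_POP

# Step 4: allocation weights.
crosswalk = {pair: p / source_total[pair[0]] for pair, p in pop.items()}
```

With these extents, columns 1 and 2 are each split half-and-half between neighbouring zones, which yields the 60/20/20/60 populations and the 0.75/0.25 weights in the tables above.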

API

The PUMA-specific wrapper classes are documented on the Data Preparation page:

PumaCrosswalk
TargetZoneConfig
GeographyConfig

Generic crosswalk function

build_crosswalk is the underlying generic implementation. It works with any combination of source, target, and weight polygon layers and is not specific to PUMAs or to the weighting pipeline.

utils.crosswalk

Generic population-weighted geography crosswalk utilities.

Provides build_crosswalk, a generic function that maps source polygons to target polygons via a weight polygon layer (e.g. Census blocks with population counts). The source and weight layers are rasterized onto a common grid, and exactextract performs the zonal cross-tabulation against the target polygons with sub-pixel coverage fractions.

See src/utils/CROSSWALK.md for the mathematical formulation.

Example

from utils.crosswalk import build_crosswalk

# pumas, counties, and blocks are GeoDataFrames loaded beforehand.
xw = build_crosswalk(
    source_gdf=pumas, target_gdf=counties, weight_gdf=blocks,
    source_id_col="PUMACE20", target_id_col="COUNTYFP",
    weight_col="block_pop", resolution=250,
)

build_crosswalk

build_crosswalk(
    source_gdf: gpd.GeoDataFrame,
    target_gdf: gpd.GeoDataFrame,
    weight_gdf: gpd.GeoDataFrame,
    *,
    source_id_col: str = "source_id",
    target_id_col: str = "target_id",
    weight_col: str = "block_pop",
    resolution: int = 100,
    min_allocation: float = 0.0
) -> pl.DataFrame

Build a population-weighted crosswalk between two polygon layers.

Parameters:

Name	Type	Description	Default
`source_gdf`	`gpd.GeoDataFrame`	Source zone polygons (e.g. PUMAs). Must contain source_id_col.	required
`target_gdf`	`gpd.GeoDataFrame`	Target zone polygons (e.g. counties, TAZs). Must contain target_id_col.	required
`weight_gdf`	`gpd.GeoDataFrame`	Weight polygons carrying a numeric attribute (e.g. Census blocks with `block_pop`). Must contain weight_col and geometry.	required
`source_id_col`	`str`	Column in source_gdf identifying source zones.	`'source_id'`
`target_id_col`	`str`	Column in target_gdf identifying target zones.	`'target_id'`
`weight_col`	`str`	Numeric column in weight_gdf used as the allocation weight (typically population).	`'block_pop'`
`resolution`	`int`	Raster cell size in metres (EPSG:5070).	`100`
`min_allocation`	`float`	Drop source-target pairs whose allocation weight is below this fraction (e.g. 0.02 = 2%). Remaining weights are not re-normalised; the dropped slivers simply become part of the out-of-region remainder. Default 0.0 (keep all).	`0.0`
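
The effect of `min_allocation` can be sketched on plain rows. This is an illustration of the documented behaviour, not the library's implementation; the helper name `apply_min_allocation`, the zone IDs, and the sample weights are hypothetical.

```python
def apply_min_allocation(rows, min_allocation=0.0):
    """Drop rows whose weight is below the threshold; do not re-normalise."""
    return [r for r in rows if r["allocation_weight"] >= min_allocation]


# Hypothetical crosswalk rows for one source zone, including a 1% sliver.
rows = [
    {"source_id": "A", "target_id": "L", "allocation_weight": 0.75},
    {"source_id": "A", "target_id": "M", "allocation_weight": 0.24},
    {"source_id": "A", "target_id": "N", "allocation_weight": 0.01},
]

kept = apply_min_allocation(rows, min_allocation=0.02)
kept_total = sum(r["allocation_weight"] for r in kept)
```

The sliver to zone N is removed and the surviving weights keep their original values, so the total for source A falls from 1.0 to about 0.99.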

Returns:

Type	Description
`pl.DataFrame`	DataFrame with columns `[source_id, target_id, population, allocation_weight]`. `allocation_weight` is the fraction of each source zone's total population that falls in the target zone. Weights sum to <= 1.0 per `source_id`; the remainder is population outside all target zones.
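
One way to inspect the remainder described above is to total the weights per source zone. The sketch below works on plain dictionaries so that it runs without dependencies; with a Polars result, the rows could be obtained via `to_dicts()`. The helper name `out_of_region_share` and the sample rows are hypothetical.

```python
from collections import defaultdict


def out_of_region_share(rows, tol=1e-9):
    """Return, per source_id, the share of population outside all target zones."""
    totals = defaultdict(float)
    for row in rows:
        totals[row["source_id"]] += row["allocation_weight"]
    for source_id, total in totals.items():
        # Weights above 1.0 would violate the documented invariant.
        if total > 1.0 + tol:
            raise ValueError(f"weights for {source_id!r} sum to {total}")
    return {s: max(0.0, 1.0 - t) for s, t in totals.items()}


# Hypothetical rows: A is fully covered, B only partly.
rows = [
    {"source_id": "A", "target_id": "L", "allocation_weight": 0.75},
    {"source_id": "A", "target_id": "M", "allocation_weight": 0.25},
    {"source_id": "B", "target_id": "M", "allocation_weight": 0.25},
]

remainder = out_of_region_share(rows)
```

A large remainder for a source zone indicates that much of its population lies outside the target layer or was dropped by `min_allocation`.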