Skip to content

Population-Weighted Geography Crosswalk

Overview

The weighting pipeline uses a population-weighted crosswalk to map PUMAs onto user-defined target zones. Because the source and target boundaries rarely align, Census blocks are rasterized into a population grid, PUMAs are rasterized into a label grid, and exactextract performs zonal cross-tabulation against the target polygons.

The implementation lives in src/processing/weighting/data_prep/crosswalk.py.

flowchart LR
    B["Census Blocks"] --> B1["Rasterize population"] --> B2["Population grid"]
    P["PUMAs"] --> P1["Rasterize IDs"] --> P2["Label grid"]
    Z["Target Zones\n(polygons)"]

    B2 --> X["exactextract\n(zonal stats)"]
    P2 --> X
    Z --> X

    X --> C["Crosswalk"]

Calculation Details

1. Rasterization

Two arrays are burned onto a common grid (CRS: EPSG:5070, NAD83 CONUS Albers) at a configurable cell size (default 250 m):

Array Dtype Cell value
Weight (pop) float32 Block population distributed evenly across the cells each block covers: cell_pop = block_pop / n_cells
Source label int32 Integer ID of the source zone covering that cell (0 = no data)

2. Zonal Cross-Tabulation (exactextract)

Both rasters are written to temporary GeoTIFFs and passed to exactextract together with the target-zone polygon layer:

exact_extract(
    [weight_path, source_path],
    target_gdf,
    ["values", "coverage"],
    include_cols=["target_id"],
)

For each target zone polygon z, exactextract returns per-cell arrays of:

  • v_i_weight: weight value of cell i
  • v_i_source: integer source-zone label of cell i
  • c_i: coverage fraction of cell i by polygon z

In plain terms, each cell contributes weight value * polygon coverage fraction.

3. Population by Source × Target Zone

Within each target zone z, cells are grouped by source label s, and the coverage-weighted population is summed across all matching cells.

4. Allocation Weights

Each source zone's population is then normalized across target zones:

  • allocation_weight(s, z) = Pop(s, z) / total source population

By construction, the allocation weights sum to 1.0 across target zones for each source zone.

Worked Example

Worked example — zone overlap on 4×4 raster grid

Assume a 4 x 4 grid of equal-sized cells, with each populated cell contributing 10 people.

  • Source A occupies the left 8 cells, so its total population is 80.
  • Source B occupies the right 8 cells, so its total population is 80.
  • Zone L captures about 75% of Source A.
  • Zone M captures the remaining 25% of Source A and 25% of Source B.
  • Zone R captures the remaining 75% of Source B.

That gives the following source-by-target populations:

Source Target zone Population contribution
A L 60
A M 20
B M 20
B R 60

Now normalize within each source zone:

For Source A:

  • allocation_weight(A, L) = 60 / 80 = 0.75
  • allocation_weight(A, M) = 20 / 80 = 0.25

For Source B:

  • allocation_weight(B, M) = 20 / 80 = 0.25
  • allocation_weight(B, R) = 60 / 80 = 0.75

So the final crosswalk rows would look like this:

source_id target_id allocation_weight
A L 0.75
A M 0.25
B M 0.25
B R 0.75

This is the key idea: each source geography is split across target zones in proportion to the coverage-weighted population captured in each target polygon.

API

The PUMA-specific wrapper classes live on the Data Preparation page:

  • PumaCrosswalk
  • TargetZoneConfig
  • GeographyConfig

Generic crosswalk function

build_crosswalk is the underlying generic implementation — it works with any source/target/weight polygon combination and is not specific to PUMAs or weighting.

utils.crosswalk

Generic population-weighted geography crosswalk utilities.

Provides build_crosswalk, a generic function that maps source polygons to target polygons via a weight polygon layer (e.g. Census blocks with population counts). The three layers are rasterized onto a common grid and exactextract performs the zonal cross-tabulation with sub-pixel coverage fractions.

See src/utils/CROSSWALK.md for the mathematical formulation.

Example
from utils.crosswalk import build_crosswalk

xw = build_crosswalk(
    source_gdf=pumas, target_gdf=counties, weight_gdf=blocks,
    source_id_col="PUMACE20", target_id_col="COUNTYFP",
    weight_col="block_pop", resolution=250,
)

build_crosswalk

build_crosswalk(
    source_gdf: gpd.GeoDataFrame,
    target_gdf: gpd.GeoDataFrame,
    weight_gdf: gpd.GeoDataFrame,
    *,
    source_id_col: str = "source_id",
    target_id_col: str = "target_id",
    weight_col: str = "block_pop",
    resolution: int = 100,
    min_allocation: float = 0.0
) -> pl.DataFrame

Build a population-weighted crosswalk between two polygon layers.

Parameters:

Name Type Description Default
source_gdf gpd.GeoDataFrame

Source zone polygons (e.g. PUMAs). Must contain source_id_col.

required
target_gdf gpd.GeoDataFrame

Target zone polygons (e.g. counties, TAZs). Must contain target_id_col.

required
weight_gdf gpd.GeoDataFrame

Weight polygons carrying a numeric attribute (e.g. Census blocks with block_pop). Must contain weight_col and geometry.

required
source_id_col str

Column in source_gdf identifying source zones.

'source_id'
target_id_col str

Column in target_gdf identifying target zones.

'target_id'
weight_col str

Numeric column in weight_gdf used as the allocation weight (typically population).

'block_pop'
resolution int

Raster cell size in metres (EPSG:5070).

100
min_allocation float

Drop source-target pairs whose allocation weight is below this fraction (e.g. 0.02 = 2%). Remaining weights are not re-normalised; the dropped slivers simply become part of the out-of-region remainder. Default 0.0 (keep all).

0.0

Returns:

Type Description
pl.DataFrame

DataFrame with columns

pl.DataFrame

[source_id, target_id, population, allocation_weight].

pl.DataFrame

allocation_weight is the fraction of each source zone's

pl.DataFrame

total population that falls in the target zone. Weights sum

pl.DataFrame

to <= 1.0 per source_id; the remainder is population outside

pl.DataFrame

all target zones.