Population-Weighted Geography Crosswalk
Overview
The weighting pipeline uses a population-weighted crosswalk to map PUMAs onto user-defined target zones. Because the source and target boundaries rarely align, Census blocks are rasterized into a population grid, PUMAs are rasterized into a label grid, and exactextract performs zonal cross-tabulation against the target polygons.
The implementation lives in src/processing/weighting/data_prep/crosswalk.py.
flowchart LR
B["Census Blocks"] --> B1["Rasterize population"] --> B2["Population grid"]
P["PUMAs"] --> P1["Rasterize IDs"] --> P2["Label grid"]
Z["Target Zones\n(polygons)"]
B2 --> X["exactextract\n(zonal stats)"]
P2 --> X
Z --> X
X --> C["Crosswalk"]
Calculation Details
1. Rasterization
Two arrays are burned onto a common grid (CRS: EPSG:5070, NAD83 CONUS Albers) at a configurable cell size (default 250 m):
| Array | Dtype | Cell value |
|---|---|---|
| Weight (pop) | float32 | Block population distributed evenly across the cells each block covers: cell_pop = block_pop / n_cells |
| Source label | int32 | Integer ID of the source zone covering that cell (0 = no data) |
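The even-split rule for the weight array can be illustrated with a minimal pure-Python sketch. It assumes the block footprint has already been burned onto the grid as block IDs; the helper name `spread_block_population` and the toy inputs are hypothetical and are not part of the actual module, which presumably relies on a rasterization library.

```python
from collections import Counter


def spread_block_population(block_grid, block_pop):
    """Return a grid of per-cell population (hypothetical helper).

    block_grid: 2-D list of block IDs (None where no block covers the cell).
    block_pop:  dict mapping block ID -> total block population.
    """
    # Count how many cells each block covers.
    n_cells = Counter(b for row in block_grid for b in row if b is not None)
    # Each covered cell receives an equal share: block_pop / n_cells.
    return [
        [block_pop[b] / n_cells[b] if b is not None else 0.0 for b in row]
        for row in block_grid
    ]


# Toy grid: block b1 covers four cells, block b2 covers one, one cell is empty.
block_grid = [
    ["b1", "b1", "b2"],
    ["b1", "b1", None],
]
cell_pop = spread_block_population(block_grid, {"b1": 100, "b2": 30})
```

Summing `cell_pop` over the grid recovers the total block population, which is the property the even split is meant to preserve.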
2. Zonal Cross-Tabulation (exactextract)
Both rasters are written to temporary GeoTIFFs and passed to exactextract together with the target-zone polygon layer:
exact_extract(
    [weight_path, source_path],
    target_gdf,
    ["values", "coverage"],
    include_cols=["target_id"],
)
For each target zone polygon z, exactextract returns per-cell arrays of:
- v_i_weight: weight value of cell i
- v_i_source: integer source-zone label of cell i
- c_i: coverage fraction of cell i by polygon z
In plain terms, each cell contributes weight value * polygon coverage fraction.
3. Population by Source × Target Zone
Within each target zone z, cells are grouped by source label s, and the coverage-weighted population is summed across all matching cells:
Pop(s, z) = sum of v_i_weight * c_i over the cells i in z with v_i_source = s
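The grouping step can be sketched as follows for a single target zone. This is an illustration only: the function name `pop_by_source` and the sample arrays are assumptions, standing in for the per-cell value and coverage arrays that exactextract returns for one polygon.

```python
from collections import defaultdict


def pop_by_source(weights, sources, coverage):
    """Sum weight * coverage per source label for one target zone."""
    totals = defaultdict(float)
    for w, s, c in zip(weights, sources, coverage):
        if s == 0:  # 0 marks "no data" in the source label grid
            continue
        totals[s] += w * c
    return dict(totals)


# Hypothetical cells intersecting one target zone z.
weights = [10.0, 10.0, 10.0, 10.0]   # v_i_weight
sources = [1, 1, 2, 0]               # v_i_source
coverage = [1.0, 0.5, 0.25, 1.0]     # c_i
pop_z = pop_by_source(weights, sources, coverage)
```

Here source 1 contributes 10 * 1.0 + 10 * 0.5 and source 2 contributes 10 * 0.25, while the no-data cell is ignored.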
4. Allocation Weights
Each source zone's population is then normalized across target zones:
allocation_weight(s, z) = Pop(s, z) / total source population
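When the target zones fully cover a source zone, its allocation weights sum to 1.0 across target zones. If part of the source zone's population falls outside every target zone, the weights sum to less than 1.0 and the shortfall is the out-of-region remainder.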
Worked Example
Assume a 4 x 4 grid of equal-sized cells, with each populated cell contributing 10 people.
- Source A occupies the left 8 cells, so its total population is 80.
- Source B occupies the right 8 cells, so its total population is 80.
- Zone L captures about 75% of Source A.
- Zone M captures the remaining 25% of Source A and 25% of Source B.
- Zone R captures the remaining 75% of Source B.
That gives the following source-by-target populations:
| Source | Target zone | Population contribution |
|---|---|---|
| A | L | 60 |
| A | M | 20 |
| B | M | 20 |
| B | R | 60 |
Now normalize within each source zone:
For Source A:
- allocation_weight(A, L) = 60 / 80 = 0.75
- allocation_weight(A, M) = 20 / 80 = 0.25
For Source B:
- allocation_weight(B, M) = 20 / 80 = 0.25
- allocation_weight(B, R) = 60 / 80 = 0.75
So the final crosswalk rows would look like this:
| source_id | target_id | allocation_weight |
|---|---|---|
| A | L | 0.75 |
| A | M | 0.25 |
| B | M | 0.25 |
| B | R | 0.75 |
This is the key idea: each source geography is split across target zones in proportion to the coverage-weighted population captured in each target polygon.
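The worked example can be reproduced with a short pure-Python sketch. The zone geometry is an assumption chosen to match the stated percentages: three vertical strips whose edges fall at the midpoints of columns 1 and 2, so those columns are half covered. The variable names are illustrative and do not come from the module.

```python
# 4 x 4 grid; columns 0-1 belong to Source A, columns 2-3 to Source B.
N = 4
source = [["A" if col < 2 else "B" for col in range(N)] for _ in range(N)]
pop = [[10.0] * N for _ in range(N)]

# Assumed coverage fraction of each column by each target zone.
coverage_by_col = {
    "L": [1.0, 0.5, 0.0, 0.0],
    "M": [0.0, 0.5, 0.5, 0.0],
    "R": [0.0, 0.0, 0.5, 1.0],
}

# Total population per source zone (the normalization denominator).
source_total = {}
for r in range(N):
    for c in range(N):
        s = source[r][c]
        source_total[s] = source_total.get(s, 0.0) + pop[r][c]

# Coverage-weighted population per (source, target) pair.
pair_pop = {}
for zone, cov in coverage_by_col.items():
    for r in range(N):
        for c in range(N):
            if cov[c] > 0:
                key = (source[r][c], zone)
                pair_pop[key] = pair_pop.get(key, 0.0) + pop[r][c] * cov[c]

# Normalize each pair by its source zone's total population.
crosswalk = {pair: p / source_total[pair[0]] for pair, p in pair_pop.items()}
```

The resulting `crosswalk` dictionary holds the same four source-target pairs and weights as the final table above.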
API
The PUMA-specific wrapper classes live on the Data Preparation page:
- PumaCrosswalk
- TargetZoneConfig
- GeographyConfig
Generic crosswalk function
build_crosswalk is the underlying generic implementation — it works with any source/target/weight polygon combination and is not specific to PUMAs or weighting.
utils.crosswalk
Generic population-weighted geography crosswalk utilities.
Provides build_crosswalk, a generic function that maps source polygons to target polygons via a weight polygon layer (e.g. Census blocks with population counts). The source and weight layers are rasterized onto a common grid, and exactextract performs the zonal cross-tabulation against the target polygons with sub-pixel coverage fractions.
See src/utils/CROSSWALK.md for the mathematical formulation.
Example
from utils.crosswalk import build_crosswalk

# pumas, counties, and blocks are GeoDataFrames loaded elsewhere.
xw = build_crosswalk(
    source_gdf=pumas, target_gdf=counties, weight_gdf=blocks,
    source_id_col="PUMACE20", target_id_col="COUNTYFP",
    weight_col="block_pop", resolution=250,
)
build_crosswalk
build_crosswalk(
    source_gdf: gpd.GeoDataFrame,
    target_gdf: gpd.GeoDataFrame,
    weight_gdf: gpd.GeoDataFrame,
    *,
    source_id_col: str = "source_id",
    target_id_col: str = "target_id",
    weight_col: str = "block_pop",
    resolution: int = 100,
    min_allocation: float = 0.0
) -> pl.DataFrame
Build a population-weighted crosswalk between two polygon layers.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| source_gdf | gpd.GeoDataFrame | Source zone polygons (e.g. PUMAs). Must contain source_id_col. | required |
| target_gdf | gpd.GeoDataFrame | Target zone polygons (e.g. counties, TAZs). Must contain target_id_col. | required |
| weight_gdf | gpd.GeoDataFrame | Weight polygons carrying a numeric attribute (e.g. Census blocks with population counts). | required |
| source_id_col | str | Column in source_gdf identifying source zones. | 'source_id' |
| target_id_col | str | Column in target_gdf identifying target zones. | 'target_id' |
| weight_col | str | Numeric column in weight_gdf used as the allocation weight (typically population). | 'block_pop' |
| resolution | int | Raster cell size in metres (EPSG:5070). | 100 |
| min_allocation | float | Drop source-target pairs whose allocation weight is below this fraction (e.g. 0.02 = 2%). Remaining weights are not re-normalised; the dropped slivers simply become part of the out-of-region remainder. Default 0.0 (keep all). | 0.0 |
Returns:
| Type | Description |
|---|---|
| pl.DataFrame | DataFrame with one row per source-target pair, holding the source ID, the target ID, and the allocation weight. The allocation weight is the fraction of the source zone's total population that falls in the target zone. Weights sum to <= 1.0 per source zone across all target zones. |
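The effect of min_allocation can be sketched in a few lines. The helper `apply_min_allocation` and the sample weights are hypothetical and only mirror the behaviour described in the parameter table: pairs below the threshold are dropped and the surviving weights are left as they are.

```python
def apply_min_allocation(weights, min_allocation=0.0):
    """Drop pairs below the threshold; surviving weights are not re-normalised."""
    return {pair: w for pair, w in weights.items() if w >= min_allocation}


# Hypothetical weights for one source zone, including a 1% sliver.
weights = {("A", "L"): 0.75, ("A", "M"): 0.24, ("A", "N"): 0.01}
kept = apply_min_allocation(weights, min_allocation=0.02)

# The per-source total now falls short of 1.0 by the dropped sliver.
total_a = sum(w for (s, _), w in kept.items() if s == "A")
```

Because the weights are not re-normalised, `total_a` drops to roughly 0.99, and the missing 1% is treated like population outside the target region.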