
Detailed Control Generation Guide for Bay Area PopulationSim

A comprehensive guide to understanding the control generation process, data sources, geographic processing, and scaling methodologies used in the Bay Area PopulationSim model.


Table of Contents

  1. Overview and Purpose
  2. Data Sources and Census Integration
  3. Geographic Framework and Processing
  4. Control Categories and Definitions
  5. Scaling Methodologies
  6. Quality Assurance and Validation
  7. Output Files and Structure
  8. Technical Implementation

Overview and Purpose

The control generation step is the foundation of the Bay Area PopulationSim model, creating the statistical targets that guide synthetic population generation. This process transforms raw Census data into a comprehensive set of marginal controls at multiple geographic levels, ensuring that the synthetic population accurately reflects the demographic, economic, and household characteristics of the Bay Area.

What Are Controls?

Controls are statistical targets that specify how many people or households in each geographic zone should have specific characteristics. For example, a control might specify that Traffic Analysis Zone (TAZ) 1001 should contain 245 households with incomes between $75,000 and $99,999, or that Alameda County should have 118,550 management/professional workers.

The Three-Tier Geographic Hierarchy

The Bay Area PopulationSim model operates on a three-tier geographic hierarchy, each serving different modeling purposes:

  1. MAZ (Micro Analysis Zones): ~39,586 zones - Fine-grained geography for detailed local analysis
  2. TAZ (Traffic Analysis Zones): ~4,734 zones - Transportation modeling units for travel demand
  3. COUNTY: 9 zones - Regional units for labor market and economic analysis

Multi-Year Data Integration

The system integrates multiple Census data sources to leverage the best available information:


Data Sources and Census Integration

Primary Data Sources

2020 Decennial Census (Primary Base Data)

ACS 2023 5-Year Estimates (Demographic Detail)

ACS 2023 1-Year Estimates (Current Totals)

Data Processing Workflow

Step 1: Census Data Acquisition

The system automatically downloads and caches Census data using the Census API:

# Example: Downloading household income data
census_data = get_census_data(
    dataset='acs5',
    year=2023,
    table='B19001',  # Household income
    geography='tract',
    state='06',      # California
    county=['001', '013', '041', '055', '075', '081', '085', '095', '097']
)

Step 2: Geographic Interpolation

Since Census geographies don’t perfectly align with MAZ/TAZ boundaries, the system uses sophisticated interpolation:
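One common way to move counts between misaligned geographies is weighted apportionment: each source unit's count is split across the target zones it overlaps, using shares that sum to one. The sketch below is illustrative only. The `apportion` function, the zone identifiers, and the share values are hypothetical, and the project's actual interpolation method may differ.

```python
from collections import defaultdict


def apportion(source_counts, shares):
    """Split source-unit counts across target zones.

    source_counts: {source_id: count}
    shares: iterable of (source_id, target_id, share); the shares for
            each source_id are assumed to sum to 1.0.
    """
    target_totals = defaultdict(float)
    for source_id, target_id, share in shares:
        target_totals[target_id] += source_counts.get(source_id, 0) * share
    return dict(target_totals)


# Hypothetical block-group household counts and block-group-to-TAZ shares
bg_households = {"BG_A": 400, "BG_B": 250}
bg_to_taz = [
    ("BG_A", "TAZ_1", 0.75),  # 75% of BG_A's households fall in TAZ_1
    ("BG_A", "TAZ_2", 0.25),
    ("BG_B", "TAZ_2", 1.0),   # BG_B lies entirely inside TAZ_2
]

taz_households = apportion(bg_households, bg_to_taz)
```

Because every source unit's shares sum to one, the regional total is preserved: the TAZ values here add back up to the 650 households in the two block groups.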

Step 3: Data Validation and Quality Control

Each data source undergoes rigorous validation:


Geographic Framework and Processing

MAZ (Micro Analysis Zone) Level

MAZs represent the finest geographic resolution in the model, with approximately 39,586 zones covering the 9-county Bay Area.

MAZ Control Generation Process:

  1. Base Data: 2020 Decennial Census at block level
  2. Geographic Aggregation: Blocks aggregated to MAZ using definitive crosswalk
  3. Control Types Generated:
    • Households (num_hh): Direct aggregation of occupied housing units
    • Population (total_pop): Total persons in households and group quarters
    • Group Quarters: Detailed breakdown by institutional type

MAZ Group Quarters Processing:

Group quarters represent persons living in institutional or communal arrangements. The system processes three categories:

Important Note: Institutional group quarters (nursing homes, prisons, hospitals) are excluded as they don’t participate in the regular housing market.

TAZ (Traffic Analysis Zone) Level

TAZs serve as the primary geography for transportation modeling, with approximately 4,734 zones.

TAZ Control Generation Process:

  1. Base Data: ACS 2023 5-Year estimates at tract and block group level
  2. Geographic Processing: Sophisticated interpolation from Census geographies to TAZ
  3. Control Categories:

    Household Size Distribution:

    • hh_size_1: Single-person households
    • hh_size_2: Two-person households
    • hh_size_3: Three-person households
    • hh_size_4: Four-person households
    • hh_size_5: Five-person households
    • hh_size_6_plus: Six or more person households

    Household Income Distribution (in 2023 dollars):

    • inc_lt_20k: Less than $20,000
    • inc_20k_45k: $20,000 to $44,999
    • inc_45k_60k: $45,000 to $59,999
    • inc_60k_75k: $60,000 to $74,999
    • inc_75k_100k: $75,000 to $99,999
    • inc_100k_150k: $100,000 to $149,999
    • inc_150k_200k: $150,000 to $199,999
    • inc_200k_plus: $200,000 or more

    Age Distribution:

    • pers_age_00_19: Children and teens (0-19 years)
    • pers_age_20_34: Young adults (20-34 years)
    • pers_age_35_64: Middle-aged adults (35-64 years)
    • pers_age_65_plus: Seniors (65+ years)

    Worker Categories:

    • hh_wrks_0: Households with no workers
    • hh_wrks_1: Households with one worker
    • hh_wrks_2: Households with two workers
    • hh_wrks_3_plus: Households with three or more workers
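The income categories listed above line up with the bracket boundaries of ACS table B19001, so each control can be expressed as a sum of B19001 estimate columns. The mapping below is an illustration derived from the published B19001 brackets, not a copy of the project's configuration. The `collapse_income` function and the input row are hypothetical.

```python
# ACS B19001 estimate columns (002-017) grouped into the TAZ income controls.
INCOME_CONTROL_COLUMNS = {
    "inc_lt_20k": ["B19001_002E", "B19001_003E", "B19001_004E"],
    "inc_20k_45k": ["B19001_005E", "B19001_006E", "B19001_007E",
                    "B19001_008E", "B19001_009E"],
    "inc_45k_60k": ["B19001_010E", "B19001_011E"],
    "inc_60k_75k": ["B19001_012E"],
    "inc_75k_100k": ["B19001_013E"],
    "inc_100k_150k": ["B19001_014E", "B19001_015E"],
    "inc_150k_200k": ["B19001_016E"],
    "inc_200k_plus": ["B19001_017E"],
}


def collapse_income(row):
    """Collapse the 16 B19001 brackets in `row` into the 8 income controls."""
    return {
        control: sum(row[col] for col in cols)
        for control, cols in INCOME_CONTROL_COLUMNS.items()
    }


# Made-up tract record: column 002 holds 20 households, 003 holds 30, and so on
tract = {f"B19001_{i:03d}E": 10 * i for i in range(2, 18)}
controls = collapse_income(tract)
```

Every B19001 bracket is assigned to exactly one control, so the collapsed categories sum to the same household total as the original brackets.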

County Level

Counties provide the regional context for labor market analysis, covering the 9-county Bay Area region.

County Control Generation:

Occupation Categories (based on ACS occupation classification):


Scaling Methodologies

The Bay Area PopulationSim model applies two scaling steps so that controls reflect current conditions while remaining internally consistent: regional ACS scaling of TAZ household categories, and county household scaling of person occupation controls.

Regional ACS Scaling (TAZ Household Categories)

Purpose and Methodology

TAZ-level household controls are scaled to match ACS 2023 1-year regional totals, ensuring that synthetic households reflect the most current demographic conditions.

Target: 3,031,788 total households (ACS 2023 1-year estimate for 9-county region)

Scaling Process:

  1. Category Total Calculation: Sum all TAZ controls within each category
    Example - Household Size:
    Original totals: hh_size_1=794,695 + hh_size_2=978,628 + ... = 3,039,990
    Target total: 3,031,788 (from ACS 1-year)
    Scaling factor: 3,031,788 ÷ 3,039,990 = 0.997302
    
  2. Proportional Scaling: Apply scaling factor to preserve relative distributions
    Scaled values:
    hh_size_1: 794,695 × 0.997302 ≈ 792,551
    hh_size_2: 978,628 × 0.997302 ≈ 975,988
    
  3. Integer Rounding: Convert to whole households while maintaining totals
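The three steps above can be sketched as follows. The source does not state which rounding rule is used, so this example assumes largest-remainder rounding, one common way to keep integer totals exact. The `scale_to_target` function and the category values are illustrative.

```python
def scale_to_target(values, target_total):
    """Scale category counts proportionally so they sum to target_total.

    Rounding uses the largest-remainder method (an assumption here):
    floor every scaled value, then hand the leftover units to the
    categories with the largest fractional parts.
    """
    factor = target_total / sum(values.values())
    scaled = {k: v * factor for k, v in values.items()}
    result = {k: int(s) for k, s in scaled.items()}  # floor (values are >= 0)
    shortfall = target_total - sum(result.values())
    by_remainder = sorted(scaled, key=lambda k: scaled[k] - result[k], reverse=True)
    for k in by_remainder[:shortfall]:
        result[k] += 1
    return result


# Illustrative category totals (not the real regional values)
original = {"hh_size_1": 300, "hh_size_2": 450, "hh_size_3": 260}
scaled = scale_to_target(original, 1000)
```

Plain rounding of each category could leave the sum one or two households off the target. Distributing the remainder explicitly guarantees that the scaled categories sum to exactly the target total.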

Categories Scaled:

County Household Scaling (Person Occupation Controls)

Purpose and Innovation

County occupation controls use a novel scaling approach that leverages household growth patterns as a proxy for worker growth, based on the assumption that worker-to-household ratios by county remain relatively stable between 2020 and 2023.

Scaling Factor Derivation:

County household scaling factors are the ratio of 2023 ACS household counts to 2020 Census household counts:

| County        | 2020 Census HH | 2023 ACS HH | Scaling Factor |
|---------------|----------------|-------------|----------------|
| Alameda       | 591,636        | 646,309     | 1.0924         |
| Contra Costa  | 407,029        | 432,056     | 1.0615         |
| Marin         | 104,167        | 112,359     | 1.0786         |
| Napa          | 49,738         | 56,046      | 1.1268         |
| San Francisco | 371,851        | 418,143     | 1.1245         |
| San Mateo     | 269,417        | 288,325     | 1.0702         |
| Santa Clara   | 656,063        | 703,922     | 1.0729         |
| Solano        | 155,924        | 165,626     | 1.0622         |
| Sonoma        | 187,701        | 209,002     | 1.1135         |
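The factors in the table can be reproduced directly from the two household columns. The short sketch below recomputes them and confirms that the 2023 ACS county counts sum to the regional target of 3,031,788 households. The variable names are illustrative.

```python
# (2020 Census households, 2023 ACS households) from the table above
county_households = {
    "Alameda": (591_636, 646_309),
    "Contra Costa": (407_029, 432_056),
    "Marin": (104_167, 112_359),
    "Napa": (49_738, 56_046),
    "San Francisco": (371_851, 418_143),
    "San Mateo": (269_417, 288_325),
    "Santa Clara": (656_063, 703_922),
    "Solano": (155_924, 165_626),
    "Sonoma": (187_701, 209_002),
}

# Scaling factor = 2023 ACS households / 2020 Census households
scaling_factors = {
    county: round(acs_2023 / census_2020, 4)
    for county, (census_2020, acs_2023) in county_households.items()
}

# The county ACS counts should add up to the regional household target
regional_total_2023 = sum(acs for _, acs in county_households.values())
```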

Application to Occupation Controls:

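# Example for Alameda County
original_management = 118_550   # unscaled management workers
scaling_factor = 1.0924         # Alameda household scaling factor
scaled_management = round(original_management * scaling_factor)  # 129,504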

This approach recognizes that while we lack current occupation data at the county level, household growth patterns provide a reasonable proxy for economic and demographic change.

Validation and Quality Control

Pre-Scaling Validation:

Post-Scaling Validation:


Quality Assurance and Validation

Multi-Level Validation Framework

Level 1: Data Integrity Checks

Level 2: Geographic Consistency

Level 3: Temporal Consistency

Level 4: Cross-Category Validation

Error Detection and Resolution

Automated Quality Checks:

# Example validation check: flag household category totals that differ by more than 1%
pct_diff = abs(total_households_income - total_households_size) / total_households_income * 100
if pct_diff > 1.0:
    logger.warning(f"Household category totals differ by {pct_diff:.1f}%")
    apply_harmonization()

Manual Review Triggers:


Output Files and Structure

Primary PopulationSim Input Files

MAZ Controls: maz_marginals_hhgq.csv

MAZ_NODE,numhh_gq,gq_type_univ,gq_type_noninst
10001,185,0,0
10002,221,0,3
10003,181,0,0
...

Structure:

TAZ Controls: taz_marginals_hhgq.csv

TAZ_NODE,inc_lt_20k,inc_20k_45k,inc_45k_60k,...,hh_size_1_gq
301001,23,45,31,...,156
301002,45,67,42,...,203
...

Structure:

County Controls: county_marginals.csv

COUNTY,pers_occ_management,pers_occ_professional,pers_occ_services,pers_occ_retail,pers_occ_manual_military
1,129499,378541,201345,123456,98765
2,94523,267834,145678,87654,76543
...

Structure:

Supporting Files

Geographic Crosswalk: geo_cross_walk_tm2_maz.csv

Essential for linking different geographic levels:

MAZ_NODE,TAZ_NODE,COUNTY,county_name,PUMA
333453,301054,1,Alameda,112
334567,301055,1,Alameda,112
...
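As an illustration of how the crosswalk links the geographic levels, the sketch below rolls MAZ-level values up to TAZ and county using the column names shown above. The household counts and the `roll_up` helper are hypothetical. Only the crosswalk header and its two sample rows come from this document.

```python
import csv
import io
from collections import defaultdict

# Header and sample rows from the crosswalk excerpt above
crosswalk_csv = """MAZ_NODE,TAZ_NODE,COUNTY,county_name,PUMA
333453,301054,1,Alameda,112
334567,301055,1,Alameda,112
"""

# Hypothetical MAZ household counts, keyed by MAZ_NODE as read from the CSV
maz_households = {"333453": 120, "334567": 80}


def roll_up(crosswalk_rows, maz_values, level):
    """Sum MAZ-level values to the geography named by `level`."""
    totals = defaultdict(int)
    for row in crosswalk_rows:
        totals[row[level]] += maz_values.get(row["MAZ_NODE"], 0)
    return dict(totals)


rows = list(csv.DictReader(io.StringIO(crosswalk_csv)))
taz_totals = roll_up(rows, maz_households, "TAZ_NODE")
county_totals = roll_up(rows, maz_households, "COUNTY")
```

Comparing such roll-ups against the TAZ and county control totals is one simple way to check that the levels are consistent with each other.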

Validation Files

County Summary: county_summary_2020_2023.csv

Documents scaling factors and validation statistics:

county_fips,county_name,hh_2020_census,hh_2023_acs,scaling_factor
001,Alameda,591636,646309,1.0924
013,Contra Costa,407029,432056,1.0615
...

County Targets: county_targets_2023.csv

Regional validation targets:

geography,variable,total
regional,households,3031788
regional,population,7508799
001,households,646309
...

Technical Implementation

System Architecture

Modular Design

The control generation system is built with modular components:

Configuration-Driven Processing

All control definitions are specified in configuration files:

# Example from unified_tm2_config.py
HOUSEHOLD_INCOME_CONTROLS = {
    'inc_lt_20k': {'min': 0, 'max': 19999},
    'inc_20k_45k': {'min': 20000, 'max': 44999},
    # ... additional categories
}
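A definition table like this can drive processing without hard-coded thresholds. The sketch below looks up the category for a single household income using only the two entries shown in the excerpt. The `income_category` function is hypothetical and is not part of unified_tm2_config.py.

```python
HOUSEHOLD_INCOME_CONTROLS = {
    'inc_lt_20k': {'min': 0, 'max': 19999},
    'inc_20k_45k': {'min': 20000, 'max': 44999},
    # remaining categories omitted, as in the excerpt above
}


def income_category(income, controls=HOUSEHOLD_INCOME_CONTROLS):
    """Return the name of the control whose [min, max] range contains income.

    Returns None when no configured range matches, for example for
    incomes above the last category defined in this truncated table.
    """
    for name, bounds in controls.items():
        if bounds['min'] <= income <= bounds['max']:
            return name
    return None


low = income_category(15_000)
mid = income_category(20_000)
unmatched = income_category(50_000)
```

Keeping the boundaries in one dictionary means a change to the category definitions only has to be made in the configuration, not in the code that applies them.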

Error Handling and Logging

Comprehensive logging tracks every step:

Performance Optimizations

Caching Strategy

Parallel Processing

Where possible, the system uses parallel processing:

Future Enhancements

Planned Improvements

Research Opportunities


Conclusion

The Bay Area PopulationSim control generation system represents a sophisticated approach to creating statistical targets for synthetic population generation. By integrating multiple Census data sources, employing advanced geographic processing, and implementing rigorous quality controls, the system produces high-quality controls that accurately reflect the Bay Area’s complex demographic and economic landscape.

The system’s strength lies in its ability to balance accuracy, currency, and geographic detail while maintaining internal consistency across multiple levels of geography and demographic categories. The innovative county scaling methodology demonstrates how creative approaches can overcome data limitations to produce more accurate models.

As the Bay Area continues to evolve rapidly, this robust control generation framework provides the foundation for synthetic populations that can accurately represent the region’s diverse communities and support informed planning and policy decisions.


For technical details on running the control generation system, see CONTROL_GENERATION.md. For environment setup, see ENVIRONMENT_SETUP.md.