Control Generation Step: Creating Baseyear Control Files
This step generates the baseyear control files required for the Bay Area PopulationSim model, using ACS 2023 and 2020 Decennial Census data. Controls are produced at the MAZ, TAZ, and county levels, and are used to guide the synthetic population generation process.
What This Step Does
create_baseyear_controls_23_tm2.py:
- Downloads and caches Census data (ACS 2023, Decennial 2020).
- Interpolates geographies as needed to match the MAZ/TAZ system.
- Processes and scales controls at MAZ, TAZ, and county levels, using config-driven definitions.
- Applies county-level scaling to ensure consistency with ACS 2023 county targets.
- Validates and harmonizes controls for internal consistency.
- Outputs all required marginal and summary files for PopulationSim and TM2.
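The county-level scaling step can be pictured with a minimal sketch. It assumes simple proportional scaling (each county's rows multiplied by target total divided by current total); the function name, the sample rows, and the target values are hypothetical and are not taken from create_baseyear_controls_23_tm2.py. Plain dictionaries stand in for whatever table structure the script uses.

```python
def scale_to_county_targets(rows, targets, control="num_hh"):
    """Scale `control` so each county's total matches its target.

    rows: list of dicts with a "COUNTY" key and the control column.
    targets: dict mapping COUNTY -> target total (hypothetical values).
    """
    # First pass: current total of the control in each county.
    totals = {}
    for row in rows:
        totals[row["COUNTY"]] = totals.get(row["COUNTY"], 0) + row[control]

    # Second pass: apply one scaling factor per county.
    scaled = []
    for row in rows:
        total = totals[row["COUNTY"]]
        # Leave rows untouched when the county has no base to scale from.
        factor = targets[row["COUNTY"]] / total if total else 1.0
        scaled.append({**row, control: row[control] * factor})
    return scaled


maz_rows = [
    {"MAZ_NODE": 1, "COUNTY": 1, "num_hh": 100},
    {"MAZ_NODE": 2, "COUNTY": 1, "num_hh": 300},
    {"MAZ_NODE": 3, "COUNTY": 2, "num_hh": 50},
]
county_targets = {1: 440, 2: 50}  # illustrative targets, not real ACS values
scaled_rows = scale_to_county_targets(maz_rows, county_targets)
```

The scaled values are floats; any rounding back to integer controls would be a separate step that this sketch does not cover.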
Group Quarters Processing (Updated October 2025)
Important Change: Group quarters controls are now defined at the person level, matching the Census data structure, to keep the data consistent and improve PopulationSim convergence.
Background
Census provides group quarters data at the person level (P5 series tables), while PopulationSim can handle both household-level and person-level controls. The system now uses person-level GQ controls to directly match Census data structure, eliminating conversion assumptions and improving accuracy.
Person-Level Group Quarters Approach
Control Structure (Person Level):
- pers_gq_university: University GQ persons (persons.gq_type==1)
- pers_gq_noninstitutional: Military + other GQ persons combined (persons.gq_type==2)
Census Data Sources:
- University GQ: Census P5_008N (College/university student housing persons)
- Noninstitutional GQ: Census P5_009N + P5_011N + P5_012N (Military quarters + other noninstitutional GQ persons)
Final Group Quarters Inclusion Policy
- ✅ INCLUDED: University/college housing (dorms, student housing) - P5_008N
- ✅ INCLUDED: Military barracks and base housing - P5_009N
- ✅ INCLUDED: Other non-institutional group quarters (group homes, worker dormitories, religious quarters) - P5_011N, P5_012N
- ❌ EXCLUDED: Nursing homes and long-term care facilities - P5_010N
- ❌ EXCLUDED: Correctional institutions and prisons - P5_002N to P5_007N
- ❌ EXCLUDED: Mental health institutions - P5_002N to P5_007N
- ❌ EXCLUDED: Other institutional care facilities - P5_002N to P5_007N
Person-Level Control Structure
Person-level controls count individuals directly from Census data:
- pers_gq_university: Count of persons in university GQ (P5_008N)
- pers_gq_noninstitutional: Count of persons in military + other noninstitutional GQ (P5_009N + P5_011N + P5_012N)
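As a rough illustration, the two person-level controls can be derived from one Census record by summing the included P5 variables and ignoring the excluded ones. The variable codes below are copied from this page as written; the function name and the sample record are hypothetical.

```python
# Mapping of control name -> Census P5 variables, as listed on this page.
INCLUDED_GQ = {
    "pers_gq_university": ["P5_008N"],
    "pers_gq_noninstitutional": ["P5_009N", "P5_011N", "P5_012N"],
}


def gq_person_controls(census_row):
    """Sum the included P5 variables for each person-level GQ control.

    Variables not named in INCLUDED_GQ (the excluded institutional
    categories) are simply never read, so they cannot leak into a control.
    """
    return {
        control: sum(census_row.get(col, 0) for col in cols)
        for control, cols in INCLUDED_GQ.items()
    }


# Hypothetical block-level record; the values are made up for illustration.
block = {
    "P5_002N": 40,   # excluded
    "P5_008N": 120,  # university
    "P5_009N": 5,    # military
    "P5_010N": 30,   # excluded
    "P5_011N": 12,   # other noninstitutional
    "P5_012N": 8,    # other noninstitutional
}
controls = gq_person_controls(block)
# controls == {"pers_gq_university": 120, "pers_gq_noninstitutional": 25}
```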
Household Count Integration
The numhh_gq control combines:
- Regular households (num_hh from Census H1_002N)
- GQ persons treated as household units (person counts as housing demand proxy)
This approach treats each GQ person as representing potential housing demand while maintaining person-level control accuracy.
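A minimal sketch of that combination, assuming numhh_gq is simply num_hh plus the included GQ person counts and that military persons have already been folded into hh_gq_other_nonins. The function name and the sample row are hypothetical.

```python
def add_numhh_gq(row):
    """Return a copy of `row` with numhh_gq = households + GQ persons.

    Each GQ person counts as one household-like unit.
    """
    gq_persons = row["hh_gq_university"] + row["hh_gq_other_nonins"]
    return {**row, "numhh_gq": row["num_hh"] + gq_persons}


# Made-up MAZ record for illustration.
maz = {"MAZ_NODE": 10, "num_hh": 200, "hh_gq_university": 35, "hh_gq_other_nonins": 15}
maz_hhgq = add_numhh_gq(maz)
# maz_hhgq["numhh_gq"] == 250
```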
Column Naming Standards
Geographic Column Naming Convention
Standardized Column Names:
- MAZ_NODE: Standardized MAZ identifier used throughout all crosswalk files
- TAZ_NODE: Standardized TAZ identifier used throughout all crosswalk files
- COUNTY: County identifier (numeric, e.g., 1-9 for Bay Area counties)
- county_name: County name (text, e.g., “Alameda”, “San Francisco”)
- PUMA: Public Use Microdata Area identifier
Legacy Column Names (Deprecated):
- MAZ: Old MAZ column name (replaced by MAZ_NODE)
- TAZ: Old TAZ column name (replaced by TAZ_NODE)
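For files that still carry the legacy names, the rename amounts to a key mapping. The sketch below is illustrative only; the helper name is hypothetical and is not part of tm2_control_utils.

```python
LEGACY_TO_STANDARD = {"MAZ": "MAZ_NODE", "TAZ": "TAZ_NODE"}


def standardize_columns(row):
    """Rename legacy geography keys; all other keys pass through unchanged."""
    return {LEGACY_TO_STANDARD.get(key, key): value for key, value in row.items()}


legacy_row = {"MAZ": 101, "TAZ": 12, "COUNTY": 1}
standard_row = standardize_columns(legacy_row)
# standard_row == {"MAZ_NODE": 101, "TAZ_NODE": 12, "COUNTY": 1}
```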
Control File Column Structure
MAZ Controls (maz_marginals.csv and maz_marginals_hhgq.csv):
- MAZ_NODE: MAZ identifier (matches MAZ_NODE from crosswalk)
- num_hh: Number of households (Census H1_002N)
- total_pop: Total population
- hh_gq_university: University group quarters persons (P5_008N)
- hh_gq_military: Military group quarters persons (P5_009N) [combined into other]
- hh_gq_other_nonins: Other noninstitutional group quarters persons (P5_011N, P5_012N)
- numhh_gq: Combined household + GQ count (for PopulationSim person-as-household approach)
TAZ Controls (taz_marginals.csv and taz_marginals_hhgq.csv):
- TAZ_NODE: TAZ identifier (matches TAZ_NODE from crosswalk)
- num_hh: Number of households
- hh_size_1 through hh_size_4plus: Household size categories
- hh_inc_0_30k through hh_inc_200kplus: Income categories
- pers_age_00_17 through pers_age_65plus: Age categories
- pers_workers_0 through pers_workers_3plus: Worker categories
- hh_size_1_gq: Size-1 households + GQ persons (for HHGQ integration)
County Controls (county_marginals.csv):
- COUNTY: County identifier (1-9)
- pers_occ_management: Management/business/finance workers
- pers_occ_professional: Professional/technical workers
- pers_occ_services: Service workers
- pers_occ_retail: Sales and office workers
- pers_occ_manual_military: Manual/production + military workers (combined)
Geographic Crosswalk (geo_cross_walk_tm2_maz.csv):
- MAZ_NODE: MAZ identifier
- TAZ_NODE: TAZ identifier
- COUNTY: County code (1-9)
- county_name: County name
- PUMA: PUMA identifier
Column Naming Migration (October 2025)
What Changed:
The system was updated to use consistent MAZ_NODE/TAZ_NODE naming throughout all geographic crosswalk files. The rebuild_maz_taz_all_geog_file() function in tm2_control_utils/config_census.py was updated to ensure consistent column naming.
Migration Impact:
- All geographic aggregation operations now use standardized column names
- Census geographic matching uses consistent MAZ_NODE/TAZ_NODE references
- Control validation and hierarchical consistency checks work with unified naming
- PopulationSim input files use the standardized column structure
Validation:
The mazs_tazs_all_geog.csv crosswalk file was rebuilt with 109,228 records using the new naming convention, ensuring all geographic operations use consistent identifiers.
Group Quarters Control Integration (October 2025)
Military GQ Combination: As of October 2025, military group quarters persons are automatically combined into the “other noninstitutional” category to match the seed population encoding structure:
- Before combination: Separate hh_gq_military and hh_gq_other_nonins columns
- After combination: Military persons (1,684) combined into hh_gq_other_nonins (final total: 76,071)
- File cleanup: Intermediate maz_marginals.csv automatically removed, leaving only maz_marginals_hhgq.csv
Processing Steps:
- Generate separate military and other noninstitutional GQ controls from Census P5 data
- Validate each control category individually
- Combine military into other noninstitutional to match seed population structure
- Create HHGQ-integrated files for PopulationSim consumption
- Clean up intermediate files to maintain organized workflow
This ensures the control structure exactly matches the seed population GQ encoding while preserving the underlying Census data accuracy.
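The combination step itself reduces to adding one column into another and dropping the source column. The sketch below shows that operation on a single record; the function name and sample values are hypothetical, and the script's own implementation may differ.

```python
def combine_military_into_other(row):
    """Fold hh_gq_military into hh_gq_other_nonins and drop the military key."""
    combined = dict(row)  # copy so the input record is not modified
    military = combined.pop("hh_gq_military", 0)
    combined["hh_gq_other_nonins"] = combined.get("hh_gq_other_nonins", 0) + military
    return combined


# Made-up MAZ record for illustration.
before = {"MAZ_NODE": 7, "hh_gq_university": 10, "hh_gq_military": 4, "hh_gq_other_nonins": 21}
after = combine_military_into_other(before)
# after == {"MAZ_NODE": 7, "hh_gq_university": 10, "hh_gq_other_nonins": 25}
```

The total GQ person count is the same before and after, which is the property that preserves the Census totals.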
Column Naming Quick Reference
| Geography Level | File | Key ID Column | Standard Name | Legacy Name |
|---|---|---|---|---|
| MAZ | maz_marginals_hhgq.csv | MAZ identifier | MAZ | MAZ |
| TAZ | taz_marginals_hhgq.csv | TAZ identifier | TAZ | TAZ |
| County | county_marginals.csv | County identifier | COUNTY | N/A |
| Crosswalk | geo_cross_walk_tm2_maz.csv | MAZ identifier | MAZ_NODE | MAZ |
| Crosswalk | geo_cross_walk_tm2_maz.csv | TAZ identifier | TAZ_NODE | TAZ |
Important:
- Control files (*_marginals_hhgq.csv) use MAZ/TAZ as geography identifiers
- Crosswalk files (geo_cross_walk_tm2_maz.csv) use MAZ_NODE/TAZ_NODE as geography identifiers
- PopulationSim config files (controls.csv) must use MAZ/TAZ to match the control file structure
- The system handles this mapping automatically during geographic aggregation operations
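To make the MAZ versus MAZ_NODE mapping concrete, here is a sketch of one aggregation that joins a control file keyed by MAZ to a crosswalk keyed by MAZ_NODE and sums a control up to TAZ_NODE. It is an assumption about how such a join could look, not the pipeline's actual code; the function name and sample data are hypothetical.

```python
def aggregate_maz_to_taz(maz_controls, crosswalk, control="num_hh"):
    """Sum a MAZ-level control to TAZ level via the crosswalk.

    maz_controls rows are keyed by "MAZ" (control-file naming);
    crosswalk rows use "MAZ_NODE" and "TAZ_NODE" (crosswalk naming).
    """
    maz_to_taz = {row["MAZ_NODE"]: row["TAZ_NODE"] for row in crosswalk}
    taz_totals = {}
    for row in maz_controls:
        taz = maz_to_taz[row["MAZ"]]  # a KeyError here flags a MAZ missing from the crosswalk
        taz_totals[taz] = taz_totals.get(taz, 0) + row[control]
    return taz_totals


maz_controls = [
    {"MAZ": 1, "num_hh": 100},
    {"MAZ": 2, "num_hh": 300},
    {"MAZ": 3, "num_hh": 50},
]
crosswalk = [
    {"MAZ_NODE": 1, "TAZ_NODE": 10},
    {"MAZ_NODE": 2, "TAZ_NODE": 10},
    {"MAZ_NODE": 3, "TAZ_NODE": 20},
]
taz_totals = aggregate_maz_to_taz(maz_controls, crosswalk)
# taz_totals == {10: 400, 20: 50}
```

Comparing these sums against the TAZ-level file is one simple way to check hierarchical consistency between the two levels.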
Inputs
- ACS 2023 5-year and 1-year estimates (tract, block group, county)
- 2020 Decennial Census data (block level)
- Geographic crosswalks (from the crosswalk step)
- Configuration in unified_tm2_config.py and tm2_control_utils/config_census.py
Outputs
PopulationSim Input Files (Primary)
- maz_marginals_hhgq.csv: MAZ-level controls with integrated households and group quarters
- taz_marginals_hhgq.csv: TAZ-level controls with HHGQ integration
- county_marginals.csv: County-level occupation controls
Supporting Files
- geo_cross_walk_tm2_maz.csv: Geographic crosswalk with standardized MAZ_NODE/TAZ_NODE columns
- maz_data.csv, maz_data_withDensity.csv: Land use and density files for TM2
- county_summary_2020_2023.csv: County scaling factors and validation statistics
- county_targets_2023.csv: Target totals for validation
File Processing Notes
- Intermediate files: maz_marginals.csv and taz_marginals.csv are generated during processing but automatically removed after HHGQ integration
- Final structure: PopulationSim uses only the *_hhgq.csv files, which contain the integrated household+GQ controls
- File naming: All output files use the standardized MAZ_NODE/TAZ_NODE column naming convention
How to Run
From the bay_area directory, run:
python create_baseyear_controls_23_tm2.py
This will generate all control and summary files in the configured output directory.
Notes
- The enhanced crosswalk (geo_cross_walk_tm2_block10.csv) from the crosswalk step is required as input.
- If you update any Census data or crosswalks, you must re-run this step.
- For more details on configuration and file paths, see ENVIRONMENT_SETUP.md and HOW_TO_RUN.md.
Return to the main documentation index for other pipeline steps.