TM2 PopulationSim File Flow

File Flow Overview

This document describes how data flows through the TM2 PopulationSim pipeline, from the input sources to the final synthetic population.

[External Data] → [Pipeline Steps] → [Intermediate Files] → [Final Outputs]

Input Data Sources

Census Bureau Data

Geographic Reference Data

Configuration Files

Step-by-Step File Flow

Step 1: PUMS Download

External → Pipeline → Output
Census API → download_2023_5year_pums.py → households_2023_raw.csv
                                        → persons_2023_raw.csv

Process:

  1. Queries Census API for Bay Area PUMAs (104 PUMAs)
  2. Downloads household and person records
  3. Applies inflation adjustment (2023→2010 dollars)
  4. Saves raw microdata files

Key Files:

Step 2: Geographic Crosswalk

Reference Data → Pipeline → Output
blocks_mazs_tazs.csv → create_tm2_crosswalk.py → geo_cross_walk_tm2_updated.csv
mazs_tazs_all_geog.csv

Process:

  1. Loads TM2 zone definition files
  2. Creates MAZ-TAZ-PUMA-County relationships
  3. Resolves multi-PUMA TAZs (assigns to dominant PUMA)
  4. Converts FIPS county codes to sequential 1-9 system
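
The multi-PUMA resolution in item 3 could be sketched as below. This is a minimal illustration and not the code in create_tm2_crosswalk.py: the column names (MAZ, TAZ, PUMA), the sample PUMA codes, and the choice of "most MAZs" as the measure of dominance are all assumptions.

```python
from collections import Counter, defaultdict

def dominant_puma_by_taz(maz_rows):
    """Return {taz: puma}, picking the PUMA that contains the most MAZs of each TAZ.

    Ties are broken by the smallest PUMA code so the result is deterministic.
    """
    counts = defaultdict(Counter)
    for row in maz_rows:
        counts[row["TAZ"]][row["PUMA"]] += 1
    return {
        taz: min(puma_counts, key=lambda p: (-puma_counts[p], p))
        for taz, puma_counts in counts.items()
    }

# Hypothetical rows: TAZ 10 straddles two PUMAs, TAZ 11 sits in one.
rows = [
    {"MAZ": 1, "TAZ": 10, "PUMA": "07501"},
    {"MAZ": 2, "TAZ": 10, "PUMA": "07501"},
    {"MAZ": 3, "TAZ": 10, "PUMA": "07502"},
    {"MAZ": 4, "TAZ": 11, "PUMA": "07502"},
]
taz_to_puma = dominant_puma_by_taz(rows)
```

A population- or area-weighted measure of dominance would fit the same structure by adding a weight instead of 1 for each row.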

Key Files:

Step 3: Seed Population

PUMS Data + Crosswalk → Pipeline → Seed Files
households_2023_raw.csv → create_seed_population_tm2_refactored.py → seed_households.csv
persons_2023_raw.csv                                               → seed_persons.csv
geo_cross_walk_tm2_updated.csv

Process:

  1. Assigns households to PUMAs (keeps original assignment)
  2. Links persons to households via unique IDs
  3. Adds geographic fields (PUMA, county)
  4. Handles Group Quarters population
  5. Creates PopulationSim-compatible formats
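
Linking persons to households (item 2) might look like the sketch below. It assumes the PUMS SERIALNO field is the join key and that a sequential integer hh_id is the unique ID written to the seed files; the hh_id name and the handling of unmatched persons are assumptions, not taken from create_seed_population_tm2_refactored.py.

```python
def link_persons_to_households(households, persons):
    """Give each household a sequential hh_id and copy it onto its persons.

    Persons whose SERIALNO matches no household are returned separately
    so they can be inspected rather than silently dropped.
    """
    serial_to_id = {}
    seed_households = []
    for hh_id, hh in enumerate(households, start=1):
        serial_to_id[hh["SERIALNO"]] = hh_id
        seed_households.append({**hh, "hh_id": hh_id})

    seed_persons, orphans = [], []
    for person in persons:
        hh_id = serial_to_id.get(person["SERIALNO"])
        if hh_id is None:
            orphans.append(person)
        else:
            seed_persons.append({**person, "hh_id": hh_id})
    return seed_households, seed_persons, orphans

# Hypothetical records; real PUMS rows carry many more columns.
hhs = [{"SERIALNO": "A1"}, {"SERIALNO": "B2"}]
pers = [
    {"SERIALNO": "A1", "SPORDER": 1},
    {"SERIALNO": "A1", "SPORDER": 2},
    {"SERIALNO": "Z9", "SPORDER": 1},  # no matching household
]
seed_hh, seed_pers, unmatched = link_persons_to_households(hhs, pers)
```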

Key Files:

Step 4: Marginal Controls

Census API + Config → Pipeline → Control Files
ACS Tables → create_baseyear_controls_23_tm2.py → maz_marginals.csv
controls.csv                                    → taz_marginals.csv
settings.yaml                                   → county_marginals.csv

Process:

  1. Downloads ACS table data via Census API
  2. Processes control specifications from controls.csv
  3. Aggregates to MAZ, TAZ, and County levels
  4. Handles Group Quarters controls separately
  5. Creates age-income cross-tabulations
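
The aggregation in item 3 can be illustrated with a small roll-up function. This sketch assumes control values are already available per MAZ and that the crosswalk maps each MAZ to a TAZ and a county; the control name num_hh and the data layout are hypothetical, and create_baseyear_controls_23_tm2.py may start from different Census geographies.

```python
from collections import defaultdict

def aggregate_controls(maz_controls, crosswalk, level):
    """Sum MAZ-level control values up to the given crosswalk level.

    maz_controls: {maz: {control_name: value}}
    crosswalk:    {maz: {"TAZ": taz_id, "COUNTY": county_id}}
    level:        "TAZ" or "COUNTY"
    """
    totals = defaultdict(lambda: defaultdict(int))
    for maz, controls in maz_controls.items():
        zone = crosswalk[maz][level]
        for name, value in controls.items():
            totals[zone][name] += value
    return {zone: dict(values) for zone, values in totals.items()}

# Hypothetical values for three MAZs in two TAZs of one county.
maz_controls = {
    101: {"num_hh": 120, "gq_pop": 5},
    102: {"num_hh": 80, "gq_pop": 0},
    201: {"num_hh": 50, "gq_pop": 12},
}
crosswalk = {
    101: {"TAZ": 1, "COUNTY": 4},
    102: {"TAZ": 1, "COUNTY": 4},
    201: {"TAZ": 2, "COUNTY": 4},
}
taz_totals = aggregate_controls(maz_controls, crosswalk, "TAZ")
county_totals = aggregate_controls(maz_controls, crosswalk, "COUNTY")
```

Summing the same MAZ values to every level keeps the TAZ and county totals consistent with each other by construction.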

Key Files:

Step 5: PopulationSim Synthesis

Seed + Controls + Config → PopulationSim → Synthetic Population
seed_households.csv → run_populationsim_synthesis.py → synthetic_households.csv
seed_persons.csv                                     → synthetic_persons.csv
*_marginals.csv                                      → summary_*.csv
settings.yaml

Process:

  1. Loads seed population and marginal controls
  2. Runs iterative proportional fitting (IPF) algorithm
  3. Balances household weights to match controls
  4. Integerizes weights to whole households
  5. Assigns households to specific MAZs
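
To illustrate the idea behind item 2, here is the textbook two-way form of iterative proportional fitting. It is only a sketch: PopulationSim balances weights on individual seed households against many controls at once, and the function name, tolerance, and example numbers here are assumptions.

```python
def ipf(seed, row_targets, col_targets, max_iter=100, tol=1e-9):
    """Scale a 2-D table so its row and column sums match the targets.

    Alternates row scaling and column scaling until the row sums are
    within tol of their targets (column sums are exact after each pass).
    """
    table = [row[:] for row in seed]
    for _ in range(max_iter):
        # Row step: rescale each row to its target total.
        for i, target in enumerate(row_targets):
            total = sum(table[i])
            if total > 0:
                table[i] = [v * target / total for v in table[i]]
        # Column step: rescale each column to its target total.
        for j, target in enumerate(col_targets):
            total = sum(row[j] for row in table)
            if total > 0:
                for row in table:
                    row[j] *= target / total
        if all(abs(sum(table[i]) - t) <= tol for i, t in enumerate(row_targets)):
            break
    return table

# Row and column targets must share the same grand total (100 here).
fitted = ipf([[1.0, 2.0], [3.0, 4.0]], [40.0, 60.0], [35.0, 65.0])
```

The fitted cells are fractional, which is why a separate integerization step (item 4) is needed before whole households can be assigned to MAZs.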

Key Files:

Directory Structure and File Organization

Working Directory Structure

bay_area/
└── output_2023/
    ├── PUMS_2023_5Year/              # Raw PUMS downloads
    │   ├── households_2023_raw.csv
    │   └── persons_2023_raw.csv
    └── populationsim_working_dir/    # PopulationSim workspace
        ├── data/                     # Input data for synthesis
        │   ├── geo_cross_walk_tm2_updated.csv
        │   ├── seed_households.csv
        │   ├── seed_persons.csv
        │   ├── maz_marginals.csv
        │   ├── taz_marginals.csv
        │   └── county_marginals.csv
        ├── configs/                  # PopulationSim configuration
        │   ├── controls.csv
        │   └── settings.yaml
        └── output/                   # Final synthesis results
            ├── synthetic_households.csv
            ├── synthetic_persons.csv
            ├── summary_COUNTY_1.csv
            ├── summary_COUNTY_2.csv
            └── ... (through COUNTY_9)
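
A quick way to confirm that the synthesis inputs shown in the tree above are in place is sketched below. The relative paths come from the tree; the function and constant names are hypothetical and are not part of tm2_pipeline.py.

```python
from pathlib import Path

# Inputs PopulationSim needs, relative to populationsim_working_dir/.
EXPECTED_INPUTS = [
    "data/geo_cross_walk_tm2_updated.csv",
    "data/seed_households.csv",
    "data/seed_persons.csv",
    "data/maz_marginals.csv",
    "data/taz_marginals.csv",
    "data/county_marginals.csv",
    "configs/controls.csv",
    "configs/settings.yaml",
]

def missing_inputs(working_dir):
    """Return the expected input files that do not exist under working_dir."""
    root = Path(working_dir)
    return [rel for rel in EXPECTED_INPUTS if not (root / rel).is_file()]

if __name__ == "__main__":
    for rel in missing_inputs("output_2023/populationsim_working_dir"):
        print(f"missing: {rel}")
```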

Configuration Files

bay_area/
├── unified_tm2_config.py           # Master configuration
├── tm2_pipeline.py                 # Pipeline orchestrator
└── hh_gq/                         # PopulationSim templates
    ├── controls.csv               # Control variable definitions
    └── settings.yaml              # PopulationSim settings

File Dependencies and Data Flow

Critical Dependencies

  1. Geographic Consistency: All files must use the same geographic definitions
  2. County Mapping: The 1-9 sequential system is used throughout the pipeline
  3. PUMA Definitions: A consistent Bay Area PUMA list (104 PUMAs)
  4. Control Variables: Names in controls.csv must match the columns in the marginal files

Data Transformations

County Code Conversion

FIPS Codes → Sequential IDs
06001 (Alameda) → 4
06013 (Contra Costa) → 5
06041 (Marin) → 9
06055 (Napa) → 7
06075 (San Francisco) → 1
06081 (San Mateo) → 2
06085 (Santa Clara) → 3
06095 (Solano) → 6
06097 (Sonoma) → 8
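
The mapping above can be expressed as a lookup table. The sketch below is illustrative; the names FIPS_TO_COUNTY_ID and county_id are hypothetical, and the zero-padding covers the common case where a CSV reader has parsed the FIPS code as an integer and dropped the leading zero.

```python
# State+county FIPS code -> sequential county ID (values from the table above).
FIPS_TO_COUNTY_ID = {
    "06075": 1,  # San Francisco
    "06081": 2,  # San Mateo
    "06085": 3,  # Santa Clara
    "06001": 4,  # Alameda
    "06013": 5,  # Contra Costa
    "06095": 6,  # Solano
    "06055": 7,  # Napa
    "06097": 8,  # Sonoma
    "06041": 9,  # Marin
}

def county_id(fips):
    """Convert a 5-digit FIPS code (str or int) to the 1-9 county ID.

    Raises KeyError for codes outside the nine Bay Area counties.
    """
    return FIPS_TO_COUNTY_ID[str(fips).zfill(5)]
```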

Income Inflation

2023 ACS Dollars → 2010 Model Dollars
CPI Adjustment Factor: 0.703 (CPI 2010 / CPI 2023 = 218.056 / 310.0)
Example: $100,000 (2023) → $70,300 (2010)
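
The adjustment is a single multiplication, sketched below with the CPI values quoted above. The function name is hypothetical. Note that the unrounded ratio is about 0.7034, so code using it maps $100,000 to roughly $70,341, while the example above uses the factor rounded to 0.703.

```python
CPI_2010 = 218.056
CPI_2023 = 310.0

# Ratio of price levels: multiply 2023 dollars by this to get 2010 dollars.
ADJUSTMENT_FACTOR = CPI_2010 / CPI_2023

def to_2010_dollars(income_2023):
    """Deflate a 2023-dollar amount to 2010 dollars."""
    return income_2023 * ADJUSTMENT_FACTOR

income_2010 = to_2010_dollars(100_000)
```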

Group Quarters Handling

HHGQTYPE Values:
1 = Household population
2 = Institutional group quarters (non-university)
3 = Institutional group quarters (university)
4 = Non-institutional group quarters

Control Expressions:
gq_pop: hhgqtype >= 2 (all GQ)
gq_university: hhgqtype == 3 AND age between 18 and 24
gq_other: hhgqtype == 2 OR hhgqtype == 4
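
The control expressions can be written as plain predicates for testing. This sketch assumes the age range is inclusive at both ends and that each person record exposes hhgqtype and age fields; the function name is hypothetical.

```python
def gq_controls(person):
    """Evaluate the three group quarters control expressions for one person."""
    hhgqtype = person["hhgqtype"]
    age = person["age"]
    return {
        "gq_pop": hhgqtype >= 2,
        "gq_university": hhgqtype == 3 and 18 <= age <= 24,
        "gq_other": hhgqtype in (2, 4),
    }

student = gq_controls({"hhgqtype": 3, "age": 20})
older_student = gq_controls({"hhgqtype": 3, "age": 30})
household_member = gq_controls({"hhgqtype": 1, "age": 40})
```

With the expressions exactly as written, a university GQ resident outside the 18 to 24 range counts toward gq_pop but toward neither gq_university nor gq_other, so the two sub-controls need not sum to gq_pop.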

File Validation and Quality Checks

Automated Checks

Manual Validation

Troubleshooting File Issues

Common File Problems

  1. Missing files: Check path configurations in unified_tm2_config.py
  2. Format errors: Verify CSV structure and data types
  3. Geographic mismatches: Ensure consistent zone definitions
  4. Control total issues: Check marginal calculations

Debug Commands

# Check file existence
python tm2_pipeline.py status

# Validate individual steps
python tm2_pipeline.py crosswalk --force
python tm2_pipeline.py seed --force

# Check file row counts
wc -l output_2023/populationsim_working_dir/data/*.csv
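
Where wc is not available (for example on Windows), the row-count check can be done in Python. This is a minimal sketch; the function name is hypothetical, and the demo writes a throwaway file rather than reading real pipeline output.

```python
import tempfile
from pathlib import Path

def csv_line_counts(data_dir):
    """Return {filename: line count} for every CSV in data_dir.

    Counts include the header line, as `wc -l` does.
    """
    counts = {}
    for path in sorted(Path(data_dir).glob("*.csv")):
        with path.open("rb") as f:
            counts[path.name] = sum(1 for _ in f)
    return counts

# Demo on a temporary directory with one small, made-up CSV.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "seed_households.csv").write_text("hh_id,PUMA\n1,07501\n2,07502\n")
    demo_counts = csv_line_counts(tmp)
```

For the real check, pass output_2023/populationsim_working_dir/data as data_dir.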

Performance and Timing

Typical File Sizes

Processing Times