
Detailed Population Synthesis and Post-Processing Guide

TM2 PopulationSim Synthesis Engine and Output Processing

Document Version: 1.0
Date: December 2024
Author: PopulationSim Bay Area Team


Table of Contents

  1. Overview
  2. Synthesis Engine Architecture
  3. Phase 1: Population Synthesis
  4. Phase 2: Post-Processing and Recoding
  5. Phase 3: Validation and Quality Assurance
  6. Output Specifications
  7. Performance Monitoring and Optimization
  8. Technical Configuration

Overview

The TM2 population synthesis and post-processing system transforms demographic controls and seed population data into a complete synthetic population matching Bay Area demographics at multiple geographic scales. This process employs advanced optimization algorithms to create statistically representative households and persons while maintaining spatial and demographic consistency.

Purpose and Scope

The synthesis and post-processing pipeline serves several critical functions: balancing household and person weights against demographic controls, recoding the raw synthesis outputs into TM2 format, and validating the results against the control totals. These functions are carried out by the components listed below.

Key Components

The system consists of three primary phases:

  1. Synthesis Engine (run_populationsim_synthesis.py): Core PopulationSim algorithm execution
  2. Post-Processing (postprocess_recode.py): Output formatting and geographic recoding
  3. Validation (run_all_summaries.py): Quality assurance and performance analysis

Workflow Architecture

Control Data + Seed Population → Synthesis Engine → Raw Synthetic Population
                                                            ↓
Final Outputs ← Validation & QA ← Post-Processing & Geographic Recoding

Synthesis Engine Architecture

PopulationSim Algorithm Framework

The synthesis engine employs a hierarchical balancing approach using Iterative Proportional Fitting (IPF) with integer optimization to create synthetic populations that match demographic controls across multiple geographic scales.
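
As a simplified illustration of the iterative proportional fitting step (not the PopulationSim implementation itself, which balances dozens of controls across several geographies simultaneously), the sketch below adjusts household weights until weighted totals match two hypothetical control totals; the column names and values are assumptions for illustration only:

import numpy as np
import pandas as pd

# Hypothetical seed households with initial expansion weights
seed = pd.DataFrame({
    'hh_size_1': [1, 0, 0, 1],      # incidence: household has 1 person
    'hh_size_2p': [0, 1, 1, 0],     # incidence: household has 2+ persons
    'weight': [10.0, 10.0, 10.0, 10.0]
})
controls = {'hh_size_1': 30.0, 'hh_size_2p': 15.0}   # target totals

for iteration in range(100):
    max_rel_gap = 0.0
    for control, target in controls.items():
        current = (seed['weight'] * seed[control]).sum()
        factor = target / current
        # Scale only the households that contribute to this control
        seed.loc[seed[control] == 1, 'weight'] *= factor
        max_rel_gap = max(max_rel_gap, abs(factor - 1.0))
    if max_rel_gap < 1e-6:           # stop once all controls are matched
        break

print(seed['weight'])                # fractional weights, later integerized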

Core Algorithm Components

1. Seed Population Expansion

2. Hierarchical Balancing

3. Integer Optimization

Mathematical Foundation

Objective Function:

Minimize:   Σ_c importance_c * (synthetic_total_c - control_total_c)²
            where synthetic_total_c = Σ_h w_h * incidence_h,c
Subject to:
- Weight constraints:      w_h ≥ 0 for every seed household h
- Geographic constraints:  Σ_h w_h within each geography = control household total for that geography
- Demographic constraints: Σ_h w_h * incidence_h,c = control_total_c for each demographic category c

Convergence Criteria:
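
The tolerances used in this configuration are listed under Convergence Criteria Monitoring later in this guide (5% relative, ±20 absolute, 0.5 integer). A minimal sketch, assuming synthesized and control totals are held in aligned pandas Series, of how such a check can be expressed:

import pandas as pd

def controls_converged(synthetic, control,
                       relative_tolerance=0.05, absolute_tolerance=20.0):
    """Return True when every control is within the relative OR absolute tolerance."""
    gap = (synthetic - control).abs()
    rel_ok = gap <= relative_tolerance * control.abs()
    abs_ok = gap <= absolute_tolerance
    return bool((rel_ok | abs_ok).all())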


Phase 1: Population Synthesis

Implementation: run_populationsim_synthesis.py

The synthesis phase transforms control data and seed population into a balanced synthetic population through sophisticated optimization algorithms.

Step 1: Input Data Preparation

Seed Population Loading:

# Seed data inputs
households: seed_households.csv    # ~96,000 Bay Area household records
persons: seed_persons.csv         # ~230,000 Bay Area person records
crosswalk: geo_cross_walk_tm2_maz.csv # Geographic relationships

Control Data Integration:

# Control totals by geography
MAZ_NODE: maz_marginals_hhgq.csv    # ~39,586 MAZ zones
TAZ_NODE: taz_marginals_hhgq.csv    # ~4,734 TAZ zones  
COUNTY: county_marginals.csv       # 9 Bay Area counties

Data Validation:
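
Exact checks vary by run, but a minimal sketch of pre-synthesis validation, assuming the file and column names used elsewhere in this guide (SERIALNO, WGTP, MAZ_NODE), might look like:

import pandas as pd

households = pd.read_csv("seed_households.csv")
persons = pd.read_csv("seed_persons.csv")
crosswalk = pd.read_csv("geo_cross_walk_tm2_maz.csv")

# Every person must link to a seed household
assert persons['SERIALNO'].isin(households['SERIALNO']).all(), "orphan person records"

# Household weights must be usable for seed expansion
assert (households['WGTP'] > 0).all(), "non-positive household weights"

# Every MAZ in the crosswalk must appear exactly once
assert not crosswalk.duplicated(subset='MAZ_NODE').any(), "duplicate MAZ rows in crosswalk"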

Step 2: Control Specification Processing

Control Categories (32 demographic dimensions):

Household Controls:

Person Controls:

Group Quarters Controls:

Step 3: Geographic Balancing Hierarchy

Multi-Level Optimization:

PUMA (Seed Geography)
  ↓
COUNTY (Person Occupation Controls)
  ↓  
TAZ_NODE (Household & Person Demographic Controls)
  ↓
MAZ_NODE (Total Households + Group Quarters)

Balancing Algorithm:

  1. Initial Weight Assignment: Assign base weights from seed expansion
  2. County-Level Balancing: Adjust weights to match county occupation controls
  3. TAZ-Level Balancing: Balance household size, income, age, and worker categories
  4. MAZ-Level Balancing: Final adjustment for total household counts and group quarters

Step 4: Synthesis Execution Monitoring

Progress Tracking:

# Enhanced logging system
[2024-12-28 10:15:30] [POPSIM] STEP 1/8: input_pre_processor
[2024-12-28 10:16:45] [POPSIM] STEP 2/8: setup_data_structures  
[2024-12-28 10:18:20] [POPSIM] STEP 3/8: initial_seed_balancing
[2024-12-28 10:35:10] [POPSIM] STEP 4/8: meta_control_factoring
[2024-12-28 10:36:45] [POPSIM] STEP 5/8: final_seed_balancing
[2024-12-28 11:15:30] [POPSIM] STEP 6/8: integerize_final_seed_weights
[2024-12-28 11:45:20] [POPSIM] STEP 7/8: sub_balancing
[2024-12-28 12:10:15] [POPSIM] STEP 8/8: expand_households

Performance Metrics:

Step 5: Integer Optimization

Weight Integerization Process:
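
PopulationSim solves integerization as a constrained optimization problem; the sketch below shows only the underlying idea, rounding fractional weights to integers while preserving totals in expectation through residual-proportional rounding. This is a conceptual sketch, not the production algorithm:

import numpy as np

def integerize_weights(float_weights, seed=None):
    """Round fractional weights to integers, keeping expected totals unchanged."""
    rng = np.random.default_rng(seed)
    base = np.floor(float_weights)
    residual = float_weights - base
    # Round each residual up with probability equal to the residual itself
    bump = rng.random(len(float_weights)) < residual
    return (base + bump).astype(int)

weights = np.array([2.3, 0.7, 1.5, 4.1])
print(integerize_weights(weights, seed=42))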

Output Generation:

synthetic_households.csv  # ~1.4M household records
synthetic_persons.csv     # ~3.2M person records
summary_TAZ_NODE.csv     # TAZ-level control vs. result comparison
summary_COUNTY_*.csv     # County-level validation summaries

Phase 2: Post-Processing and Recoding

Implementation: postprocess_recode.py

Post-processing transforms raw PopulationSim outputs into TM2-compatible format with proper geographic coding and demographic recoding.

Step 1: Data Loading and Preparation

Input Integration:

# Load synthesis outputs
households_df = pd.read_csv("synthetic_households.csv")      # Raw household data
persons_df = pd.read_csv("synthetic_persons.csv")          # Raw person data
crosswalk_df = pd.read_csv("geo_cross_walk_tm2_maz.csv")       # Geographic relationships

Unique Identifier Generation:

# Create TM2-compatible unique identifiers
households_df['unique_hh_id'] = households_df['SERIALNO']
persons_df['unique_per_id'] = (persons_df['SERIALNO'].astype(str) + 
                              '_' + persons_df['SPORDER'].astype(str))

Step 2: Geographic Recoding

County Assignment Enhancement:

# Add county information for Group Quarters support
enhanced_households = pd.merge(
    households_df,
    crosswalk_df[['MAZ_NODE', 'COUNTY']].drop_duplicates(),
    on='MAZ_NODE',
    how='left'
)

Geographic Field Standardization:
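
A minimal sketch of this step, assuming the households table carries MAZ_NODE and TAZ_NODE from synthesis and COUNTY from the crosswalk merge above:

# Flag households that failed to match the crosswalk before casting
unmatched = enhanced_households['COUNTY'].isna().sum()
if unmatched > 0:
    print(f"WARNING: {unmatched} households missing county assignment")

# Cast geography identifiers to consistent integer codes (-9 = missing)
for geo_col in ['MAZ_NODE', 'TAZ_NODE', 'COUNTY']:
    enhanced_households[geo_col] = enhanced_households[geo_col].fillna(-9).astype('int64')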

Step 3: Demographic Recoding

Household Variable Transformation:

# TM2-specific household fields
household_columns = {
    'unique_hh_id': 'HHID',          # Unique household identifier
    'TAZ_NODE': 'TAZ_NODE',          # TAZ assignment
    'MAZ_NODE': 'MAZ_NODE',          # MAZ assignment  
    'COUNTY': 'MTCCountyID',         # County 1-9 ID
    'hh_income_2010': 'HHINCADJ',    # 2010-adjusted income
    'hh_workers_from_esr': 'NWRKRS_ESR',  # Worker count
    'VEH': 'VEH',                    # Vehicle availability
    'NP': 'NP',                      # Number of persons
    'HHT': 'HHT',                    # Household type
    'BLD': 'BLD',                    # Building type
    'TEN': 'TEN',                    # Tenure (own/rent)
    'TYPEHUGQ': 'TYPE'               # Housing unit/group quarters type
}

Person Variable Transformation:

# TM2-specific person fields
person_columns = {
    'unique_hh_id': 'HHID',          # Household link
    'unique_per_id': 'PERID',        # Unique person identifier
    'AGEP': 'AGEP',                  # Age
    'SEX': 'SEX',                    # Gender
    'SCHL': 'SCHL',                  # Educational attainment
    'occupation': 'OCCP',            # Occupation category
    'WKHP': 'WKHP',                  # Hours worked per week
    'WKW': 'WKW',                    # Weeks worked per year
    'employed': 'EMPLOYED',          # Employment status
    'ESR': 'ESR',                    # Employment status recode
    'SCHG': 'SCHG',                  # School grade attendance
    'hhgqtype': 'hhgqtype',          # Household/group quarters type
    'person_type': 'person_type'     # Employment-based person type
}
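
Applying the two mappings is then a straightforward rename and column selection; a brief sketch, using households_df and persons_df as carried through the steps above:

# Rename source fields to TM2 names and keep only the mapped columns
households_out = (households_df.rename(columns=household_columns)
                               [list(household_columns.values())])
persons_out = (persons_df.rename(columns=person_columns)
                         [list(person_columns.values())])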

Step 4: Income and Poverty Calculations

Income Adjustments:

# Convert to 2010 dollars for TM2 compatibility
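# CPI_2023_TO_2010 is the CPI-based deflator from 2023 to 2010 dollars
# (assumed to be defined upstream in the pipeline configuration)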
households_df['hh_income_2010'] = households_df['hh_income_2023'] * CPI_2023_TO_2010

# Create income categories
households_df['hinccat1'] = pd.cut(
    households_df['hh_income_2010'],
    bins=[0, 30000, 60000, 100000, 150000, float('inf')],
    labels=[1, 2, 3, 4, 5],
    include_lowest=True   # keep zero-income households in the lowest category
)

Poverty Level Calculations:

# Federal Poverty Level calculations by household size
poverty_thresholds_2023 = {1: 14580, 2: 19720, 3: 24860, 4: 30000, 
                           5: 35140, 6: 40280, 7: 45420, 8: 50560}

households_df['poverty_income_2023d'] = households_df.apply(
    lambda row: poverty_thresholds_2023.get(min(row['NP'], 8), 50560), axis=1
)

households_df['pct_of_poverty'] = (households_df['hh_income_2023'] / 
                                  households_df['poverty_income_2023d'] * 100)

Step 5: Data Quality and Formatting

Missing Value Handling:

# Replace NaN values with -9 (standard missing value code)
households_df = households_df.fillna(-9)
persons_df = persons_df.fillna(-9)

Data Type Optimization:

# Downcast float columns that contain only whole numbers to int32 for memory efficiency
for col in households_df.select_dtypes(include=['float64']).columns:
    values = households_df[col]
    if (values % 1 == 0).all() and values.abs().max() < 2147483647:
        households_df[col] = values.astype('int32')

Output Generation:

synthetic_households_recoded.csv  # TM2-formatted household data
synthetic_persons_recoded.csv     # TM2-formatted person data
summary_melt.csv                  # Control vs. result comparison

Phase 3: Validation and Quality Assurance

Implementation: run_all_summaries.py

The validation phase provides comprehensive quality assurance through statistical analysis, comparative validation, and performance assessment.

Core Validation Categories

1. Performance Analysis

2. Dataset Comparison

3. Quality Assurance

4. Interactive Visualization

Statistical Validation Metrics

Control Matching Accuracy:

# Calculate percentage error for each control
pct_error = ((synthetic_total - control_total) / control_total) * 100

# Summary statistics
mean_absolute_error = abs(pct_error).mean()
max_absolute_error = abs(pct_error).max()
controls_within_5pct = (abs(pct_error) <= 5.0).sum() / len(pct_error) * 100

Target Performance Standards:

Geographic Validation

Spatial Distribution Assessment:

# TAZ-level validation
taz_validation = synthetic_summary.groupby('TAZ_NODE').agg({
    'total_households': 'sum',
    'total_persons': 'sum',
    'mean_household_size': 'mean',
    'median_income': 'median'
})

# Compare against control expectations
spatial_accuracy = compare_distributions(taz_validation, taz_controls)

Cross-Geography Consistency:
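
One such check, sketched below assuming household records keyed by MAZ_NODE and TAZ_NODE and the crosswalk loaded earlier, verifies that MAZ-level household totals roll up exactly to the TAZ-level totals:

# Households summed over the MAZs inside each TAZ must equal the TAZ-level totals
maz_to_taz = crosswalk_df[['MAZ_NODE', 'TAZ_NODE']].drop_duplicates()

maz_totals = households_df.groupby('MAZ_NODE').size().rename('hh').reset_index()
rollup = (maz_totals.merge(maz_to_taz, on='MAZ_NODE')
                    .groupby('TAZ_NODE')['hh'].sum())
taz_totals = households_df.groupby('TAZ_NODE').size()

mismatch = rollup.reindex(taz_totals.index, fill_value=0) != taz_totals
assert not mismatch.any(), "MAZ household totals do not roll up to TAZ totals"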


Output Specifications

Primary Synthesis Outputs

1. Synthetic Households: synthetic_households_recoded.csv

File Characteristics:

Schema:

| Column | Type | Description | Example |
|--------|------|-------------|---------|
| HHID | Integer | Unique household identifier | 1234567 |
| TAZ_NODE | Integer | TAZ assignment | 1001 |
| MAZ_NODE | Integer | MAZ assignment | 12345 |
| MTCCountyID | Integer | County ID (1-9) | 4 |
| HHINCADJ | Integer | Household income (2010$) | 75000 |
| NWRKRS_ESR | Integer | Number of workers | 2 |
| VEH | Integer | Vehicle availability | 2 |
| NP | Integer | Number of persons | 3 |
| HHT | Integer | Household type | 1 |
| BLD | Integer | Building type | 2 |
| TEN | Integer | Tenure (own/rent) | 1 |
| TYPE | Integer | Housing unit/GQ type | 1 |

2. Synthetic Persons: synthetic_persons_recoded.csv

File Characteristics:

Schema:

| Column | Type | Description | Example |
|--------|------|-------------|---------|
| HHID | Integer | Household identifier | 1234567 |
| PERID | String | Unique person identifier | "1234567_1" |
| AGEP | Integer | Age in years | 34 |
| SEX | Integer | Gender (1=Male, 2=Female) | 2 |
| SCHL | Integer | Educational attainment | 21 |
| OCCP | Integer | Occupation category | 1 |
| WKHP | Integer | Hours worked per week | 40 |
| WKW | Integer | Weeks worked per year | 50 |
| EMPLOYED | Integer | Employment status | 1 |
| ESR | Integer | Employment status recode | 1 |
| SCHG | Integer | School grade attendance | -9 |

Validation and Summary Outputs

3. Control Summary: summary_melt.csv

Purpose: Comprehensive comparison of synthesis results against control totals

Schema:

| Column | Type | Description | Example |
|--------|------|-------------|---------|
| geography | String | Geographic level | "TAZ_NODE" |
| id | Integer | Geographic identifier | 1001 |
| variable | String | Control category | "hh_size_2" |
| control | Float | Control total | 150.0 |
| result | Float | Synthesis result | 148.5 |
| diff | Float | Difference (result - control) | -1.5 |
| pct_diff | Float | Percentage difference | -1.0 |
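
A brief sketch of how this file can be used to flag controls that miss the 5% accuracy target (column names as in the schema above):

import pandas as pd

summary = pd.read_csv("summary_melt.csv")

# Controls whose synthesized totals deviate from the control by more than 5%
flagged = summary[summary['pct_diff'].abs() > 5.0]
print(flagged.sort_values('pct_diff', key=abs, ascending=False)
             [['geography', 'id', 'variable', 'control', 'result', 'pct_diff']]
             .head(20))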

4. Performance Reports

TAZ-Level Summary: final_summary_TAZ_NODE.csv

County-Level Summaries: final_summary_COUNTY_[1-9].csv


Performance Monitoring and Optimization

Synthesis Performance Characteristics

Processing Time Analysis

Typical Processing Times (Bay Area full synthesis):

Data Loading and Preparation:        5-10 minutes
Initial Seed Balancing:              10-15 minutes  
Meta Control Factoring:              2-5 minutes
Final Seed Balancing:                20-30 minutes
Weight Integerization:               15-25 minutes
Sub-Balancing:                       10-15 minutes
Household Expansion:                 5-10 minutes
Total Synthesis Time:                70-110 minutes

Memory Usage Patterns:

Convergence Monitoring

Real-Time Progress Tracking:

# Heartbeat logging every 5 minutes
[2024-12-28 10:45:30] [HEARTBEAT] PopulationSim still running... 10:45:30
[2024-12-28 10:45:30] [HEARTBEAT] Current step: integerize_final_seed_weights
[2024-12-28 10:45:30] [HEARTBEAT] Memory usage: 6,847.3 MB
[2024-12-28 10:45:30] [HEARTBEAT] Total elapsed: 45.5 minutes
[2024-12-28 10:45:30] [HEARTBEAT] Status: Integerizing final seed weights (this can take 30+ minutes)
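
A minimal sketch of how such a heartbeat can be produced from a background thread while synthesis runs; psutil for memory reporting and the 5-minute interval are assumptions, and the production logger may differ:

import threading
import time
from datetime import datetime

import psutil

def heartbeat(get_current_step, interval_seconds=300):
    """Log progress every interval_seconds until the main process exits."""
    start = time.time()
    while True:
        time.sleep(interval_seconds)
        now = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
        mem_mb = psutil.Process().memory_info().rss / 1024 ** 2
        elapsed_min = (time.time() - start) / 60
        print(f"[{now}] [HEARTBEAT] PopulationSim still running...")
        print(f"[{now}] [HEARTBEAT] Current step: {get_current_step()}")
        print(f"[{now}] [HEARTBEAT] Memory usage: {mem_mb:,.1f} MB")
        print(f"[{now}] [HEARTBEAT] Total elapsed: {elapsed_min:.1f} minutes")

# Run as a daemon thread so it stops when the synthesis process exits
threading.Thread(target=heartbeat, args=(lambda: "running",), daemon=True).start()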

Convergence Criteria Monitoring:

# Algorithm convergence tracking
relative_tolerance = 0.05        # 5% relative error tolerance
absolute_tolerance = 20.0        # ±20 unit absolute tolerance  
integer_tolerance = 0.5          # 0.5 unit integer tolerance
max_iterations = 500             # Maximum optimization iterations

Optimization Strategies

Algorithm Configuration

Simultaneous vs. Sequential Balancing:

# Enhanced performance configuration
MAX_BALANCE_ITERATIONS_SIMULTANEOUS: 500    # Faster convergence
MAX_BALANCE_ITERATIONS_SEQUENTIAL: 100000   # Fallback for difficult cases
USE_SIMUL_INTEGERIZER: True                 # Parallel integerization
SUB_BALANCE_WITH_FLOAT_SEED_WEIGHTS: True   # Precision optimization

Memory Management:

# Optimize memory usage
GROUP_BY_INCIDENCE_SIGNATURE: False        # Reduce memory for large datasets
INTEGERIZE_WITH_BACKSTOPPED_CONTROLS: True # Stable convergence
max_expansion_factor: 50                   # Control extreme weights

Performance Tuning

Control Importance Weighting:

# Hierarchical importance levels
MAZ household totals:     100000  (Highest priority)
TAZ person demographics:  100000  (Critical for accuracy)
TAZ household categories: 10000   (Important for distribution) 
County occupation:        10000   (Regional consistency)

Hardware Optimization:


Technical Configuration

Software Dependencies

Core PopulationSim Framework

Data Processing Libraries

Optimization Solvers

Configuration Management

Settings Files

# Primary configuration: settings.yaml
geographies: [COUNTY, PUMA, TAZ_NODE, MAZ_NODE]
seed_geography: PUMA
household_weight_col: WGTP
household_id_col: unique_hh_id
total_hh_control: numhh_gq

Control Specification

# Control definitions: controls.csv
target,geography,seed_table,importance,control_field,expression
numhh_gq,MAZ_NODE,households,100000,numhh_gq,households.unique_hh_id > 0
hh_size_1,TAZ_NODE,households,10000,hh_size_1_gq,households.NP == 1
pers_age_00_19,TAZ_NODE,persons,100000,pers_age_00_19,(persons.AGEP >= 0) & (persons.AGEP <= 19)
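
Each expression is evaluated against the named seed table to produce an incidence column that the balancer weights against the control total. As an illustration of what the age control above means in plain pandas (a sketch of the concept, not the PopulationSim internals):

# Incidence of the pers_age_00_19 control for each seed person
incidence = (persons['AGEP'] >= 0) & (persons['AGEP'] <= 19)

# Person-level incidence is aggregated to the household level before balancing,
# so the weighted sum of this column is what gets matched to the control total
hh_incidence = incidence.groupby(persons['SERIALNO']).sum()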

File Path Management

# Unified configuration system
from unified_tm2_config import UnifiedTM2Config

config = UnifiedTM2Config()
working_dir = config.POPSIM_WORKING_DIR
data_dir = config.POPSIM_DATA_DIR
output_dir = config.PRIMARY_OUTPUT_DIR

Quality Control Parameters

Validation Thresholds

# Performance acceptance criteria
CONTROL_ACCURACY_THRESHOLD = 0.05      # 5% maximum deviation
CONVERGENCE_TOLERANCE = 0.01            # 1% convergence requirement
MAX_PROCESSING_TIME = 7200              # 2 hour timeout
MIN_HOUSEHOLDS_PER_TAZ = 1              # Minimum viable population

Error Handling

# Robust error management
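# ConvergenceError and ValidationError are assumed to be project-defined exceptions;
# MemoryError is the Python built-in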
try:
    pipeline.run(models=steps, resume_after=resume_after)
    validate_synthesis_results()
    generate_performance_reports()
except ConvergenceError:
    handle_convergence_failure()
except MemoryError:
    optimize_memory_usage()
except ValidationError:
    generate_diagnostic_reports()

Conclusion

The TM2 population synthesis and post-processing system represents a sophisticated demographic modeling framework that transforms raw demographic controls into a complete, statistically accurate synthetic population. Through its three-phase approach combining advanced optimization algorithms, comprehensive post-processing, and rigorous validation, the system ensures high-quality synthetic data suitable for transportation planning and policy analysis.

Key Achievements:

Technical Innovations:

Future Enhancements:

This comprehensive system provides the demographic foundation essential for accurate transportation modeling while maintaining the flexibility to adapt to evolving data sources and modeling requirements.