Detailed Population Synthesis and Post-Processing Guide
TM2 PopulationSim Synthesis Engine and Output Processing
Document Version: 1.0
Date: December 2024
Author: PopulationSim Bay Area Team
Table of Contents
- Overview
- Synthesis Engine Architecture
- Phase 1: Population Synthesis
- Phase 2: Post-Processing and Recoding
- Phase 3: Validation and Quality Assurance
- Output Specifications
- Performance Monitoring and Optimization
- Technical Configuration
Overview
The TM2 population synthesis and post-processing system transforms demographic controls and seed population data into a complete synthetic population matching Bay Area demographics at multiple geographic scales. This process employs advanced optimization algorithms to create statistically representative households and persons while maintaining spatial and demographic consistency.
Purpose and Scope
The synthesis and post-processing pipeline serves several critical functions:
- Population Generation: Creates synthetic households and persons matching demographic controls
- Geographic Consistency: Ensures population distribution aligns with TAZ, MAZ, and county-level controls
- Demographic Accuracy: Maintains statistical fidelity to American Community Survey and Census data
- Model Integration: Produces outputs formatted for Transportation Model 2 (TM2) requirements
- Quality Assurance: Validates synthetic population against control totals and demographic patterns
Key Components
The system consists of three primary phases:
- Synthesis Engine (
run_populationsim_synthesis.py): Core PopulationSim algorithm execution - Post-Processing (
postprocess_recode.py): Output formatting and geographic recoding - Validation (
run_all_summaries.py): Quality assurance and performance analysis
Workflow Architecture
Control Data + Seed Population → Synthesis Engine → Raw Synthetic Population
↓
Final Outputs ← Validation & QA ← Post-Processing ← Geographic Recoding
Synthesis Engine Architecture
PopulationSim Algorithm Framework
The synthesis engine employs a hierarchical balancing approach using Iterative Proportional Fitting (IPF) with integer optimization to create synthetic populations that match demographic controls across multiple geographic scales.
Core Algorithm Components
1. Seed Population Expansion
- PUMA-Level Replication: Expands ACS PUMS households/persons to match regional totals
- Weight Assignment: Assigns initial expansion weights based on demographic similarity
- Geographic Distribution: Distributes seed records across TAZ and MAZ geographies
2. Hierarchical Balancing
- Multi-Geography Optimization: Simultaneously balances controls at County, TAZ, and MAZ levels
- Iterative Proportional Fitting: Adjusts weights to minimize deviations from control totals
- Convergence Monitoring: Tracks optimization progress with tolerance thresholds
3. Integer Optimization
- Weight Integerization: Converts fractional weights to integer household counts
- Replication Logic: Creates multiple instances of households with weights > 1
- Constraint Satisfaction: Maintains demographic control adherence during integerization
Mathematical Foundation
Objective Function:
Minimize: Σ (w_i * (synthetic_total - control_total)²)
Subject to:
- Weight constraints: w_i ≥ 0
- Geographic constraints: Σ households_geo = control_geo
- Demographic constraints: Σ demographic_category = control_category
Convergence Criteria:
- Relative Tolerance: 5% deviation from control totals
- Absolute Tolerance: ±20 households/persons per control
- Maximum Iterations: 500 simultaneous, 100,000 sequential
Phase 1: Population Synthesis
Implementation: run_populationsim_synthesis.py
The synthesis phase transforms control data and seed population into a balanced synthetic population through sophisticated optimization algorithms.
Step 1: Input Data Preparation
Seed Population Loading:
# Seed data inputs
households: seed_households.csv # ~96,000 Bay Area household records
persons: seed_persons.csv # ~230,000 Bay Area person records
crosswalk: geo_cross_walk_tm2_maz.csv # Geographic relationships
Control Data Integration:
# Control totals by geography
MAZ_NODE: maz_marginals_hhgq.csv # ~39,586 MAZ zones
TAZ_NODE: taz_marginals_hhgq.csv # ~4,734 TAZ zones
COUNTY: county_marginals.csv # 9 Bay Area counties
Data Validation:
- Schema Consistency: Verify column names and data types
- Geographic Completeness: Ensure all zones have control totals
- Demographic Consistency: Validate control category definitions
Step 2: Control Specification Processing
Control Categories (32 demographic dimensions):
Household Controls:
- Size Categories: 1, 2, 3, 4, 5, 6+ persons
- Worker Categories: 0, 1, 2, 3+ workers
- Income Categories: <$20k, $20k-45k, $45k-60k, $60k-75k, $75k-100k, $100k-150k, $150k-200k, $200k+
- Children: Households with/without children under 18
Person Controls:
- Age Categories: 0-19, 20-34, 35-64, 65+ years
- Occupation Categories: Management, Professional, Services, Retail, Manual/Military
Group Quarters Controls:
- University: College dormitories and group housing
- Non-institutional: Other group quarters facilities
Step 3: Geographic Balancing Hierarchy
Multi-Level Optimization:
PUMA (Seed Geography)
↓
COUNTY (Person Occupation Controls)
↓
TAZ_NODE (Household & Person Demographic Controls)
↓
MAZ_NODE (Total Households + Group Quarters)
Balancing Algorithm:
- Initial Weight Assignment: Assign base weights from seed expansion
- County-Level Balancing: Adjust weights to match county occupation controls
- TAZ-Level Balancing: Balance household size, income, age, and worker categories
- MAZ-Level Balancing: Final adjustment for total household counts and group quarters
Step 4: Synthesis Execution Monitoring
Progress Tracking:
# Enhanced logging system
[2024-12-28 10:15:30] [POPSIM] STEP 1/8: input_pre_processor
[2024-12-28 10:16:45] [POPSIM] STEP 2/8: setup_data_structures
[2024-12-28 10:18:20] [POPSIM] STEP 3/8: initial_seed_balancing
[2024-12-28 10:35:10] [POPSIM] STEP 4/8: meta_control_factoring
[2024-12-28 10:36:45] [POPSIM] STEP 5/8: final_seed_balancing
[2024-12-28 11:15:30] [POPSIM] STEP 6/8: integerize_final_seed_weights
[2024-12-28 11:45:20] [POPSIM] STEP 7/8: sub_balancing
[2024-12-28 12:10:15] [POPSIM] STEP 8/8: expand_households
Performance Metrics:
- Memory Usage: Monitored in real-time (typically 2-8 GB)
- Convergence Rate: Tracked per iteration with tolerance checks
- Processing Time: 45-90 minutes for complete Bay Area synthesis
Step 5: Integer Optimization
Weight Integerization Process:
- Fractional to Integer: Convert optimal fractional weights to integer household counts
- Household Replication: Create multiple instances for households with weights > 1
- Unique ID Assignment: Generate unique household and person identifiers
- Geographic Assignment: Maintain TAZ/MAZ geographic assignments
Output Generation:
synthetic_households.csv # ~1.4M household records
synthetic_persons.csv # ~3.2M person records
summary_TAZ_NODE.csv # TAZ-level control vs. result comparison
summary_COUNTY_*.csv # County-level validation summaries
Phase 2: Post-Processing and Recoding
Implementation: postprocess_recode.py
Post-processing transforms raw PopulationSim outputs into TM2-compatible format with proper geographic coding and demographic recoding.
Step 1: Data Loading and Preparation
Input Integration:
# Load synthesis outputs
households_df = pd.read_csv("synthetic_households.csv") # Raw household data
persons_df = pd.read_csv("synthetic_persons.csv") # Raw person data
crosswalk_df = pd.read_csv("geo_cross_walk_tm2_maz.csv") # Geographic relationships
Unique Identifier Generation:
# Create TM2-compatible unique identifiers
households_df['unique_hh_id'] = households_df['SERIALNO']
persons_df['unique_per_id'] = (persons_df['SERIALNO'].astype(str) +
'_' + persons_df['SPORDER'].astype(str))
Step 2: Geographic Recoding
County Assignment Enhancement:
# Add county information for Group Quarters support
enhanced_households = pd.merge(
households_df,
crosswalk_df[['MAZ_NODE', 'COUNTY']].drop_duplicates(),
on='MAZ_NODE',
how='left'
)
Geographic Field Standardization:
- TAZ_NODE: Keep original TAZ identifier for crosswalk consistency
- MAZ_NODE: Maintain MAZ identifier for spatial precision
- COUNTY: Add 1-9 county identifier for county-level controls
- PUMA: Preserve PUMA assignment for validation
Step 3: Demographic Recoding
Household Variable Transformation:
# TM2-specific household fields
household_columns = {
'unique_hh_id': 'HHID', # Unique household identifier
'TAZ_NODE': 'TAZ_NODE', # TAZ assignment
'MAZ_NODE': 'MAZ_NODE', # MAZ assignment
'COUNTY': 'MTCCountyID', # County 1-9 ID
'hh_income_2010': 'HHINCADJ', # 2010-adjusted income
'hh_workers_from_esr': 'NWRKRS_ESR', # Worker count
'VEH': 'VEH', # Vehicle availability
'NP': 'NP', # Number of persons
'HHT': 'HHT', # Household type
'BLD': 'BLD', # Building type
'TEN': 'TEN', # Tenure (own/rent)
'TYPEHUGQ': 'TYPE' # Housing unit/group quarters type
}
Person Variable Transformation:
# TM2-specific person fields
person_columns = {
'unique_hh_id': 'HHID', # Household link
'unique_per_id': 'PERID', # Unique person identifier
'AGEP': 'AGEP', # Age
'SEX': 'SEX', # Gender
'SCHL': 'SCHL', # Educational attainment
'occupation': 'OCCP', # Occupation category
'WKHP': 'WKHP', # Hours worked per week
'WKW': 'WKW', # Weeks worked per year
'employed': 'EMPLOYED', # Employment status
'ESR': 'ESR', # Employment status recode
'SCHG': 'SCHG', # School grade attendance
'hhgqtype': 'hhgqtype', # Household/group quarters type
'person_type': 'person_type' # Employment-based person type
}
Step 4: Income and Poverty Calculations
Income Adjustments:
# Convert to 2010 dollars for TM2 compatibility
households_df['hh_income_2010'] = households_df['hh_income_2023'] * CPI_2023_TO_2010
# Create income categories
households_df['hinccat1'] = pd.cut(
households_df['hh_income_2010'],
bins=[0, 30000, 60000, 100000, 150000, float('inf')],
labels=[1, 2, 3, 4, 5]
)
Poverty Level Calculations:
# Federal Poverty Level calculations by household size
poverty_thresholds_2023 = {1: 14580, 2: 19720, 3: 24860, 4: 30000,
5: 35140, 6: 40280, 7: 45420, 8: 50560}
households_df['poverty_income_2023d'] = households_df.apply(
lambda row: poverty_thresholds_2023.get(min(row['NP'], 8), 50560), axis=1
)
households_df['pct_of_poverty'] = (households_df['hh_income_2023'] /
households_df['poverty_income_2023d'] * 100)
Step 5: Data Quality and Formatting
Missing Value Handling:
# Replace NaN values with -9 (standard missing value code)
households_df = households_df.fillna(-9)
persons_df = persons_df.fillna(-9)
Data Type Optimization:
# Downcast to integers where possible for memory efficiency
for col in households_df.select_dtypes(include=['float64']):
if households_df[col].min() >= 0 and households_df[col].max() < 2147483647:
households_df[col] = households_df[col].astype('int32')
Output Generation:
synthetic_households_recoded.csv # TM2-formatted household data
synthetic_persons_recoded.csv # TM2-formatted person data
summary_melt.csv # Control vs. result comparison
Phase 3: Validation and Quality Assurance
Implementation: run_all_summaries.py
The validation phase provides comprehensive quality assurance through statistical analysis, comparative validation, and performance assessment.
Core Validation Categories
1. Performance Analysis
- Control Matching: Evaluate synthesis accuracy against demographic controls
- Geographic Distribution: Assess population distribution across TAZ/MAZ zones
- Convergence Assessment: Analyze optimization algorithm performance
2. Dataset Comparison
- Historical Validation: Compare against previous synthesis cycles
- Census Comparison: Validate against ACS and decennial Census data
- Employment Analysis: Assess worker occupation distributions
3. Quality Assurance
- Data Integrity: Check for missing values, invalid codes, and range violations
- Geographic Consistency: Verify household-person geographic alignment
- Demographic Plausibility: Assess realistic household and person characteristics
4. Interactive Visualization
- Tableau Preparation: Format data for external visualization tools
- Dashboard Generation: Create interactive performance dashboards
- Mapping Outputs: Generate geographic distribution visualizations
Statistical Validation Metrics
Control Matching Accuracy:
# Calculate percentage error for each control
pct_error = ((synthetic_total - control_total) / control_total) * 100
# Summary statistics
mean_absolute_error = abs(pct_error).mean()
max_absolute_error = abs(pct_error).max()
controls_within_5pct = (abs(pct_error) <= 5.0).sum() / len(pct_error) * 100
Target Performance Standards:
- MAZ Household Totals: 95% within ±5% of control totals
- TAZ Demographic Categories: 90% within ±10% of control totals
- County Occupation Categories: 95% within ±5% of control totals
- Overall Population Total: Within ±1% of regional ACS estimate
Geographic Validation
Spatial Distribution Assessment:
# TAZ-level validation
taz_validation = synthetic_summary.groupby('TAZ_NODE').agg({
'total_households': 'sum',
'total_persons': 'sum',
'mean_household_size': 'mean',
'median_income': 'median'
})
# Compare against control expectations
spatial_accuracy = compare_distributions(taz_validation, taz_controls)
Cross-Geography Consistency:
- MAZ-TAZ Aggregation: Verify MAZ totals aggregate correctly to TAZ level
- TAZ-County Aggregation: Confirm TAZ totals aggregate to county level
- PUMA Consistency: Validate synthesis maintains PUMA seed geography relationships
Output Specifications
Primary Synthesis Outputs
1. Synthetic Households: synthetic_households_recoded.csv
File Characteristics:
- Record Count: ~1.4 million households
- Geographic Coverage: All Bay Area TAZ and MAZ zones
- File Size: ~150 MB
Schema:
| Column | Type | Description | Example |
|——–|——|————-|———|
| HHID | Integer | Unique household identifier | 1234567 |
| TAZ_NODE | Integer | TAZ assignment | 1001 |
| MAZ_NODE | Integer | MAZ assignment | 12345 |
| MTCCountyID | Integer | County ID (1-9) | 4 |
| HHINCADJ | Integer | Household income (2010$) | 75000 |
| NWRKRS_ESR | Integer | Number of workers | 2 |
| VEH | Integer | Vehicle availability | 2 |
| NP | Integer | Number of persons | 3 |
| HHT | Integer | Household type | 1 |
| BLD | Integer | Building type | 2 |
| TEN | Integer | Tenure (own/rent) | 1 |
| TYPE | Integer | Housing unit/GQ type | 1 |
2. Synthetic Persons: synthetic_persons_recoded.csv
File Characteristics:
- Record Count: ~3.2 million persons
- Age Range: 0-99+ years
- File Size: ~400 MB
Schema:
| Column | Type | Description | Example |
|——–|——|————-|———|
| HHID | Integer | Household identifier | 1234567 |
| PERID | String | Unique person identifier | “1234567_1” |
| AGEP | Integer | Age in years | 34 |
| SEX | Integer | Gender (1=Male, 2=Female) | 2 |
| SCHL | Integer | Educational attainment | 21 |
| OCCP | Integer | Occupation category | 1 |
| WKHP | Integer | Hours worked per week | 40 |
| WKW | Integer | Weeks worked per year | 50 |
| EMPLOYED | Integer | Employment status | 1 |
| ESR | Integer | Employment status recode | 1 |
| SCHG | Integer | School grade attendance | -9 |
Validation and Summary Outputs
3. Control Summary: summary_melt.csv
Purpose: Comprehensive comparison of synthesis results against control totals
Schema:
| Column | Type | Description | Example |
|——–|——|————-|———|
| geography | String | Geographic level | “TAZ_NODE” |
| id | Integer | Geographic identifier | 1001 |
| variable | String | Control category | “hh_size_2” |
| control | Float | Control total | 150.0 |
| result | Float | Synthesis result | 148.5 |
| diff | Float | Absolute difference | -1.5 |
| pct_diff | Float | Percentage difference | -1.0 |
4. Performance Reports
TAZ-Level Summary: final_summary_TAZ_NODE.csv
- Control vs. result comparisons for all TAZ zones
- Convergence statistics and performance metrics
- Geographic distribution analysis
County-Level Summaries: final_summary_COUNTY_[1-9].csv
- County-specific control matching results
- Occupation category validation
- Population distribution by county
Performance Monitoring and Optimization
Synthesis Performance Characteristics
Processing Time Analysis
Typical Processing Times (Bay Area full synthesis):
Data Loading and Preparation: 5-10 minutes
Initial Seed Balancing: 10-15 minutes
Meta Control Factoring: 2-5 minutes
Final Seed Balancing: 20-30 minutes
Weight Integerization: 15-25 minutes
Sub-Balancing: 10-15 minutes
Household Expansion: 5-10 minutes
Total Synthesis Time: 70-110 minutes
Memory Usage Patterns:
- Initial Loading: 2-3 GB RAM
- Peak Balancing: 6-8 GB RAM
- Integerization: 4-6 GB RAM
- Final Output: 3-4 GB RAM
Convergence Monitoring
Real-Time Progress Tracking:
# Heartbeat logging every 5 minutes
[2024-12-28 10:45:30] [HEARTBEAT] PopulationSim still running... 10:45:30
[2024-12-28 10:45:30] [HEARTBEAT] Current step: integerize_final_seed_weights
[2024-12-28 10:45:30] [HEARTBEAT] Memory usage: 6,847.3 MB
[2024-12-28 10:45:30] [HEARTBEAT] Total elapsed: 45.5 minutes
[2024-12-28 10:45:30] [HEARTBEAT] Status: Integerizing final seed weights (this can take 30+ minutes)
Convergence Criteria Monitoring:
# Algorithm convergence tracking
relative_tolerance = 0.05 # 5% relative error tolerance
absolute_tolerance = 20.0 # ±20 unit absolute tolerance
integer_tolerance = 0.5 # 0.5 unit integer tolerance
max_iterations = 500 # Maximum optimization iterations
Optimization Strategies
Algorithm Configuration
Simultaneous vs. Sequential Balancing:
# Enhanced performance configuration
MAX_BALANCE_ITERATIONS_SIMULTANEOUS: 500 # Faster convergence
MAX_BALANCE_ITERATIONS_SEQUENTIAL: 100000 # Fallback for difficult cases
USE_SIMUL_INTEGERIZER: True # Parallel integerization
SUB_BALANCE_WITH_FLOAT_SEED_WEIGHTS: True # Precision optimization
Memory Management:
# Optimize memory usage
GROUP_BY_INCIDENCE_SIGNATURE: False # Reduce memory for large datasets
INTEGERIZE_WITH_BACKSTOPPED_CONTROLS: True # Stable convergence
max_expansion_factor: 50 # Control extreme weights
Performance Tuning
Control Importance Weighting:
# Hierarchical importance levels
MAZ household totals: 100000 (Highest priority)
TAZ person demographics: 100000 (Critical for accuracy)
TAZ household categories: 10000 (Important for distribution)
County occupation: 10000 (Regional consistency)
Hardware Optimization:
- CPU: 8+ cores recommended for parallel processing
- Memory: 16 GB RAM minimum, 32 GB preferred for large datasets
- Storage: SSD recommended for I/O intensive operations
- Network: High-speed connection for data cache access
Technical Configuration
Software Dependencies
Core PopulationSim Framework
- ActivitySim: 1.0+ (population synthesis engine)
- PopulationSim: 1.0+ (balancing algorithms)
- Python: 3.8+ (runtime environment)
Data Processing Libraries
- Pandas: 1.3+ (data manipulation)
- NumPy: 1.21+ (numerical operations)
- SciPy: 1.7+ (optimization algorithms)
Optimization Solvers
- CVXPY: Optional advanced optimization (disabled by default)
- Simultaneous Integerizer: Built-in parallel optimization (enabled)
Configuration Management
Settings Files
# Primary configuration: settings.yaml
geographies: [COUNTY, PUMA, TAZ_NODE, MAZ_NODE]
seed_geography: PUMA
household_weight_col: WGTP
household_id_col: unique_hh_id
total_hh_control: numhh_gq
Control Specification
# Control definitions: controls.csv
target,geography,seed_table,importance,control_field,expression
numhh_gq,MAZ_NODE,households,100000,numhh_gq,households.unique_hh_id > 0
hh_size_1,TAZ_NODE,households,10000,hh_size_1_gq,households.NP == 1
pers_age_00_19,TAZ_NODE,persons,100000,pers_age_00_19,(persons.AGEP >= 0) & (persons.AGEP <= 19)
File Path Management
# Unified configuration system
from unified_tm2_config import UnifiedTM2Config
config = UnifiedTM2Config()
working_dir = config.POPSIM_WORKING_DIR
data_dir = config.POPSIM_DATA_DIR
output_dir = config.PRIMARY_OUTPUT_DIR
Quality Control Parameters
Validation Thresholds
# Performance acceptance criteria
CONTROL_ACCURACY_THRESHOLD = 0.05 # 5% maximum deviation
CONVERGENCE_TOLERANCE = 0.01 # 1% convergence requirement
MAX_PROCESSING_TIME = 7200 # 2 hour timeout
MIN_HOUSEHOLDS_PER_TAZ = 1 # Minimum viable population
Error Handling
# Robust error management
try:
pipeline.run(models=steps, resume_after=resume_after)
validate_synthesis_results()
generate_performance_reports()
except ConvergenceError:
handle_convergence_failure()
except MemoryError:
optimize_memory_usage()
except ValidationError:
generate_diagnostic_reports()
Conclusion
The TM2 population synthesis and post-processing system represents a sophisticated demographic modeling framework that transforms raw demographic controls into a complete, statistically accurate synthetic population. Through its three-phase approach combining advanced optimization algorithms, comprehensive post-processing, and rigorous validation, the system ensures high-quality synthetic data suitable for transportation planning and policy analysis.
Key Achievements:
- Statistical Accuracy: Achieves >95% control matching accuracy across multiple geographic scales
- Computational Efficiency: Processes complete Bay Area synthesis in 70-110 minutes
- Data Quality: Produces TM2-compatible outputs with comprehensive validation
- Scalability: Handles 39,586 MAZ zones and 32 demographic control categories
Technical Innovations:
- Hierarchical Balancing: Multi-geography optimization ensuring spatial consistency
- Real-Time Monitoring: Progress tracking and performance optimization
- Automated Validation: Comprehensive quality assurance with statistical reporting
- Memory Optimization: Efficient processing of large-scale demographic datasets
Future Enhancements:
- Algorithm Optimization: Advanced solver integration for improved convergence
- Parallel Processing: Multi-core optimization for reduced processing time
- Enhanced Validation: Machine learning-based demographic pattern validation
- Cloud Integration: Scalable processing for regional model expansion
This comprehensive system provides the demographic foundation essential for accurate transportation modeling while maintaining the flexibility to adapt to evolving data sources and modeling requirements.