Comprehensive Output Data Guide
TM2 PopulationSim Deliverables and Analysis Products
Document Version: 1.0
Date: December 2024
Author: PopulationSim Bay Area Team
Table of Contents
- Overview
- Primary Synthesis Outputs
- Validation and Quality Assurance Reports
- Analysis and Visualization Products
- Tableau-Ready Data Products
- Technical and Processing Outputs
- Output Usage Guidelines
- Data Quality Standards
Overview
The TM2 PopulationSim system generates a comprehensive suite of outputs designed to support transportation modeling, policy analysis, and demographic research. This guide documents all deliverables, their specifications, and appropriate use cases for each output product.
Output Categories
The system produces eight primary categories of outputs:
- Primary Synthesis Outputs: TM2-formatted synthetic population data
- Validation Reports: Quality assurance and performance metrics
- Analysis Products: Demographic summaries and comparative analysis
- Visualization Products: Charts, maps, and interactive dashboards
- Tableau Data: Business intelligence-ready datasets
- Technical Outputs: Processing logs and diagnostic information
- Raw Synthesis Results: Unprocessed PopulationSim engine outputs
- Reference Documentation: Metadata and usage guides
Output Architecture
PopulationSim Engine → Raw Outputs → Post-Processing → Final TM2 Outputs
↓ ↓ ↓
Validation → Analysis → Visualization → Tableau Products
Critical Success Metrics
Data Volume: ~1.4M households, ~3.2M persons for complete Bay Area
Geographic Coverage: 39,586 MAZ zones, 4,734 TAZ zones, 9 counties
Quality Standards: >95% control matching accuracy, <5% demographic deviation
Format Compliance: TM2 schema requirements, ActivitySim compatibility
Primary Synthesis Outputs
TM2 Synthetic Households
File: households_2023_tm2.csv
Location: output_2023/populationsim_working_dir/output/
File Size: ~207 MB
Record Count: ~1,400,000 households
Update Frequency: Generated with each synthesis run
Required Fields Specification
Contains household-level characteristics including location, income, size, and structure.
| Column | Type | Description | Range/Values | Used By | Data Source/Notes |
|---|---|---|---|---|---|
HHID |
Integer | Unique household ID | 1 - 1,400,000 | HouseholdDataManager | Primary key, must be unique across all households |
TAZ_NODE |
Integer | TAZ of residence | 1 - 1,454 | HouseholdDataManager | Traffic Analysis Zone (1454 zones in Bay Area) |
MAZ_NODE |
Integer | MAZ of residence | 1 - 39,586 | HouseholdDataManager | Micro Analysis Zone (more detailed than TAZ) |
MAZ_SEQ |
Integer | Sequential MAZ ID | 1 - 39,586 | HouseholdDataManager | Index mapping |
TAZ_SEQ |
Integer | Sequential TAZ ID | 1 - 4,734 | HouseholdDataManager | Index mapping |
TAZ_ORIGINAL |
Integer | Pre-remapping TAZ | Historical | HouseholdDataManager | Version control |
MAZ_ORIGINAL |
Integer | Pre-remapping MAZ | Historical | HouseholdDataManager | Version control |
MTCCountyID |
Integer | County of residence | 1 - 9 | HouseholdDataManager | MTC county identifier code |
HHINCADJ |
Integer | Household income in 2010 dollars | 0 - 700,000 | HouseholdDataManager | Inflation-adjusted income for consistent modeling |
NWRKRS_ESR |
Integer | Number of workers | 0 - 20 | HouseholdDataManager | Count of employed persons, ranges 0-20 |
VEH |
Integer | Number of vehicles owned | 0 - 6, -9 | HouseholdDataManager | From PUMS: 0-6 vehicles, -9 for group quarters |
NP |
Integer | Number of persons in household | 1 - 20 | HouseholdDataManager | From PUMS: ranges 1-20 persons |
HHT |
Integer | Household type | 1 - 7, -9 | HouseholdDataManager | PUMS household type classification |
BLD |
Integer | Units in structure | 1 - 10, -9 | HouseholdDataManager | PUMS building type classification |
TYPE |
Integer | Type of unit | 1 - 3 | HouseholdDataManager | Housing unit vs. group quarters |
Household Type Codes (HHT)
Based on PUMS classification:
| Code | Description |
|---|---|
| 1 | Married-couple family household |
| 2 | Other family household, Male householder, no wife present |
| 3 | Other family household, Female householder, no husband present |
| 4 | Nonfamily household, Male householder, Living alone |
| 5 | Nonfamily household, Male householder, Not living alone |
| 6 | Nonfamily household, Female householder, Living alone |
| 7 | Nonfamily household, Female householder, Not living alone |
| -9 | N/A recoded for group quarters |
Building Type Codes (BLD)
Based on PUMS units in structure classification:
| Code | Description |
|---|---|
| 1 | Mobile home or trailer |
| 2 | One-family house detached |
| 3 | One-family house attached |
| 4 | 2 Apartments |
| 5 | 3-4 Apartments |
| 6 | 5-9 Apartments |
| 7 | 10-19 Apartments |
| 8 | 20-49 Apartments |
| 9 | 50 or more apartments |
| 10 | Boat, RV, van, etc. |
| -9 | N/A recoded for group quarters |
Unit Type Codes (TYPE)
| Code | Description |
|---|---|
| 1 | Housing unit |
| 2 | Institutional group quarters (typically excluded) |
| 3 | Noninstitutional group quarters |
Data Quality Specifications
Geographic Completeness: 100% of households assigned to valid TAZ/MAZ zones
Income Distribution: Matches ACS 2023 1-year estimates within ±5%
Household Size: Aligns with Census demographic patterns
Missing Values: Coded as -9 (standard TM2 convention)
Usage Applications
Transportation Modeling: Input to TM2 travel demand model
Land Use Planning: Household density and composition analysis
Economic Analysis: Income distribution and spending power assessment
Housing Policy: Tenure patterns and housing type analysis
TM2 Synthetic Persons
File: persons_2023_tm2.csv
Location: output_2023/populationsim_working_dir/output/
File Size: ~330 MB
Record Count: ~3,200,000 persons
Update Frequency: Generated with each synthesis run
Required Fields Specification
Contains individual-level characteristics for all persons within households.
| Column | Type | Description | Range/Values | Used By | Data Source/Notes |
|---|---|---|---|---|---|
HHID |
Integer | Household identifier | 1 - 1,400,000 | HouseholdDataManager | Foreign key linking to households table |
PERID |
String | Unique person identifier | “HHID_#” | HouseholdDataManager | Primary key, unique across all persons |
AGEP |
Integer | Age of person | 0 - 99 | HouseholdDataManager | From PUMS: ranges 0-99 years |
SEX |
Integer | Sex of person | 1, 2 | HouseholdDataManager | From PUMS: 1=Male, 2=Female |
SCHL |
Integer | Educational attainment | -9, 1 - 16 | HouseholdDataManager | PUMS education level codes |
OCCP |
Integer | Occupation category | -999, 1 - 6 | HouseholdDataManager | Recoded from PUMS occupation data |
WKHP |
Integer | Usual hours worked per week | -9, 1 - 99 | HouseholdDataManager | From PUMS: -9 if N/A, otherwise 1-99 |
WKW |
Integer | Weeks worked during past 12 months | -9, 1 - 6 | HouseholdDataManager | PUMS work weeks classification |
EMPLOYED |
Integer | Employment status | 0, 1 | HouseholdDataManager | Derived from ESR: 1=Employed, 0=Unemployed |
ESR |
Integer | Employment status recode | 0 - 6 | HouseholdDataManager | PUMS employment status |
SCHG |
Integer | Grade level attending | -9, 1 - 7 | HouseholdDataManager | PUMS school attendance level |
hhgqtype |
Integer | Household/GQ type | 1, 2 | HouseholdDataManager | 1=HH, 2=GQ |
person_type |
Integer | Employment-based type | 1 - 8 | HouseholdDataManager | Travel modeling |
Educational Attainment Codes (SCHL)
Based on PUMS education classification:
| Code | Description |
|---|---|
| -9 | N/A recoded for less than 3 years old |
| 1 | No schooling completed |
| 2 | Nursery school to grade 4 |
| 3 | Grade 5 or grade 6 |
| 4 | Grade 7 or grade 8 |
| 5 | Grade 9 |
| 6 | Grade 10 |
| 7 | Grade 11 |
| 8 | 12th grade, no diploma |
| 9 | High school graduate |
| 10 | Some college, but less than 1 year |
| 11 | One or more years of college, no degree |
| 12 | Associate’s degree |
| 13 | Bachelor’s degree |
| 14 | Master’s degree |
| 15 | Professional school degree |
| 16 | Doctorate degree |
Occupation Codes (OCCP)
Recoded from PUMS SOCP codes in create_seed_population.py:
| Code | Description |
|---|---|
| -999 | N/A (under 16 or not in labor force >5 years) |
| 1 | Management |
| 2 | Professional |
| 3 | Services |
| 4 | Retail |
| 5 | Manual |
| 6 | Military |
Employment Status Codes (ESR)
Based on PUMS employment status recode:
| Code | Description |
|---|---|
| 0 | N/A recoded for persons less than 16 years old |
| 1 | Civilian employed, at work |
| 2 | Civilian employed, with a job but not at work |
| 3 | Unemployed |
| 4 | Armed forces, at work |
| 5 | Armed forces, with a job but not at work |
| 6 | Not in labor force |
Work Weeks Classification (WKW)
| Code | Description |
|---|---|
| -9 | N/A (under 16 or didn’t work past 12 months) |
| 1 | 50 to 52 weeks |
| 2 | 48 to 49 weeks |
| 3 | 40 to 47 weeks |
| 4 | 27 to 39 weeks |
| 5 | 14 to 26 weeks |
| 6 | 13 weeks or less |
School Grade Level (SCHG)
Current school attendance level:
| Code | Description |
|---|---|
| -9 | N/A (not attending school) |
| 1 | Nursery school/preschool |
| 2 | Kindergarten |
| 3 | Grade 1 to grade 4 |
| 4 | Grade 5 to grade 8 |
| 5 | Grade 9 to grade 12 |
| 6 | College undergraduate |
| 7 | Graduate or professional school |
Data Quality Specifications
Age Distribution: Matches ACS demographic pyramids within ±5%
Employment Rates: Aligns with Bureau of Labor Statistics data
Educational Attainment: Consistent with Census educational distributions
Household Linkage: 100% of persons linked to valid households
Usage Applications
Travel Demand Modeling: Person-level trip generation and mode choice
Labor Market Analysis: Employment patterns and workforce characteristics
Education Planning: School enrollment and educational service needs
Demographic Research: Population structure and social characteristics
Validation and Quality Assurance Reports
Control vs. Result Summary
File: summary_melt.csv
Location: output_2023/populationsim_working_dir/output/
Purpose: Comprehensive comparison of synthesis results against demographic controls
Schema Specification
| Column | Type | Description | Example |
|---|---|---|---|
geography |
String | Geographic level | “TAZ_NODE”, “COUNTY” |
id |
Integer | Geographic identifier | 1001, 4 |
variable |
String | Control category | “hh_size_2”, “inc_lt_20k” |
control |
Float | Target control total | 150.0 |
result |
Float | Synthesis result | 148.5 |
diff |
Float | Absolute difference | -1.5 |
pct_diff |
Float | Percentage difference | -1.0% |
Quality Metrics
Control Categories: 32 demographic control variables
Geographic Levels: TAZ, MAZ, County (varying by control type)
Accuracy Standards: 95% of controls within ±5% tolerance
Coverage: All Bay Area geographic zones included
Geographic Summary Reports
TAZ-Level Summary
File: final_summary_TAZ_NODE.csv
Record Count: ~4,734 TAZ zones
Purpose: TAZ-level control matching and convergence statistics
Key Metrics:
- Control vs. result comparisons for all TAZ controls
- Household size, income, age, and worker distributions
- Convergence statistics and optimization performance
- Geographic clustering and spatial distribution patterns
County-Level Summaries
Files: final_summary_COUNTY_[1-9].csv
Record Count: 9 county files
Purpose: County-specific validation and occupation category analysis
Content:
- Person occupation distributions by county
- Employment category validation against ACS targets
- Regional consistency checks and scaling factor validation
- County-level demographic summary statistics
Processing Performance Reports
Memory and Timing Logs
Files:
mem.csv: Memory usage tracking during synthesistiming_log.csv: Step-by-step processing time analysispopulationsim.log: Comprehensive processing log
Performance Metrics:
- Peak memory usage: 6-8 GB during optimization
- Total processing time: 70-110 minutes for full synthesis
- Convergence iterations: Typically 200-400 iterations
- Step-level timing for optimization opportunities
Analysis and Visualization Products
Comprehensive Analysis Reports
Full Dataset Analysis
File: FULL_DATASET_ANALYSIS.md
Location: output_2023/
Purpose: Complete statistical analysis of synthetic population
Content:
- Demographic distribution summaries
- Geographic clustering analysis
- Income and employment pattern assessment
- Quality validation against reference datasets
- Trend analysis and historical comparisons
One-Page Executive Summary
File: ONE_PAGE_SUMMARY.md
Location: output_2023/
Purpose: Executive briefing on synthesis results and key findings
Highlights:
- Key demographic statistics and trends
- Control matching performance summary
- Geographic distribution highlights
- Quality assurance certification
- Recommendations for model applications
Interactive Analysis Products
Synthetic Population Dashboard
File: synthetic_population_analysis_2023.html
Location: output_2023/
Type: Interactive HTML dashboard with embedded visualizations
Features:
- County-level demographic comparisons
- Income distribution interactive charts
- Age pyramid visualizations
- Employment pattern analysis
- Geographic distribution mapping
Chart and Visualization Outputs
County Analysis Charts
Location: output_2023/charts/county_analysis/
Format: PNG and SVG chart files
Chart Types:
- Demographic pyramid comparisons
- Income distribution histograms
- Employment category breakdowns
- Household composition analysis
- Population density visualizations
TAZ Analysis Charts
Location: output_2023/charts/taz_analysis/
Format: PNG and SVG chart files
Chart Types:
- TAZ-level population distribution maps
- Household density heat maps
- Control matching accuracy plots
- Spatial clustering analysis
- Transportation zone performance metrics
Tableau-Ready Data Products
Geographic Boundaries
TAZ Boundaries
Files: taz_boundaries_tableau.* (shapefile set)
Location: output_2023/tableau/
Purpose: Transportation Analysis Zone boundaries for mapping
Specifications:
- Coordinate System: NAD83 California Albers
- Zone Count: 4,734 TAZ zones
- Join Field:
TAZ_ID(Integer) - Attributes: Zone identifiers, area, perimeter
PUMA Boundaries
Files: puma_boundaries_tableau.* (shapefile set)
Location: output_2023/tableau/
Purpose: Public Use Microdata Area boundaries for regional analysis
Specifications:
- Zone Count: 104 Bay Area PUMAs
- Join Fields:
PUMA_ID(String),PUMA_ID_INT(Integer) - Attributes: PUMA codes, county assignments, population totals
Control and Results Data
TAZ Controls and Results
File: taz_controls_results_tableau.csv
Location: output_2023/tableau/
Purpose: TAZ-level demographic data formatted for business intelligence
Schema:
TAZ_ID: Transportation Analysis Zone identifier- Control categories: All 32 demographic control variables
- Results: Synthesis outcomes for each control
- Performance metrics: Accuracy, convergence, and quality indicators
Geographic Crosswalk
File: geo_crosswalk_tableau.csv
Location: output_2023/tableau/
Purpose: Geographic relationship mapping for cross-level analysis
Relationships:
- MAZ ↔ TAZ ↔ County ↔ PUMA mappings
- Standardized join fields for seamless Tableau integration
- Population and household allocation factors
- Geographic hierarchy navigation
Tableau Integration Guide
File: README_Tableau_Data.md
Location: output_2023/tableau/
Purpose: Complete integration instructions for Tableau users
Content:
- Standardized join field specifications
- Recommended visualization approaches
- Data quality notes and limitations
- Performance optimization tips
- Example dashboard templates
Technical and Processing Outputs
PopulationSim Engine Outputs
Raw Synthesis Results
Files:
synthetic_households.csv: Unprocessed household resultssynthetic_persons.csv: Unprocessed person results
Purpose: Intermediate results before TM2 formatting and recoding
Usage: Debugging, algorithm validation, research applications
Processing Configuration
Files:
settings.yaml: PopulationSim algorithm configurationcontrols.csv: Control specification and importance weightslogging.yaml: Processing log configuration
Intermediate Processing Files
Seed Population Files
Location: output_2023/populationsim_working_dir/data/
Files:
seed_households.csv: PUMS-based seed household dataseed_persons.csv: PUMS-based seed person data
Purpose: Input data for synthesis engine, processed from raw PUMS
Control Data Files
Files:
maz_marginals_hhgq.csv: MAZ-level control totalstaz_marginals_hhgq.csv: TAZ-level control totalscounty_marginals.csv: County-level control totals
Purpose: Demographic targets for synthesis optimization
Geographic Files
Files:
geo_cross_walk_tm2_maz.csv: Primary geographic crosswalkgeo_cross_walk_tm2_block10.csv: Extended crosswalk with block groupsmazs_tazs_all_geog.csv: Complete geographic hierarchy
Diagnostic and Debug Outputs
Processing Logs
Files:
populationsim.log: Complete synthesis processing logpostprocess_recode.log: Post-processing operation log
Content:
- Step-by-step processing details
- Error messages and warnings
- Performance timing information
- Memory usage tracking
- Algorithm convergence monitoring
Pipeline State
File: pipeline.h5
Purpose: HDF5 format checkpoint data for pipeline state preservation
Usage: Resume interrupted processing, algorithm debugging
Output Usage Guidelines
Primary Use Cases by Audience
Transportation Planners
Primary Files:
households_2023_tm2.csv: TM2 travel demand model inputpersons_2023_tm2.csv: Person-level travel behavior analysis
Applications:
- Trip generation modeling
- Mode choice analysis
- Transportation equity assessment
- Infrastructure planning support
Policy Analysts
Primary Files:
- Tableau data products:
taz_controls_results_tableau.csv - Analysis reports:
FULL_DATASET_ANALYSIS.md
Applications:
- Demographic impact analysis
- Housing policy evaluation
- Economic development planning
- Social equity assessment
Researchers and Academics
Primary Files:
- Raw synthesis outputs:
synthetic_households.csv,synthetic_persons.csv - Validation reports:
summary_melt.csv
Applications:
- Population synthesis methodology research
- Demographic modeling validation
- Algorithm performance analysis
- Academic publication support
Data Analysts and Visualization Specialists
Primary Files:
- Tableau-ready datasets:
output_2023/tableau/ - Interactive dashboards:
synthetic_population_analysis_2023.html
Applications:
- Business intelligence dashboard creation
- Interactive visualization development
- Executive reporting and presentation
- Public engagement and communication
Data Integration Workflows
TM2 Travel Model Integration
Input Preparation:
- Use
households_2023_tm2.csvandpersons_2023_tm2.csvas primary inputs - Validate schema compliance with TM2 requirements
- Perform data quality checks using validation reports
- Configure ActivitySim with PopulationSim outputs
Quality Assurance:
- Verify household-person linkage integrity
- Check geographic assignment completeness
- Validate demographic distribution patterns
- Confirm income and employment consistency
Business Intelligence Integration
Tableau Workflow:
- Import geographic boundaries:
*_boundaries_tableau.* - Connect data sources:
*_tableau.csvfiles - Establish joins using standardized ID fields
- Apply recommended visualization templates
- Implement performance optimization guidelines
Power BI/Other BI Tools:
- Use CSV data products with standardized schemas
- Apply geographic joins using crosswalk files
- Implement data refresh procedures for updated synthesis runs
Data Quality Standards
Validation Criteria
Statistical Accuracy
Control Matching Standards:
- TAZ household totals: 95% within ±5% of control targets
- County demographics: 90% within ±10% of ACS estimates
- Regional population: Within ±1% of ACS 1-year regional total
- Income distributions: Statistical significance testing against ACS data
Geographic Consistency:
- 100% household assignment to valid TAZ/MAZ zones
- Complete county coverage for all Bay Area households
- Proper aggregation from MAZ → TAZ → County levels
- Spatial clustering patterns consistent with urban development
Data Integrity Checks
Referential Integrity:
- All persons linked to valid household records
- Geographic identifiers present and valid for all records
- Missing value coding consistent (-9 standard)
- Data type compliance with TM2 specifications
Logical Consistency:
- Household size matches person count per household
- Employment status consistent with occupation and income
- Age-appropriate school attendance and employment
- Vehicle availability reasonable for household characteristics
Quality Assurance Process
Automated Validation
Real-Time Checks:
# Example validation checks applied to all outputs
def validate_output_quality(households_df, persons_df):
checks = {
'household_person_linkage': validate_hh_person_links(households_df, persons_df),
'geographic_completeness': check_geographic_coverage(households_df),
'demographic_plausibility': assess_demographic_patterns(households_df, persons_df),
'control_matching_accuracy': compare_to_controls(households_df, persons_df),
'schema_compliance': verify_tm2_schema(households_df, persons_df)
}
return all(checks.values())
Manual Review Process
Quality Gates:
- Statistical Review: Demographic patterns assessment by subject matter experts
- Geographic Review: Spatial distribution analysis and validation
- Trend Analysis: Comparison with historical synthesis results
- External Validation: Cross-reference with alternative data sources
Documentation Requirements:
- Quality assurance certification for each synthesis run
- Deviation reporting for any controls outside tolerance ranges
- Peer review documentation for methodology changes
- Version control tracking for all output files
Error Handling and Recovery
Data Quality Issues
Common Issues:
- Geographic assignment failures: <0.1% of records
- Control matching failures: Address through algorithm tuning
- Schema compliance errors: Automated correction procedures
- Missing value patterns: Document and justify exceptions
Recovery Procedures:
- Automated re-processing for minor data quality issues
- Manual intervention protocols for significant deviations
- Rollback procedures for unacceptable quality outcomes
- Documentation requirements for all quality interventions
Conclusion
The TM2 PopulationSim output suite provides comprehensive support for transportation modeling, policy analysis, and demographic research through a carefully designed hierarchy of data products. Each output serves specific use cases while maintaining consistency and quality across the entire product ecosystem.
Key Strengths:
Comprehensive Coverage: From raw technical outputs to polished business intelligence products
- Complete demographic and geographic coverage of Bay Area
- Multiple formats optimized for different analytical workflows
- Comprehensive validation and quality assurance documentation
Quality Assurance: Multi-layered validation ensures reliability and accuracy
- Automated validation checks throughout processing pipeline
- Statistical accuracy standards exceeding industry benchmarks
- Comprehensive documentation supporting reproducibility and transparency
User-Focused Design: Outputs tailored to specific user communities and applications
- TM2-compliant formats for transportation modeling
- Tableau-ready products for business intelligence applications
- Research-grade outputs supporting academic and policy analysis
Integration-Ready: Seamless integration with downstream modeling and analysis systems
- Standardized schemas enabling automated processing
- Geographic consistency supporting spatial analysis
- Performance optimization supporting large-scale applications
This comprehensive output framework ensures that PopulationSim synthesis results effectively support the full spectrum of Bay Area planning, policy, and research applications while maintaining the highest standards of data quality and analytical rigor.