External Observed Data¶

Reference guide for observed data sources used for model validation.

Overview¶

The summaries generated from CTRAMP model runs can be compared against observed data from external sources like census data, employment surveys, and synthetic population outputs.

Available External Data Sources:

PopulationSim Summaries¶

Pre-generated PopulationSim summary files are stored in:

tm2py_utils/summary/validation/outputs/populationsim/

Files included: - households_by_county.csv - Household distribution by county - households_by_income.csv - Household income distribution
- households_by_workers.csv - Household worker distribution - household_size_regional.csv - Regional household size distribution - persons_by_age.csv - Population age distribution

Source: These files are generated from PopulationSim output (aggregate_summaries/) and copied here for reference alongside CTRAMP model outputs.

See the README in that directory for more details.

ACS (American Community Survey) Data¶

ACS observed data files are stored in:

tm2py_utils/summary/validation/outputs/observed/

Files included: - acs_auto_ownership_by_household_size.csv - acs_auto_ownership_by_household_size_county.csv - acs_auto_ownership_by_household_size_regional.csv

Other External Sources¶

Additional external data sources that may be useful for validation: - CTPP (Census Transportation Planning Products) - Journey-to-work patterns - External surveys - Employment surveys, regional studies - Other tabulated data - Pre-aggregated statistics

Note: This is for pre-aggregated summary data from external sources, not household travel survey microdata. Household travel surveys should be formatted to match the CTRAMP data model and processed as model inputs

System Architecture¶

┌──────────────────────────────────────────────┐
│ EXTERNAL DATA (ACS/CTPP/Surveys)             │
│ - Raw census tables                          │
│ - Pre-aggregated summaries                   │
│ - Different column names/formats             │
└──────────────────────────────────────────────┘
                    ↓
        YOU MUST PREPROCESS TO MATCH
        MODEL SUMMARY FORMAT
                    ↓
┌──────────────────────────────────────────────┐
│ PREPROCESSED EXTERNAL DATA                   │
│ - Same columns as model summaries            │
│ - Same categories (1,2,3,4+ households)      │
│ - Same geography (counties match model)      │
│ - CSV format with dataset column             │
└──────────────────────────────────────────────┘
                    ↓
        observed_summaries CONFIG
        (file paths + column mapping)
                    ↓
┌──────────────────────────────────────────────┐
│ SYSTEM MERGES                                │
│ Model Summary + External Data → Combined CSV │
└──────────────────────────────────────────────┘
                    ↓
┌──────────────────────────────────────────────┐
│ DASHBOARD VISUALIZATION                      │
│ Model vs. Observed comparisons               │
└──────────────────────────────────────────────┘

Key principle: You preprocess external data to match model format. The system does NOT automatically convert ACS/CTPP raw data.

Quick Start¶

Step 1: Preprocess External Data¶

Create a CSV matching your model summary format.

Example: ACS auto ownership by household size (regional)

Required columns (must match model summary): - num_persons (or aggregated field name) - num_vehicles - households (count) - share (percentage)

Example file: acs_auto_ownership_by_household_size_regional.csv

num_persons,num_vehicles,households,share
1,0,120000,0.30
1,1,180000,0.45
1,2,100000,0.25
2,0,80000,0.15
2,1,200000,0.38
2,2,250000,0.47
3,0,50000,0.10
3,1,150000,0.30
3,2,300000,0.60
4,0,40000,0.08
4,1,120000,0.24
4,2,340000,0.68

Step 2: Configure in validation_config.yaml¶

Add to observed_summaries section:

observed_summaries:
  - name: "acs_2023"
    display_name: "ACS 2023"
    summaries:
      auto_ownership_by_household_size_acs:
        file: "C:\\path\\to\\acs_auto_ownership_by_household_size_regional.csv"
        columns:
          num_persons_agg: "num_persons"  # Map 'num_persons' to model's 'num_persons_agg'
          num_vehicles: "num_vehicles"
          households: "households"
          share: "share"

Step 3: Create Matching Model Summary¶

Ensure you have a model summary with the same name and columns:

summaries:
  - name: "auto_ownership_by_household_size_acs"
    data_source: "households"
    group_by: ["num_persons_agg", "num_vehicles"]
    share_within: "num_persons_agg"

Step 4: Regenerate Summaries¶

conda activate tm2py-utils
cd C:\GitHub\tm2py-utils\tm2py_utils\summary\validation
python -m tm2py_utils.summary.validation.summaries.run_all --config validation_config.yaml

Output: Combined CSV with model + ACS data:

num_persons_agg,num_vehicles,households,share,dataset
1,0,150000,0.28,2023 TM2.2 v05
1,1,200000,0.37,2023 TM2.2 v05
1,2,190000,0.35,2023 TM2.2 v05
1,0,120000,0.30,ACS 2023
1,1,180000,0.45,ACS 2023
1,2,100000,0.25,ACS 2023
...

Configuration Details¶

observed_summaries Structure¶

observed_summaries:
  - name: "source_identifier"           # Internal name (no spaces)
    display_name: "Display Name"        # Name shown in dashboards
    summaries:
      summary_name_1:                   # Must match a model summary name
        file: "path/to/file.csv"
        columns:                        # Column mapping (model_col: file_col)
          model_column_1: "file_column_1"
          model_column_2: "file_column_2"
      summary_name_2:
        file: "path/to/another_file.csv"
        columns:
          ...

Fields:

Field	Required	Description	Example
`name`	✅	Internal identifier	`"acs_2023"`
`display_name`	✅	Dashboard label	`"ACS 2023"`
`summaries`	✅	Dictionary of summaries to load	See below

Summary configuration:

Field	Required	Description
`file`	✅	Absolute path to CSV file
`columns`	⚠️	Column name mapping (optional if names match exactly)

Column Mapping¶

Maps external data column names to model column names.

Syntax:

columns:
  model_column_name: "external_file_column_name"

Example 1: Column names differ

columns:
  num_persons_agg: "hh_size"      # Model uses 'num_persons_agg', file has 'hh_size'
  num_vehicles: "vehicles"        # Model uses 'num_vehicles', file has 'vehicles'
  households: "count"             # Model uses 'households', file has 'count'
  share: "percentage"             # Model uses 'share', file has 'percentage'

Example 2: Column names match (no mapping needed)

columns:
  num_persons_agg: "num_persons_agg"
  num_vehicles: "num_vehicles"
  households: "households"
  share: "share"

Or omit columns entirely if all names match.

Data Format Requirements¶

Required Columns¶

External data files must contain:

Dimension columns - Same as model summary's group_by
Metric columns - Usually households, persons, tours, or trips
Share column - Percentage (0.0 to 1.0 or 0 to 100)

Do NOT include dataset column - the system adds this automatically.

Data Types¶

Column Type	Format	Example
Categorical dimensions	String or integer	`"Alameda"`, `1`
Count metrics	Integer or float	`150000`, `150000.5`
Shares	Float (0-1)	`0.25` (25%)

Category Alignment¶

Critical: External data categories must match model aggregations.

Example: Household size

Model uses: num_persons_agg with values 1, 2, 3, 4 (4 = "4+")

ACS raw data has: 1, 2, 3, 4, 5, 6, 7+

You must aggregate ACS: 1, 2, 3, 4+ (combine 4, 5, 6, 7+ → 4)

How to aggregate: - See model's aggregation_specs in validation_config.yaml - Match those category definitions in your preprocessing - Use same labels (strings must match exactly)

Geography Alignment¶

County names must match exactly:

Model geography:

Alameda
Contra Costa
Marin
Napa
San Francisco
San Mateo
Santa Clara
Solano
Sonoma

External data must use identical spelling and capitalization.

Preprocessing Examples¶

Example 1: ACS Household Size by Vehicles¶

Source: ACS Table B08201 (Household Size by Vehicles Available)

Raw ACS format:

geography,grouping,universe,share
Bay Area,"Total:",2490000,1.0
Bay Area,"Total: 1-person household:",400000,0.161
Bay Area,"Total: 1-person household: No vehicle available",120000,0.048
Bay Area,"Total: 1-person household: 1 vehicle available",200000,0.080
Bay Area,"Total: 1-person household: 2 vehicles available",80000,0.032
...

Preprocessing script: convert_acs_data.py

import pandas as pd

# Load raw ACS data
df = pd.read_csv('acs_raw.csv')

# Parse grouping labels
def parse_label(label):
    if '1-person household:' in label:
        persons = '1'
    elif '2-person household:' in label:
        persons = '2'
    elif '3-person household:' in label:
        persons = '3'
    elif '4-or-more-person household:' in label:
        persons = '4+'  # Aggregated
    else:
        return None, None

    if 'No vehicle' in label:
        vehicles = 0
    elif '1 vehicle' in label:
        vehicles = 1
    elif '2 vehicles' in label:
        vehicles = 2
    elif '3 vehicles' in label:
        vehicles = 3
    elif '4 or more' in label:
        vehicles = 4
    else:
        return None, None

    return persons, vehicles

# Extract detail rows
records = []
for _, row in df.iterrows():
    persons, vehicles = parse_label(row['grouping'])
    if persons and vehicles is not None:
        records.append({
            'num_persons': persons,
            'num_vehicles': vehicles,
            'households': row['universe'],
            'share': row['share']
        })

result = pd.DataFrame(records)

# Recalculate shares within household size
result['share'] = result.groupby('num_persons')['households'].transform(
    lambda x: x / x.sum()
)

result.to_csv('acs_auto_ownership_by_household_size_regional.csv', index=False)

Output:

num_persons,num_vehicles,households,share
1,0,120000,0.30
1,1,200000,0.50
1,2,80000,0.20
2,0,80000,0.15
...

Example 2: CTPP Journey to Work¶

Source: CTPP Table A302 (Place of Work by Residence)

Goal: Compare commute patterns

Preprocessing:

import pandas as pd

# Load CTPP data
ctpp = pd.read_csv('ctpp_work_flows.csv')

# Map TAZs to counties
taz_to_county = pd.read_csv('taz_county_lookup.csv')

# Aggregate to county-to-county flows
flows = ctpp.merge(
    taz_to_county.rename(columns={'county': 'home_county'}),
    left_on='residence_taz',
    right_on='taz'
).merge(
    taz_to_county.rename(columns={'county': 'work_county'}),
    left_on='workplace_taz',
    right_on='taz'
)

result = flows.groupby(['home_county', 'work_county'])['workers'].sum().reset_index()
result['share'] = result.groupby('home_county')['workers'].transform(lambda x: x / x.sum())

result.to_csv('ctpp_work_location_by_home_county.csv', index=False)

Example 3: External Employment Survey¶

Source: Regional employment survey by industry

Goal: Compare employment distribution

Format to match model:

import pandas as pd

survey = pd.read_csv('employment_survey.csv')

# Map survey categories to model person_types
category_map = {
    'Full-time': 1,
    'Part-time': 2,
    'Student': 3,
    # etc.
}

survey['person_type'] = survey['employment_category'].map(category_map)

result = survey.groupby('person_type')['persons'].sum().reset_index()
result['share'] = result['persons'] / result['persons'].sum()

result.to_csv('survey_employment_distribution.csv', index=False)

Complete Configuration Example¶

Scenario: Compare Model to ACS 2023¶

Model summaries to validate: 1. Auto ownership by household size (regional) 2. Auto ownership by household size (county-level)

External data: - ACS 2023 data, preprocessed to match model format

Configuration:

# validation_config.yaml

# Model summaries (generate from model data)
summaries:
  - name: "auto_ownership_by_household_size_acs"
    description: "Vehicle ownership by household size (ACS categories)"
    data_source: "households"
    group_by: ["num_persons_agg", "num_vehicles"]
    weight_field: "sample_rate"
    count_name: "households"
    share_within: "num_persons_agg"

  - name: "auto_ownership_by_household_size_county"
    description: "Vehicle ownership by household size and county"
    data_source: "households"
    group_by: ["county", "num_persons_agg", "num_vehicles"]
    weight_field: "sample_rate"
    count_name: "households"
    share_within: ["county", "num_persons_agg"]

# External data (load from preprocessed files)
observed_summaries:
  - name: "acs_2023"
    display_name: "ACS 2023"
    summaries:
      # Regional comparison
      auto_ownership_by_household_size_acs:
        file: "C:\\data\\acs\\acs_auto_ownership_by_household_size_regional.csv"
        columns:
          num_persons_agg: "num_persons"
          num_vehicles: "num_vehicles"
          households: "households"
          share: "share"

      # County-level comparison
      auto_ownership_by_household_size_county:
        file: "C:\\data\\acs\\acs_auto_ownership_by_household_size_county.csv"
        columns:
          county: "county"
          num_persons_agg: "num_persons"
          num_vehicles: "num_vehicles"
          households: "households"
          share: "share"

# Aggregation spec (model and ACS must use same categories)
aggregation_specs:
  num_persons_agg:
    apply_to: ["num_persons"]
    mapping:
      1: 1
      2: 2
      3: 3
      4: 4   # 4+ aggregation
      5: 4
      6: 4
      7: 4
      8: 4
      9: 4
      10: 4

Execution and Output¶

Run Summary Generation¶

python -m tm2py_utils.summary.validation.summaries.run_all --config validation_config.yaml

Log output:

INFO - Loading data from 2023_version_05: A:\2023-tm22-dev-version-05\ctramp_output
INFO -   ✓ Loaded households: 2,490,000 records
...
INFO - Loading pre-aggregated summaries from acs_2023: ACS 2023
INFO -   ✓ Loaded auto_ownership_by_household_size_acs: 20 rows from acs_auto_ownership_by_household_size_regional.csv
INFO -   ✓ Loaded auto_ownership_by_household_size_county: 180 rows from acs_auto_ownership_by_household_size_county.csv
...
INFO - Combining multi-run summaries...
INFO -   ✓ Saved auto_ownership_by_household_size_acs.csv: 60 rows (3 datasets)
INFO -   ✓ Saved auto_ownership_by_household_size_county.csv: 539 rows (3 datasets)

Output File Structure¶

Combined file: auto_ownership_by_household_size_acs.csv

num_persons_agg,num_vehicles,households,share,dataset
1,0,150000,0.28,2023 TM2.2 v05
1,1,200000,0.37,2023 TM2.2 v05
1,2,190000,0.35,2023 TM2.2 v05
1,0,130000,0.26,2015 TM2.2 Sprint 04
1,1,210000,0.42,2015 TM2.2 Sprint 04
1,2,160000,0.32,2015 TM2.2 Sprint 04
1,0,120000,0.30,ACS 2023
1,1,180000,0.45,ACS 2023
1,2,100000,0.25,ACS 2023
2,0,80000,0.15,2023 TM2.2 v05
2,1,200000,0.38,2023 TM2.2 v05
...

Dataset column values: - 2023 TM2.2 v05 - From model run - 2015 TM2.2 Sprint 04 - From older model run - ACS 2023 - From external data (display_name in config)

Troubleshooting¶

External Data Not Appearing¶

WARNING - Summary file not found: C:\...\acs_data.csv

Solutions: 1. Check file path is absolute (not relative) 2. Verify file exists at specified location 3. Check for typos in path

Column Not Found¶

KeyError: 'num_persons_agg'

Cause: Column mapping incorrect

Solution: Verify column names in external file match columns mapping

Mismatched Categories¶

Dashboard shows: Model has 4+ households, ACS shows 4, 5, 6, 7+

Cause: External data not aggregated to match model

Solution: Preprocess external data to combine 4, 5, 6, 7+ → "4+"

Wrong Summary Name¶

WARNING - No model summary found matching 'wrong_name'

Cause: observed_summaries key doesn't match any model summary name

Solution: Ensure summaries: keys match summaries[].name in config exactly

Shares Don't Match¶

Example: ACS share = 0.30, but model share calculated differently

Cause: Different share_within grouping

Model:

share_within: ["county", "num_persons_agg"]  # Share within county × household size

External data: Share might be regional (not within groups)

Solution: Recalculate shares in preprocessing to match model's grouping

Best Practices¶

Match aggregations first - Review model's aggregation_specs before preprocessing
Use absolute paths - Avoid relative paths in file specifications
Standardize geography - County names must match exactly (case-sensitive)
Document preprocessing - Keep scripts that generate external data files
Version control - Track which ACS/CTPP year/version you're using
Test with one summary - Validate workflow before adding multiple summaries
Check shares add to 1.0 - Within appropriate grouping levels

Data Source Guidelines¶

ACS (American Community Survey)¶

Recommended tables: - B08201 - Household Size by Vehicles - B08134 - Means of Transportation to Work - B08303 - Travel Time to Work - B19001 - Household Income

Aggregation notes: - Household size: Use 1, 2, 3, 4+ categories - Vehicles: 0, 1, 2, 3, 4+ (ACS has "3 or more", model might have separate 3 and 4+)

CTPP (Census Transportation Planning Products)¶

Recommended tables: - A302 - Place of Work by Residence - A201 - Journey to Work Flows - A103 - Travel Time to Work

Geography notes: - CTPP uses TAZs → aggregate to counties for comparison - Maintain lookup tables for TAZ-to-county mapping

Employment Surveys¶

Considerations: - Map employment categories to model's person_type codes - Ensure sample weights/expansion factors applied - Match reference year to model year

Next Steps¶

Generate Summaries - Run the full summary generation
Deploy Dashboard - Visualize model vs. observed comparisons
Custom Summaries - Create new summary definitions
Data Model Reference - Understand model data format