Create Custom Summaries¶
Guide to defining new validation summaries using YAML configuration.
Overview¶
The validation system is fully config-driven - you can create new summaries without writing Python code. All summaries are defined in validation_config.yaml under the summaries section.
Design Principles
This configuration-driven approach follows the Data-Driven Configuration principle from our design system. For a complete understanding of the system architecture, see the Summary Design System Plan.
What you can configure: - Which data table to use (households, persons, tours, trips) - How to group data (dimensions) - Which metrics to calculate (counts, shares, percentages) - Filters to apply - How to bin continuous variables - How to aggregate categories
No coding required - just edit YAML and regenerate.
Basic Summary Structure¶
Minimal Example¶
summaries:
- name: "auto_ownership_regional"
data_source: "households"
group_by: "num_vehicles"
weight_field: "sample_rate"
count_name: "households"
Output:
num_vehicles,households,share,dataset
0,150000,0.25,2023 TM2.2 v05
1,200000,0.33,2023 TM2.2 v05
2,180000,0.30,2023 TM2.2 v05
Required Fields¶
| Field | Type | Description | Example |
|---|---|---|---|
name |
string | Unique identifier (becomes filename) | "auto_ownership_regional" |
data_source |
string | Which table to query | "households", "persons", "individual_tours", "individual_trips" |
group_by |
string or list | Columns to group by | "num_vehicles" or ["county", "num_vehicles"] |
Optional Fields¶
| Field | Type | Description | Default | Example |
|---|---|---|---|---|
weight_field |
string | Column containing expansion weights | "sample_rate" |
"sample_rate" |
count_name |
string | Name for count metric | Based on data_source | "households", "tours" |
share_within |
string or list | Calculate shares within groups | null (regional) |
"income_category" |
description |
string | Human-readable description | "" |
"Vehicle ownership by income" |
filter |
string | Filter expression (pandas query syntax) | null (no filter) |
"tour_purpose == 'Work'" |
calculate_share |
boolean | Whether to calculate share column | true |
false |
Data Sources¶
The data_source field determines which table is queried:
| Data Source | Description | Available Columns |
|---|---|---|
households |
Household demographics | num_vehicles, num_persons, num_workers, income_category, county, county_name, etc. |
persons |
Person characteristics | person_type, age, gender, cdap, value_of_time, etc. |
individual_tours |
Tour-level data | tour_purpose, tour_mode, start_period, end_period, tour_distance, etc. |
individual_trips |
Trip-level data | trip_mode, origin_purpose, destination_purpose, trip_distance, etc. |
workplace_school_location |
Work/school locations | work_location, school_location, work_location_distance, etc. |
See CTRAMP Data Model for complete column lists.
Column naming:
- After data loading, columns use standardized internal names
- hh_id → household_id
- size → num_persons
- autos → num_vehicles
Use the internal names in your group_by specifications.
Grouping Dimensions¶
Single Dimension¶
Output: One row per CDAP value (M, N, H)
Multiple Dimensions¶
- name: "auto_ownership_by_income"
data_source: "households"
group_by: ["income_category", "num_vehicles"]
count_name: "households"
Output: One row per combination (income × vehicles)
income_category,num_vehicles,households,share,dataset
1,0,50000,0.15,2023 TM2.2 v05
1,1,80000,0.24,2023 TM2.2 v05
1,2,70000,0.21,2023 TM2.2 v05
2,0,40000,0.12,2023 TM2.2 v05
...
Cross-Tabulations¶
For mode × purpose, destination × origin, etc.:
- name: "tour_mode_by_purpose"
data_source: "individual_tours"
group_by: ["tour_purpose", "tour_mode"]
share_within: "tour_purpose" # Share within each purpose
Output: Mode distribution for each tour purpose
Share Calculations¶
Regional Shares (Default)¶
Output: Share = count / total across all records
Share adds to 1.0 (or 100%)
Shares Within Groups¶
- name: "auto_ownership_by_income"
data_source: "households"
group_by: ["income_category", "num_vehicles"]
share_within: "income_category"
Output: Share = count / total within each income category
income_category,num_vehicles,households,share
1,0,50000,0.33 # Share within income=1
1,1,80000,0.53
1,2,20000,0.13
2,0,30000,0.20 # Share within income=2
2,1,90000,0.60
2,2,30000,0.20
Within each income_category, shares add to 1.0.
Multiple Grouping Levels¶
- name: "auto_ownership_by_household_size_county"
data_source: "households"
group_by: ["county", "num_persons_agg", "num_vehicles"]
share_within: ["county", "num_persons_agg"]
Output: Share within each county × household_size combination
county,num_persons_agg,num_vehicles,households,share
Alameda,1,0,57000,0.28 # Share within Alameda 1-person households
Alameda,1,1,100000,0.50
Alameda,1,2,44000,0.22
Alameda,2,0,40000,0.15 # Share within Alameda 2-person households
...
Shares add to 1.0 within each (county, num_persons_agg) group.
Disable Share Calculation¶
For raw counts without percentages:
- name: "time_of_day_tours"
data_source: "individual_tours"
group_by: ["start_period", "end_period"]
calculate_share: false
Filtering Data¶
Use pandas query syntax to filter records before aggregation.
Simple Filters¶
Filter to work tours only:
- name: "work_tour_modes"
data_source: "individual_tours"
filter: "tour_purpose == 'Work'"
group_by: "tour_mode"
Filter to adults (age 18+):
Filter to specific county:
- name: "sf_auto_ownership"
data_source: "households"
filter: "county_name == 'San Francisco'"
group_by: "num_vehicles"
Complex Filters¶
Multiple conditions (AND):
Multiple conditions (OR):
List membership:
Numeric ranges:
String matching:
Filter Execution¶
Filters are applied before aggregation:
- Load data
- Apply filter → Reduce rows
- Group by dimensions
- Calculate counts and shares
Log output:
Binning Continuous Variables¶
Convert continuous variables (distance, time, age, income) into categories.
Define Bins Globally¶
In validation_config.yaml:
binning_specs:
tour_distance:
bins: [0, 5, 10, 20, 50, 1000]
labels: ['0-5', '5-10', '10-20', '20-50', '50+']
worker_age:
bins: [0, 25, 35, 45, 55, 65, 120]
labels: ['<25', '25-34', '35-44', '45-54', '55-64', '65+']
Use Binned Columns¶
The system automatically creates {column}_bin columns:
- name: "tour_distance_distribution"
data_source: "individual_tours"
group_by: "tour_distance_bin" # Uses binning_specs.tour_distance
Output:
tour_distance_bin,tours,share,dataset
0-5,250000,0.40,2023 TM2.2 v05
5-10,180000,0.29,2023 TM2.2 v05
10-20,120000,0.19,2023 TM2.2 v05
20-50,60000,0.10,2023 TM2.2 v05
50+,15000,0.02,2023 TM2.2 v05
Binning Configuration¶
| Field | Description | Example |
|---|---|---|
bins |
Bin edges (inclusive lower, exclusive upper) | [0, 5, 10, 20, 50, 1000] |
labels |
Category labels (one less than bins) | ['0-5', '5-10', '10-20', '20-50', '50+'] |
Bin behavior:
- bins: [0, 5, 10] creates: [0, 5) and [5, 10)
- First bin includes lower edge: [0, 5) → 0 ≤ x < 5
- Last bin includes upper edge: [50, 1000] → 50 ≤ x ≤ 1000
Pre-Configured Bins¶
The system includes these bin specifications:
| Variable | Bins | Labels |
|---|---|---|
tour_distance |
0, 5, 10, 20, 50, 1000 | 0-5, 5-10, 10-20, 20-50, 50+ |
trip_distance |
0, 1, 3, 5, 10, 20, 1000 | 0-1, 1-3, 3-5, 5-10, 10-20, 20+ |
tour_duration |
0, 30, 60, 120, 240, 10000 | 0-30, 30-60, 60-120, 120-240, 240+ |
trip_duration |
0, 5, 10, 15, 30, 60, 10000 | 0-5, 5-10, 10-15, 15-30, 30-60, 60+ |
worker_age |
0, 25, 35, 45, 55, 65, 120 | <25, 25-34, 35-44, 45-54, 55-64, 65+ |
income_category |
0, 30000, 60000, 100000, 150000, 1000000000 | <30K, 30-60K, 60-100K, 100-150K, 150K+ |
Aggregating Categories¶
Combine detailed categories into broader groups (e.g., 4+ person households, Auto/Transit/Active modes).
Define Aggregations Globally¶
In validation_config.yaml:
aggregation_specs:
# Household size: 1, 2, 3, 4+ (match ACS categories)
num_persons_agg:
apply_to: ["num_persons"]
mapping:
1: 1
2: 2
3: 3
4: 4
5: 4
6: 4
7: 4
8: 4
9: 4
10: 4
Use Aggregated Columns¶
The system automatically creates {column}_agg columns:
- name: "auto_ownership_by_household_size_acs"
data_source: "households"
group_by: ["num_persons_agg", "num_vehicles"] # Uses aggregation_specs.num_persons_agg
share_within: "num_persons_agg"
Before aggregation (num_persons):
- 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 (10 categories)
After aggregation (num_persons_agg):
- 1, 2, 3, 4 (4 categories, where 4 = "4+")
Transportation Mode Aggregation¶
Group 17 detailed modes into broader categories:
aggregation_specs:
transportation_mode:
apply_to: ["tour_mode", "trip_mode"]
mapping:
1: "Auto - SOV" # SOV_GP
2: "Auto - SOV" # SOV_PAY
3: "Auto - Shared" # SR2_GP
4: "Auto - Shared" # SR2_HOV
5: "Auto - Shared" # SR2_PAY
6: "Auto - Shared" # SR3_GP
7: "Auto - Shared" # SR3_HOV
8: "Auto - Shared" # SR3_PAY
9: "Active" # WALK
10: "Active" # BIKE
11: "Transit" # WLK_TRN
12: "Transit" # PNR_TRN
13: "Transit" # KNRPRV_TRN
14: "Transit" # KNRTNC_TRN
15: "TNC/Taxi" # TAXI
16: "TNC/Taxi" # TNC
17: "School Bus" # SCHLBUS
Usage:
- name: "tour_mode_choice_aggregated"
data_source: "individual_tours"
group_by: "tour_mode_agg" # Uses transportation_mode aggregation
Output: 6 categories instead of 17
Aggregation vs. Binning¶
| Binning | Aggregation | |
|---|---|---|
| Input | Continuous numeric | Categorical (codes/text) |
| Process | Create ranges | Map values to groups |
| Example | Distance → bins | 17 modes → 6 groups |
| Suffix | _bin |
_agg |
Complete Examples¶
Example 1: Simple Distribution¶
Goal: Vehicle ownership distribution
- name: "auto_ownership_regional"
description: "Regional vehicle ownership distribution"
data_source: "households"
group_by: "num_vehicles"
weight_field: "sample_rate"
count_name: "households"
Output:
num_vehicles,households,share,dataset
0,150000,0.25,2023 TM2.2 v05
1,200000,0.33,2023 TM2.2 v05
2,180000,0.30,2023 TM2.2 v05
3,72000,0.12,2023 TM2.2 v05
Example 2: Cross-Tabulation with Shares¶
Goal: Mode choice by tour purpose
- name: "tour_mode_by_purpose"
description: "Tour mode by purpose cross-tabulation"
data_source: "individual_tours"
group_by: ["tour_purpose", "tour_mode"]
weight_field: "sample_rate"
count_name: "tours"
share_within: "tour_purpose"
Output:
tour_purpose,tour_mode,tours,share,dataset
Work,1,80000,0.40,2023 TM2.2 v05
Work,9,30000,0.15,2023 TM2.2 v05
Work,11,60000,0.30,2023 TM2.2 v05
Shop,1,120000,0.60,2023 TM2.2 v05
Shop,9,80000,0.40,2023 TM2.2 v05
Shares add to 1.0 within each tour_purpose.
Example 3: Filtered + Binned¶
Goal: Work tour distance distribution (work tours only)
- name: "work_tour_distance"
description: "Distance distribution for work tours"
data_source: "individual_tours"
filter: "tour_purpose == 'Work'"
group_by: "tour_distance_bin"
weight_field: "sample_rate"
count_name: "tours"
Output: Distance distribution for work tours only (non-work filtered out)
Example 4: Multi-Level Grouping¶
Goal: Auto ownership by income and household size
- name: "auto_ownership_by_income_and_size"
description: "Vehicle ownership by income and household size"
data_source: "households"
group_by: ["income_category", "num_persons", "num_vehicles"]
weight_field: "sample_rate"
count_name: "households"
share_within: ["income_category", "num_persons"]
Output: Shares add to 1.0 within each (income, household size) combination
Example 5: Aggregated Categories¶
Goal: Simplified mode choice (Auto/Transit/Active/Other)
- name: "tour_mode_choice_aggregated"
description: "Tour mode choice (aggregated categories)"
data_source: "individual_tours"
group_by: "tour_mode_agg"
weight_field: "sample_rate"
count_name: "tours"
Output:
tour_mode_agg,tours,share,dataset
Auto - SOV,450000,0.45,2023 TM2.2 v05
Auto - Shared,200000,0.20,2023 TM2.2 v05
Transit,250000,0.25,2023 TM2.2 v05
Active,80000,0.08,2023 TM2.2 v05
TNC/Taxi,20000,0.02,2023 TM2.2 v05
Example 6: Geographic Summary¶
Goal: Auto ownership by county
- name: "auto_ownership_by_county"
description: "Vehicle ownership by county"
data_source: "households"
group_by: ["county_name", "num_vehicles"]
weight_field: "sample_rate"
count_name: "households"
share_within: "county_name"
Output: Vehicle distribution for each Bay Area county
Common Patterns¶
Pattern 1: Regional Distribution¶
Pattern 2: Conditional Distribution¶
group_by: ["condition", "variable"]
share_within: "condition"
# Share of 'variable' within each 'condition'
Pattern 3: Filtered Subset¶
Pattern 4: Multi-Dimensional¶
group_by: ["dim1", "dim2", "dim3"]
share_within: ["dim1", "dim2"]
# Share within each (dim1, dim2) combination
Troubleshooting¶
Empty Output¶
Causes: - Filter removed all records - Missing column in data - Column has all null values
Solutions: 1. Check filter expression 2. Verify column exists: see Data Model 3. Check for typos in column names
Missing Columns¶
Solution: Add binning spec to binning_specs section for tour_distance
Wrong Data Source¶
Cause: Using data_source: "households" but tour_mode is in "individual_tours"
Solution: Use correct data source table
Shares Don't Add to 1.0¶
Cause: Using wrong share_within grouping
Example:
group_by: ["county", "income", "vehicles"]
share_within: "county" # Wrong - shares within county, not county×income
Fix:
Best Practices¶
- Use descriptive names -
auto_ownership_by_incomenotsummary_1 - Add descriptions - Helps future users understand purpose
- Start simple - Test with single dimension before adding complexity
- Check shares - Verify they add to 1.0 within expected groups
- Use aggregations - For matching external data categories (ACS uses 4+ households)
- Leverage bins - Pre-configured bins for common variables
- Filter early - Reduces processing time
- Test incrementally - Add one feature at a time
Validation¶
Check your summary configuration:
Checks: - Required fields present - Data source valid - Group-by columns exist - Filter syntax correct - Binning specs defined
Next Steps¶
- Generate Summaries - Run the summary generation
- External Data Reference - Location of observed data files
- Data Model Reference - See available columns
- Configuration Guide - YAML documentation