Create Custom Summaries¶

Guide to defining new validation summaries using YAML configuration.

Overview¶

The validation system is fully config-driven - you can create new summaries without writing Python code. All summaries are defined in validation_config.yaml under the summaries section.

Design Principles

This configuration-driven approach follows the Data-Driven Configuration principle from our design system. For a complete understanding of the system architecture, see the Summary Design System Plan.

What you can configure: - Which data table to use (households, persons, tours, trips) - How to group data (dimensions) - Which metrics to calculate (counts, shares, percentages) - Filters to apply - How to bin continuous variables - How to aggregate categories

No coding required - just edit YAML and regenerate.

Basic Summary Structure¶

Minimal Example¶

summaries:
  - name: "auto_ownership_regional"
    data_source: "households"
    group_by: "num_vehicles"
    weight_field: "sample_rate"
    count_name: "households"

Output:

num_vehicles,households,share,dataset
0,150000,0.25,2023 TM2.2 v05
1,200000,0.33,2023 TM2.2 v05
2,180000,0.30,2023 TM2.2 v05

Required Fields¶

Field	Type	Description	Example
`name`	string	Unique identifier (becomes filename)	`"auto_ownership_regional"`
`data_source`	string	Which table to query	`"households"`, `"persons"`, `"individual_tours"`, `"individual_trips"`
`group_by`	string or list	Columns to group by	`"num_vehicles"` or `["county", "num_vehicles"]`

Optional Fields¶

Field	Type	Description	Default	Example
`weight_field`	string	Column containing expansion weights	`"sample_rate"`	`"sample_rate"`
`count_name`	string	Name for count metric	Based on data_source	`"households"`, `"tours"`
`share_within`	string or list	Calculate shares within groups	`null` (regional)	`"income_category"`
`description`	string	Human-readable description	`""`	`"Vehicle ownership by income"`
`filter`	string	Filter expression (pandas query syntax)	`null` (no filter)	`"tour_purpose == 'Work'"`
`calculate_share`	boolean	Whether to calculate share column	`true`	`false`

Data Sources¶

The data_source field determines which table is queried:

Data Source	Description	Available Columns
`households`	Household demographics	`num_vehicles`, `num_persons`, `num_workers`, `income_category`, `county`, `county_name`, etc.
`persons`	Person characteristics	`person_type`, `age`, `gender`, `cdap`, `value_of_time`, etc.
`individual_tours`	Tour-level data	`tour_purpose`, `tour_mode`, `start_period`, `end_period`, `tour_distance`, etc.
`individual_trips`	Trip-level data	`trip_mode`, `origin_purpose`, `destination_purpose`, `trip_distance`, etc.
`workplace_school_location`	Work/school locations	`work_location`, `school_location`, `work_location_distance`, etc.

See CTRAMP Data Model for complete column lists.

Column naming: - After data loading, columns use standardized internal names - hh_id → household_id - size → num_persons - autos → num_vehicles

Use the internal names in your group_by specifications.

Grouping Dimensions¶

Single Dimension¶

- name: "cdap_distribution"
  data_source: "persons"
  group_by: "cdap"
  count_name: "persons"

Output: One row per CDAP value (M, N, H)

Multiple Dimensions¶

- name: "auto_ownership_by_income"
  data_source: "households"
  group_by: ["income_category", "num_vehicles"]
  count_name: "households"

Output: One row per combination (income × vehicles)

income_category,num_vehicles,households,share,dataset
1,0,50000,0.15,2023 TM2.2 v05
1,1,80000,0.24,2023 TM2.2 v05
1,2,70000,0.21,2023 TM2.2 v05
2,0,40000,0.12,2023 TM2.2 v05
...

Cross-Tabulations¶

For mode × purpose, destination × origin, etc.:

- name: "tour_mode_by_purpose"
  data_source: "individual_tours"
  group_by: ["tour_purpose", "tour_mode"]
  share_within: "tour_purpose"  # Share within each purpose

Output: Mode distribution for each tour purpose

Regional Shares (Default)¶

- name: "auto_ownership_regional"
  data_source: "households"
  group_by: "num_vehicles"

Output: Share = count / total across all records

num_vehicles,households,share
0,150000,0.25
1,200000,0.33
2,180000,0.30
3,120000,0.20

Share adds to 1.0 (or 100%)

Shares Within Groups¶

- name: "auto_ownership_by_income"
  data_source: "households"
  group_by: ["income_category", "num_vehicles"]
  share_within: "income_category"

Output: Share = count / total within each income category

income_category,num_vehicles,households,share
1,0,50000,0.33    # Share within income=1
1,1,80000,0.53
1,2,20000,0.13
2,0,30000,0.20    # Share within income=2
2,1,90000,0.60
2,2,30000,0.20

Within each income_category, shares add to 1.0.

Multiple Grouping Levels¶

- name: "auto_ownership_by_household_size_county"
  data_source: "households"
  group_by: ["county", "num_persons_agg", "num_vehicles"]
  share_within: ["county", "num_persons_agg"]

Output: Share within each county × household_size combination

county,num_persons_agg,num_vehicles,households,share
Alameda,1,0,57000,0.28    # Share within Alameda 1-person households
Alameda,1,1,100000,0.50
Alameda,1,2,44000,0.22
Alameda,2,0,40000,0.15    # Share within Alameda 2-person households
...

Shares add to 1.0 within each (county, num_persons_agg) group.

For raw counts without percentages:

- name: "time_of_day_tours"
  data_source: "individual_tours"
  group_by: ["start_period", "end_period"]
  calculate_share: false

Filtering Data¶

Use pandas query syntax to filter records before aggregation.

Simple Filters¶

Filter to work tours only:

- name: "work_tour_modes"
  data_source: "individual_tours"
  filter: "tour_purpose == 'Work'"
  group_by: "tour_mode"

Filter to adults (age 18+):

- name: "adult_cdap"
  data_source: "persons"
  filter: "age >= 18"
  group_by: "cdap"

Filter to specific county:

- name: "sf_auto_ownership"
  data_source: "households"
  filter: "county_name == 'San Francisco'"
  group_by: "num_vehicles"

Complex Filters¶

Multiple conditions (AND):

filter: "age >= 18 and age <= 65"

Multiple conditions (OR):

filter: "tour_purpose == 'Work' or tour_purpose == 'School'"

List membership:

filter: "tour_purpose in ['Work', 'School', 'University']"

Numeric ranges:

filter: "tour_distance >= 5 and tour_distance < 20"

String matching:

filter: "county_name.str.contains('San')"

Filter Execution¶

Filters are applied before aggregation:

Load data
Apply filter → Reduce rows
Group by dimensions
Calculate counts and shares

Log output:

INFO - ✓ Applied filter: 'tour_purpose == "Work"' (1,500,000 → 450,000 rows)

Binning Continuous Variables¶

Convert continuous variables (distance, time, age, income) into categories.

Define Bins Globally¶

In validation_config.yaml:

binning_specs:
  tour_distance:
    bins: [0, 5, 10, 20, 50, 1000]
    labels: ['0-5', '5-10', '10-20', '20-50', '50+']

  worker_age:
    bins: [0, 25, 35, 45, 55, 65, 120]
    labels: ['<25', '25-34', '35-44', '45-54', '55-64', '65+']

Use Binned Columns¶

The system automatically creates {column}_bin columns:

- name: "tour_distance_distribution"
  data_source: "individual_tours"
  group_by: "tour_distance_bin"  # Uses binning_specs.tour_distance

Output:

tour_distance_bin,tours,share,dataset
0-5,250000,0.40,2023 TM2.2 v05
5-10,180000,0.29,2023 TM2.2 v05
10-20,120000,0.19,2023 TM2.2 v05
20-50,60000,0.10,2023 TM2.2 v05
50+,15000,0.02,2023 TM2.2 v05

Binning Configuration¶

Field	Description	Example
`bins`	Bin edges (inclusive lower, exclusive upper)	`[0, 5, 10, 20, 50, 1000]`
`labels`	Category labels (one less than bins)	`['0-5', '5-10', '10-20', '20-50', '50+']`

Bin behavior: - bins: [0, 5, 10] creates: [0, 5) and [5, 10) - First bin includes lower edge: [0, 5) → 0 ≤ x < 5 - Last bin includes upper edge: [50, 1000] → 50 ≤ x ≤ 1000

Pre-Configured Bins¶

The system includes these bin specifications:

Variable	Bins	Labels
`tour_distance`	0, 5, 10, 20, 50, 1000	0-5, 5-10, 10-20, 20-50, 50+
`trip_distance`	0, 1, 3, 5, 10, 20, 1000	0-1, 1-3, 3-5, 5-10, 10-20, 20+
`tour_duration`	0, 30, 60, 120, 240, 10000	0-30, 30-60, 60-120, 120-240, 240+
`trip_duration`	0, 5, 10, 15, 30, 60, 10000	0-5, 5-10, 10-15, 15-30, 30-60, 60+
`worker_age`	0, 25, 35, 45, 55, 65, 120	<25, 25-34, 35-44, 45-54, 55-64, 65+
`income_category`	0, 30000, 60000, 100000, 150000, 1000000000	<30K, 30-60K, 60-100K, 100-150K, 150K+

Aggregating Categories¶

Combine detailed categories into broader groups (e.g., 4+ person households, Auto/Transit/Active modes).

Define Aggregations Globally¶

In validation_config.yaml:

aggregation_specs:
  # Household size: 1, 2, 3, 4+ (match ACS categories)
  num_persons_agg:
    apply_to: ["num_persons"]
    mapping:
      1: 1
      2: 2
      3: 3
      4: 4
      5: 4
      6: 4
      7: 4
      8: 4
      9: 4
      10: 4

Use Aggregated Columns¶

The system automatically creates {column}_agg columns:

- name: "auto_ownership_by_household_size_acs"
  data_source: "households"
  group_by: ["num_persons_agg", "num_vehicles"]  # Uses aggregation_specs.num_persons_agg
  share_within: "num_persons_agg"

Before aggregation (num_persons): - 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 (10 categories)

After aggregation (num_persons_agg): - 1, 2, 3, 4 (4 categories, where 4 = "4+")

Transportation Mode Aggregation¶

Group 17 detailed modes into broader categories:

aggregation_specs:
  transportation_mode:
    apply_to: ["tour_mode", "trip_mode"]
    mapping:
      1: "Auto - SOV"        # SOV_GP
      2: "Auto - SOV"        # SOV_PAY
      3: "Auto - Shared"     # SR2_GP
      4: "Auto - Shared"     # SR2_HOV
      5: "Auto - Shared"     # SR2_PAY
      6: "Auto - Shared"     # SR3_GP
      7: "Auto - Shared"     # SR3_HOV
      8: "Auto - Shared"     # SR3_PAY
      9: "Active"            # WALK
      10: "Active"           # BIKE
      11: "Transit"          # WLK_TRN
      12: "Transit"          # PNR_TRN
      13: "Transit"          # KNRPRV_TRN
      14: "Transit"          # KNRTNC_TRN
      15: "TNC/Taxi"         # TAXI
      16: "TNC/Taxi"         # TNC
      17: "School Bus"       # SCHLBUS

Usage:

- name: "tour_mode_choice_aggregated"
  data_source: "individual_tours"
  group_by: "tour_mode_agg"  # Uses transportation_mode aggregation

Output: 6 categories instead of 17

Aggregation vs. Binning¶

	Binning	Aggregation
Input	Continuous numeric	Categorical (codes/text)
Process	Create ranges	Map values to groups
Example	Distance → bins	17 modes → 6 groups
Suffix	`_bin`	`_agg`

Complete Examples¶

Example 1: Simple Distribution¶

Goal: Vehicle ownership distribution

- name: "auto_ownership_regional"
  description: "Regional vehicle ownership distribution"
  data_source: "households"
  group_by: "num_vehicles"
  weight_field: "sample_rate"
  count_name: "households"

Output:

num_vehicles,households,share,dataset
0,150000,0.25,2023 TM2.2 v05
1,200000,0.33,2023 TM2.2 v05
2,180000,0.30,2023 TM2.2 v05
3,72000,0.12,2023 TM2.2 v05

Example 2: Cross-Tabulation with Shares¶

Goal: Mode choice by tour purpose

- name: "tour_mode_by_purpose"
  description: "Tour mode by purpose cross-tabulation"
  data_source: "individual_tours"
  group_by: ["tour_purpose", "tour_mode"]
  weight_field: "sample_rate"
  count_name: "tours"
  share_within: "tour_purpose"

Output:

tour_purpose,tour_mode,tours,share,dataset
Work,1,80000,0.40,2023 TM2.2 v05
Work,9,30000,0.15,2023 TM2.2 v05
Work,11,60000,0.30,2023 TM2.2 v05
Shop,1,120000,0.60,2023 TM2.2 v05
Shop,9,80000,0.40,2023 TM2.2 v05

Shares add to 1.0 within each tour_purpose.

Example 3: Filtered + Binned¶

Goal: Work tour distance distribution (work tours only)

- name: "work_tour_distance"
  description: "Distance distribution for work tours"
  data_source: "individual_tours"
  filter: "tour_purpose == 'Work'"
  group_by: "tour_distance_bin"
  weight_field: "sample_rate"
  count_name: "tours"

Output: Distance distribution for work tours only (non-work filtered out)

Example 4: Multi-Level Grouping¶

Goal: Auto ownership by income and household size

- name: "auto_ownership_by_income_and_size"
  description: "Vehicle ownership by income and household size"
  data_source: "households"
  group_by: ["income_category", "num_persons", "num_vehicles"]
  weight_field: "sample_rate"
  count_name: "households"
  share_within: ["income_category", "num_persons"]

Output: Shares add to 1.0 within each (income, household size) combination

Example 5: Aggregated Categories¶

Goal: Simplified mode choice (Auto/Transit/Active/Other)

- name: "tour_mode_choice_aggregated"
  description: "Tour mode choice (aggregated categories)"
  data_source: "individual_tours"
  group_by: "tour_mode_agg"
  weight_field: "sample_rate"
  count_name: "tours"

Output:

tour_mode_agg,tours,share,dataset
Auto - SOV,450000,0.45,2023 TM2.2 v05
Auto - Shared,200000,0.20,2023 TM2.2 v05
Transit,250000,0.25,2023 TM2.2 v05
Active,80000,0.08,2023 TM2.2 v05
TNC/Taxi,20000,0.02,2023 TM2.2 v05

Example 6: Geographic Summary¶

Goal: Auto ownership by county

- name: "auto_ownership_by_county"
  description: "Vehicle ownership by county"
  data_source: "households"
  group_by: ["county_name", "num_vehicles"]
  weight_field: "sample_rate"
  count_name: "households"
  share_within: "county_name"

Output: Vehicle distribution for each Bay Area county

Common Patterns¶

Pattern 1: Regional Distribution¶

group_by: "variable"
# No share_within → regional shares

Pattern 2: Conditional Distribution¶

group_by: ["condition", "variable"]
share_within: "condition"
# Share of 'variable' within each 'condition'

Pattern 3: Filtered Subset¶

filter: "expression"
group_by: "variable"
# Distribution for filtered records only

Pattern 4: Multi-Dimensional¶

group_by: ["dim1", "dim2", "dim3"]
share_within: ["dim1", "dim2"]
# Share within each (dim1, dim2) combination

Troubleshooting¶

Empty Output¶

WARNING - Generated my_summary: 0 rows

Causes: - Filter removed all records - Missing column in data - Column has all null values

Solutions: 1. Check filter expression 2. Verify column exists: see Data Model 3. Check for typos in column names

Missing Columns¶

KeyError: 'tour_distance_bin'

Solution: Add binning spec to binning_specs section for tour_distance

Wrong Data Source¶

WARNING - Column 'tour_mode' not found

Cause: Using data_source: "households" but tour_mode is in "individual_tours"

Solution: Use correct data source table

Shares Don't Add to 1.0¶

Cause: Using wrong share_within grouping

Example:

group_by: ["county", "income", "vehicles"]
share_within: "county"  # Wrong - shares within county, not county×income

Fix:

share_within: ["county", "income"]  # Shares within each county×income combination

Best Practices¶

Use descriptive names - auto_ownership_by_income not summary_1
Add descriptions - Helps future users understand purpose
Start simple - Test with single dimension before adding complexity
Check shares - Verify they add to 1.0 within expected groups
Use aggregations - For matching external data categories (ACS uses 4+ households)
Leverage bins - Pre-configured bins for common variables
Filter early - Reduces processing time
Test incrementally - Add one feature at a time

Validation¶

Check your summary configuration:

python validate_config.py --config validation_config.yaml

Checks: - Required fields present - Data source valid - Group-by columns exist - Filter syntax correct - Binning specs defined

Next Steps¶

Generate Summaries - Run the summary generation
External Data Reference - Location of observed data files
Data Model Reference - See available columns
Configuration Guide - YAML documentation

Create Custom Summaries¶

Overview¶

Basic Summary Structure¶

Minimal Example¶

Required Fields¶

Optional Fields¶

Data Sources¶

Grouping Dimensions¶

Single Dimension¶

Multiple Dimensions¶

Cross-Tabulations¶

Share Calculations¶

Regional Shares (Default)¶

Shares Within Groups¶

Multiple Grouping Levels¶

Disable Share Calculation¶

Filtering Data¶

Simple Filters¶

Complex Filters¶

Filter Execution¶

Binning Continuous Variables¶

Define Bins Globally¶

Use Binned Columns¶

Binning Configuration¶

Pre-Configured Bins¶

Aggregating Categories¶

Define Aggregations Globally¶

Use Aggregated Columns¶

Transportation Mode Aggregation¶

Aggregation vs. Binning¶

Complete Examples¶

Example 1: Simple Distribution¶

Example 2: Cross-Tabulation with Shares¶

Example 3: Filtered + Binned¶

Example 4: Multi-Level Grouping¶

Example 5: Aggregated Categories¶

Example 6: Geographic Summary¶

Common Patterns¶

Pattern 1: Regional Distribution¶

Pattern 2: Conditional Distribution¶

Pattern 3: Filtered Subset¶

Pattern 4: Multi-Dimensional¶

Troubleshooting¶

Empty Output¶

Missing Columns¶

Wrong Data Source¶

Shares Don't Add to 1.0¶

Best Practices¶

Validation¶

Next Steps¶