Extract Tours

processing.tours.extraction

Tour building module for travel diary survey processing.

This module implements a hierarchical tour extraction algorithm that processes linked trip data to identify and classify tours and subtours based on spatial and temporal patterns.

Algorithm

The tour building process follows a seven-phase pipeline:

1. Location Classification

Calculates haversine distances from trip endpoints to known locations (home, work, school) using person-specific coordinates
Classifies each trip origin/destination as HOME, WORK, SCHOOL, or OTHER based on configurable distance thresholds
Only matches work/school locations if person has those locations defined
Adds boolean flags: o_is_home, d_is_home, o_is_work, d_is_work, etc.
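
As an illustration of this phase, the sketch below computes a haversine distance and classifies a point against a person's known locations. It uses plain Python rather than Polars for readability. The helper names (haversine_m, classify_location), the check order (home, then work, then school), and the inclusive threshold comparison are assumptions, not the module's actual implementation.

```python
import math

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius in meters


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def classify_location(point, anchors, thresholds):
    """Return "HOME", "WORK", "SCHOOL", or "OTHER" for a (lat, lon) point.

    anchors maps a location type to (lat, lon), or to None when the person
    has no such location defined; thresholds maps a location type to meters.
    """
    for loc_type in ("home", "work", "school"):  # check order is an assumption
        anchor = anchors.get(loc_type)
        if anchor is None:
            continue  # never match a location the person does not have
        if haversine_m(*point, *anchor) <= thresholds[loc_type]:
            return loc_type.upper()
    return "OTHER"
```

Boolean flags such as o_is_home then follow from comparing the returned label for the trip origin with "HOME".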

2. Home-Based Tour Identification

Sorts trips by person, day, and departure time
Identifies tour boundaries by detecting:
- Departures from home (o_is_home=True, d_is_home=False)
- Returns to home (o_is_home=False, d_is_home=True)
- Day boundaries (first trip of person-day)
Assigns sequential tour IDs within each person-day
Format: tour_id = (day_id * 100) + tour_sequence_number
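
The boundary rules and ID format above can be sketched for a single person-day as follows. This is a simplified plain-Python illustration, not the module's Polars implementation. The function name assign_tour_ids is hypothetical, and it assumes sequence numbers start at 1 and that the trip following a return home opens a new tour.

```python
def assign_tour_ids(trips, day_id):
    """Assign tour_id to the time-sorted trips of one person-day.

    Each trip is a dict with boolean o_is_home and d_is_home flags.
    """
    tour_seq = 0
    prev_returned_home = True  # the first trip of the day is always a boundary
    out = []
    for trip in trips:
        departs_home = trip["o_is_home"] and not trip["d_is_home"]
        if departs_home or prev_returned_home:
            tour_seq += 1  # start a new tour
        out.append({**trip, "tour_id": day_id * 100 + tour_seq})
        prev_returned_home = (not trip["o_is_home"]) and trip["d_is_home"]
    return out
```

Note that the `(day_id * 100)` format leaves room for at most 99 tours per person-day before IDs would collide with the next day_id.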

3. Anchor Period Expansion (CRITICAL for subtours)

For tours visiting usual anchor locations (work, school), expands the "at anchor" period by finding first arrival and last departure
Uses pure Polars window functions to identify anchor periods
Prevents subtours from being detected during travel to/from anchor
Generalizable: supports work, school, or future anchor types
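
A minimal sketch of the expansion idea, in plain Python rather than Polars window functions: within one tour, find the first trip that arrives at the anchor and the last trip that departs from it, then flag the trips strictly between them. The name mark_anchor_period, the in_anchor_period flag, and the strict-inequality boundaries are assumptions made for illustration.

```python
def mark_anchor_period(trips, anchor="work"):
    """Flag trips of one tour that fall inside the expanded anchor period.

    Each trip is a dict with boolean o_is_<anchor> and d_is_<anchor> flags,
    and the list is sorted by departure time.
    """
    o_key, d_key = f"o_is_{anchor}", f"d_is_{anchor}"
    arrivals = [i for i, t in enumerate(trips) if t[d_key]]
    departures = [i for i, t in enumerate(trips) if t[o_key]]
    if not arrivals or not departures:
        # The tour never reaches (or never leaves) the anchor: no period.
        return [{**t, "in_anchor_period": False} for t in trips]
    first_arrival, last_departure = arrivals[0], departures[-1]
    return [
        {**t, "in_anchor_period": first_arrival < i < last_departure}
        for i, t in enumerate(trips)
    ]
```

Under these assumptions the commute to the anchor and the journey home fall outside the period, so only trips made while the person is based at the anchor remain candidates for subtours.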

4. Anchor-Based Subtour Detection

Within expanded anchor periods, identifies subtours by detecting:
- Departures from anchor (o_at_anchor=True, d_at_anchor=False)
- Returns to anchor (o_at_anchor=False, d_at_anchor=True)
Assigns hierarchical subtour IDs
Format: subtour_id = (tour_id * 10) + subtour_sequence_number
Currently supports work-based subtours, extensible to school-based
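
The detection rule and ID format can be illustrated with the sketch below, which walks the trips of one tour in time order. It is a plain-Python approximation under stated assumptions: assign_subtour_ids and the in_anchor_period flag are hypothetical names, sequence numbers start at 1, and trips outside any subtour receive None.

```python
def assign_subtour_ids(trips, tour_id):
    """Assign subtour_id to the time-sorted trips of one tour.

    Each trip is a dict with boolean in_anchor_period, o_at_anchor,
    and d_at_anchor flags.
    """
    seq = 0
    open_subtour = False
    out = []
    for t in trips:
        subtour_id = None
        if t["in_anchor_period"]:
            if t["o_at_anchor"] and not t["d_at_anchor"]:
                seq += 1  # departure from anchor opens a subtour
                open_subtour = True
            if open_subtour:
                subtour_id = tour_id * 10 + seq
            if not t["o_at_anchor"] and t["d_at_anchor"]:
                open_subtour = False  # return to anchor closes it
        out.append({**t, "subtour_id": subtour_id})
    return out
```

As with tour IDs, the `(tour_id * 10)` format implies at most nine subtours per tour.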

5. Tour Attribute Aggregation

Groups trips by tour_id (and subtour_id for subtours)
Computes tour-level attributes from constituent trips:
- tour_purpose: Highest priority destination purpose (person-category specific hierarchy)
- tour_mode: Highest priority travel mode (from configurable mode hierarchy)
- origin_depart_time: First trip's departure time
- dest_arrive_time: Last trip's arrival time
- trip_count: Number of trips in tour
- stop_count: Number of intermediate stops (trip_count - 1)
Assigns half-tour classification:
- "outbound": Trips before primary destination
- "inbound": Trips after primary destination
- "subtour": Work-based subtour trips
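
To make the aggregation concrete, here is a small plain-Python sketch that derives the mode, timing, and count attributes for one tour. The function name aggregate_tour and the trip key "mode" are assumptions. Mode priority follows the "higher index = higher priority" convention documented for mode_hierarchy. Purpose selection and half-tour classification are omitted.

```python
def aggregate_tour(trips, mode_hierarchy):
    """Compute tour-level attributes from the time-sorted trips of one tour.

    Each trip is a dict with "mode", "depart_time", and "arrive_time".
    """
    # Later entries in mode_hierarchy take priority over earlier ones.
    rank = {mode: i for i, mode in enumerate(mode_hierarchy)}
    return {
        "tour_mode": max((t["mode"] for t in trips), key=rank.__getitem__),
        "origin_depart_time": trips[0]["depart_time"],
        "dest_arrive_time": trips[-1]["arrive_time"],
        "trip_count": len(trips),
        "stop_count": len(trips) - 1,
    }
```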

6. Joint Tour Identification

If joint_trips data is provided, identifies tours in which every trip involves the same group of travelers
Assigns joint_tour_id to tours with stable participant groups
Links tour-level joint travel to trip-level joint travel
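
The "stable participant group" test might look like the sketch below. It assumes each trip's participants are available as a collection of person IDs, and that a group must contain at least two travelers to count as joint. Both points, and the name stable_participant_group, are illustrative assumptions rather than the module's documented behavior.

```python
def stable_participant_group(trip_participants):
    """Return the shared participant group of a tour, or None.

    trip_participants is a list with one collection of person IDs per trip.
    A tour qualifies only if every trip has the same set of 2+ travelers.
    """
    groups = {frozenset(p) for p in trip_participants}
    if len(groups) != 1:
        return None  # no trips, or the group changes between trips
    group = next(iter(groups))
    return group if len(group) >= 2 else None
```

Tours for which this returns a group could then share one joint_tour_id across all participants.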

7. Tour Validation and Correction

Validates tour structure consistency
Corrects data quality issues (e.g., inconsistent timing, missing values)
Adds tour_id and joint_tour_id to unlinked_trips for reference

Edge Case Handling

Incomplete tours (no return home at end of day)
Multi-day tours (spanning survey boundaries)
Missing work/school locations (null coordinates)
Non-sequential trip chains (spatial gaps)

Notes

Hierarchical tour structure: Home-based tours → Work-based subtours
Location classification is robust to GPS/geocoding errors via distance thresholds
Tour purpose reflects primary activity, not intermediate stops
Extensible design allows future additions (school-based subtours, other anchor types)

extract_tours

extract_tours(
    persons: pl.DataFrame,
    households: pl.DataFrame,
    unlinked_trips: pl.DataFrame,
    linked_trips: pl.DataFrame,
    joint_trips: pl.DataFrame | None = None,
    **kwargs: dict[str, Any]
) -> dict[str, pl.DataFrame]

Extract hierarchical tour structures from linked trip data.

Builds tour and subtour structures from linked trip sequences using spatial and temporal patterns. See module docstring for complete algorithm description.

Parameters:

Name	Type	Description	Default
`persons`	`pl.DataFrame`	Person attributes including work/school locations. Used to identify anchor locations for tour/subtour detection.	required
`households`	`pl.DataFrame`	Household attributes including home locations. Home location is primary anchor for tour identification.	required
`unlinked_trips`	`pl.DataFrame`	Individual trip segments. Will receive tour_id assignment.	required
`linked_trips`	`pl.DataFrame`	Journey records with coordinates and timing. Required columns: person_id, day_id, o_lon, o_lat, d_lon, d_lat, depart_time, arrive_time.	required
`joint_trips`	`pl.DataFrame \| None`	Optional joint trip aggregations. If provided, enables joint tour identification based on stable participant groups.	`None`
`**kwargs`	`dict[str, Any]`	Configuration parameters for TourConfig. distance_thresholds: dict of location type → distance threshold in meters (default: {"home": 100, "work": 200, "school": 200}); mode_hierarchy: list giving mode priority for tour mode assignment, where a higher index means higher priority; purpose_hierarchy: dict giving purpose priority by person type, mapping person categories to ordered purpose lists; person_category_expression: Polars expression that classifies persons into categories (e.g., worker, student).	`{}`
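
A configuration passed through `**kwargs` might look like the fragment below. Only the distance_thresholds values come from the documented defaults. The mode labels, person categories, and purpose labels are hypothetical placeholders, and the ordering direction of the purpose lists is not stated here, so treat that part as an assumption. person_category_expression is left out because it requires a Polars expression.

```python
# Hypothetical TourConfig overrides; only distance_thresholds mirrors the
# documented defaults. All mode, category, and purpose labels are invented.
tour_config = {
    "distance_thresholds": {"home": 100, "work": 200, "school": 200},  # meters
    # Higher index = higher priority, so "transit" outranks "walk" here.
    "mode_hierarchy": ["walk", "bike", "drive", "transit"],
    # Person category -> ordered purpose list (ordering direction assumed).
    "purpose_hierarchy": {
        "worker": ["work", "school", "other"],
        "student": ["school", "work", "other"],
    },
}

# The dictionary would then be unpacked into the call, for example:
# result = extract_tours(persons, households, unlinked_trips, linked_trips, **tour_config)
```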

Returns:

Type	Description
`dict[str, pl.DataFrame]`	Dictionary containing: unlinked_trips: Original unlinked trips with tour_id, joint_tour_id linked_trips: Trips with tour_id, subtour_id, half_tour, joint_tour_id tours: Aggregated tour records with purpose, mode, timing, trip counts, and joint_tour_id