Read/Write

processing.read_write.read_write

Data input/output operations for the survey processing pipeline.

This module provides pipeline steps for loading canonical survey tables from files and writing them to various output formats.

load_data

load_data(
    input_paths: dict[str, str],
) -> dict[str, pl.DataFrame | gpd.GeoDataFrame]

Load canonical survey tables from file paths into memory.

Parameters:

  • input_paths (dict[str, str], required): Dictionary mapping table names to file paths. Supported formats: CSV, TSV, Parquet, Shapefile, GeoJSON.

Returns:

  • dict[str, pl.DataFrame | gpd.GeoDataFrame]: Dictionary of table names to DataFrames (pl.DataFrame or gpd.GeoDataFrame). Typical tables include households, persons, days, unlinked_trips, etc.

Algorithm
  1. Iterate through each table name and file path in input_paths
  2. Validate that the file path exists; if it does not, report an error message that identifies the broken path component
  3. Load data based on file extension:
    • .csv/.tsv → polars.read_csv()
    • .parquet → polars.read_parquet()
    • .shp/.shp.zip/.geojson → geopandas.read_file()
  4. Return dictionary of loaded tables
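
The extension dispatch in step 3 can be sketched roughly as follows. This is a minimal illustration, not the module's actual code: `READERS` and `reader_for` are hypothetical names, and the helper only reports which reader would be chosen, so it runs without Polars or GeoPandas installed. It matches on the full file name so that the compound suffix .shp.zip is recognized.

```python
from pathlib import Path

# Hypothetical mapping, taken from the extensions listed in step 3.
READERS = {
    (".csv", ".tsv"): "polars.read_csv",
    (".parquet",): "polars.read_parquet",
    (".shp", ".shp.zip", ".geojson"): "geopandas.read_file",
}


def reader_for(path: str) -> str:
    """Return the name of the reader that would handle `path`."""
    name = Path(path).name.lower()
    for suffixes, reader in READERS.items():
        # str.endswith accepts a tuple, so ".shp.zip" is matched as a whole.
        if name.endswith(suffixes):
            return reader
    raise ValueError(f"Unsupported file extension: {path}")


# Example input_paths dictionary (file names are made up for illustration).
chosen = {
    table: reader_for(p)
    for table, p in {
        "households": "data/households.csv",
        "zones": "data/zones.shp.zip",
    }.items()
}
```

A real loader would call the selected function instead of returning its name. For .tsv files, polars.read_csv would presumably need separator="\t"; the source does not say how the module handles this.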
Notes
  • CSV, TSV, and Parquet files are loaded as Polars DataFrames for performance
  • Geospatial files loaded as GeoPandas GeoDataFrames
  • Path validation helps diagnose configuration errors
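
One way to find the "broken path component" mentioned above is to walk the path from the top down and stop at the first part that does not exist. The sketch below uses only pathlib; `first_missing_component` and `check_path` are hypothetical helpers, and the error message wording is an assumption rather than the module's actual output.

```python
from __future__ import annotations

import tempfile
from pathlib import Path


def first_missing_component(path: str) -> Path | None:
    """Return the shallowest part of `path` that does not exist, or None."""
    target = Path(path)
    # Path.parents lists ancestors deepest-first; reverse to walk top-down.
    for candidate in [*reversed(target.parents), target]:
        if not candidate.exists():
            return candidate
    return None


def check_path(table: str, path: str) -> None:
    """Raise FileNotFoundError naming the first missing component (assumed wording)."""
    broken = first_missing_component(path)
    if broken is not None:
        raise FileNotFoundError(
            f"Table '{table}': cannot find {path!r}; "
            f"first missing component is {str(broken)!r}"
        )


# Demonstration: the temporary directory exists, but "missing_dir" does not.
with tempfile.TemporaryDirectory() as tmp:
    bad_path = Path(tmp) / "missing_dir" / "households.csv"
    broken = first_missing_component(str(bad_path))
    existing = first_missing_component(tmp)
```

Reporting the shallowest missing component, rather than only the full path, points directly at a mistyped directory name in the configuration.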

write_data

write_data(
    output_paths: dict[str, str],
    canonical_data: CanonicalData,
    validate_input: bool,
    create_dirs: bool = True,
    enum_codebook_path: str | None = None,
) -> None

Write canonical survey tables to output file paths.

Parameters:

  • output_paths (dict[str, str], required): Dictionary mapping table names to output file paths.
  • canonical_data (CanonicalData, required): CanonicalData instance containing DataFrames to write.
  • validate_input (bool, required): Whether to run validation before writing.
  • create_dirs (bool, default: True): Whether to create parent directories.
  • enum_codebook_path (str | None, default: None): Optional path for an .xlsx enum codebook. When provided, a workbook is written with one worksheet per LabeledEnum found in the models for the written tables. Each sheet contains Value, Label, and Value Label columns. Must end in .xlsx.

Algorithm
  1. If validate_input=True, validate each table using canonical data models
  2. For each table in output_paths:
    • Retrieve DataFrame from canonical_data
    • Create parent directories if needed
    • Write data based on file extension:
      • .csv → DataFrame.write_csv()
      • .parquet → DataFrame.write_parquet()
      • .shp/.shp.zip/.geojson → GeoDataFrame.to_file()
      • .txt → Path.write_text()
  3. If enum_codebook_path is set, discover all LabeledEnum types from the models for the written tables and write a codebook workbook.
  4. Log completion status
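
The per-table part of step 2 might look something like the sketch below. It is an illustration under stated assumptions, not the module's code: `write_table` is a hypothetical helper, it relies on duck typing (write_csv and write_parquet are polars.DataFrame methods, to_file is a geopandas.GeoDataFrame method), and treating .txt content as a plain string is a guess, since the source only names Path.write_text().

```python
import tempfile
from pathlib import Path


def write_table(table, path: str, create_dirs: bool = True) -> None:
    """Write one table to `path`, choosing the writer from the file name."""
    out = Path(path)
    if create_dirs:
        # Create any missing parent directories; no error if they exist.
        out.parent.mkdir(parents=True, exist_ok=True)

    name = out.name.lower()
    if name.endswith(".csv"):
        table.write_csv(out)        # polars.DataFrame method
    elif name.endswith(".parquet"):
        table.write_parquet(out)    # polars.DataFrame method
    elif name.endswith((".shp", ".shp.zip", ".geojson")):
        table.to_file(out)          # geopandas.GeoDataFrame method
    elif name.endswith(".txt"):
        out.write_text(str(table))  # assumption: content is plain text
    else:
        raise ValueError(f"Unsupported output extension: {path}")


# Demonstration with the .txt branch, which needs only the standard library.
with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "new" / "nested" / "notes.txt"
    write_table("run complete", str(target))
    written = target.read_text()
```

With create_dirs=False, the same call would fail if the parent directory were missing, which is the situation the default setting avoids.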
Notes
  • Validation ensures output conforms to canonical data schemas
  • Automatic directory creation (create_dirs=True) prevents errors from missing parent directories
  • Supports CSV, Parquet, Shapefile, GeoJSON, and plain-text output