# Read/Write

`processing.read_write.read_write`
Data input/output operations for the survey processing pipeline.
This module provides pipeline steps for loading canonical survey tables from files and writing them to various output formats.
## load_data

```python
load_data(
    input_paths: dict[str, str],
) -> dict[str, pl.DataFrame | gpd.GeoDataFrame]
```
Load canonical survey tables from file paths into memory.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `input_paths` | `dict[str, str]` | Dictionary mapping table names to file paths. Supported formats: CSV, TSV, Parquet, Shapefile, GeoJSON. | *required* |
Returns:

| Type | Description |
|---|---|
| `dict[str, pl.DataFrame \| gpd.GeoDataFrame]` | Dictionary of table names to DataFrames (`pl.DataFrame` or `gpd.GeoDataFrame`). Typical tables include `households`, `persons`, `days`, `unlinked_trips`, etc. |
Algorithm
- Iterate through each table name and file path in input_paths
- Validate that each file path exists, raising an error that identifies the first broken path component
- Load data based on file extension:
- .csv/.tsv → polars.read_csv()
- .parquet → polars.read_parquet()
- .shp/.shp.zip/.geojson → geopandas.read_file()
- Return dictionary of loaded tables
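The extension dispatch above can be sketched as follows. The helper name `pick_reader` is illustrative, not part of the module, and it returns the reader's dotted name rather than calling it so the sketch runs without polars or geopandas installed; the real step calls `pl.read_csv`, `pl.read_parquet`, or `gpd.read_file` directly.

```python
from pathlib import Path


def pick_reader(path: str) -> str:
    """Choose the reader load_data would use for this file extension."""
    p = Path(path)
    # ".shp.zip" is a two-part extension, so check it before plain suffixes
    if p.name.lower().endswith(".shp.zip"):
        return "geopandas.read_file"
    suffix = p.suffix.lower()
    if suffix in {".csv", ".tsv"}:
        return "polars.read_csv"
    if suffix == ".parquet":
        return "polars.read_parquet"
    if suffix in {".shp", ".geojson"}:
        return "geopandas.read_file"
    raise ValueError(f"Unsupported file format: {path}")
```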
Notes
- All CSV/Parquet files loaded as Polars DataFrames for performance
- Geospatial files loaded as GeoPandas GeoDataFrames
- Path validation helps diagnose configuration errors
## write_data

```python
write_data(
    output_paths: dict[str, str],
    canonical_data: CanonicalData,
    validate_input: bool,
    create_dirs: bool = True,
    enum_codebook_path: str | None = None,
) -> None
```
Write canonical survey tables to output file paths.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `output_paths` | `dict[str, str]` | Dictionary mapping table names to output file paths. | *required* |
| `canonical_data` | `CanonicalData` | `CanonicalData` instance containing DataFrames to write. | *required* |
| `validate_input` | `bool` | Whether to run validation before writing. | *required* |
| `create_dirs` | `bool` | Whether to create parent directories. | `True` |
| `enum_codebook_path` | `str \| None` | Optional path for an `.xlsx` enum codebook. When provided, a workbook is written with one worksheet per `LabeledEnum` found in the models for the written tables. Each sheet contains … | `None` |
Algorithm
- If validate_input=True, validate each table using canonical data models
- For each table in output_paths:
- Retrieve DataFrame from canonical_data
- Create parent directories if needed
- Write data based on file extension:
- .csv → DataFrame.write_csv()
- .parquet → DataFrame.write_parquet()
- .shp/.shp.zip/.geojson → GeoDataFrame.to_file()
- .txt → Path.write_text()
- If enum_codebook_path is set, discover all LabeledEnum types from the models for the written tables and write a codebook workbook.
- Log completion status
Notes
- Validation ensures output conforms to canonical data schemas
- Automatic directory creation prevents path errors
- Supports multiple output formats for flexibility
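The per-table write step can be sketched as below. `prepare_and_pick_writer` is an illustrative name, not the module's API, and it returns the writer's name instead of writing so it runs without polars or geopandas; the real step calls `DataFrame.write_csv`, `DataFrame.write_parquet`, `GeoDataFrame.to_file`, or `Path.write_text`.

```python
from pathlib import Path


def prepare_and_pick_writer(path: str, create_dirs: bool = True) -> str:
    """Optionally create parent directories, then choose the writer
    write_data would use for this file extension."""
    p = Path(path)
    if create_dirs:
        # prevents "No such file or directory" on first-run pipelines
        p.parent.mkdir(parents=True, exist_ok=True)
    # ".shp.zip" is a two-part extension, so check it before plain suffixes
    if p.name.lower().endswith(".shp.zip"):
        return "GeoDataFrame.to_file"
    suffix = p.suffix.lower()
    if suffix == ".csv":
        return "DataFrame.write_csv"
    if suffix == ".parquet":
        return "DataFrame.write_parquet"
    if suffix in {".shp", ".geojson"}:
        return "GeoDataFrame.to_file"
    if suffix == ".txt":
        return "Path.write_text"
    raise ValueError(f"Unsupported output format: {path}")
```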