# Read/Write

`processing.read_write.read_write`
Data input/output operations for the survey processing pipeline.
This module provides pipeline steps for loading canonical survey tables from files and writing them to various output formats.
## load_data

```python
load_data(
    input_paths: dict[str, str],
) -> dict[str, pl.DataFrame | gpd.GeoDataFrame]
```
Load canonical survey tables from file paths into memory.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `input_paths` | `dict[str, str]` | Dictionary mapping table names to file paths. Supported formats: CSV, TSV, Parquet, Shapefile, GeoJSON. | *required* |
Returns:

| Type | Description |
|---|---|
| `dict[str, pl.DataFrame \| gpd.GeoDataFrame]` | Dictionary of table names to DataFrames (`pl.DataFrame` or `gpd.GeoDataFrame`). Typical tables include `households`, `persons`, `days`, `unlinked_trips`, etc. |
Algorithm
- Iterate through each table name and file path in input_paths
- Validate that each file path exists, raising an error that identifies the first broken path component
- Load data based on file extension:
- .csv/.tsv → polars.read_csv()
- .parquet → polars.read_parquet()
- .shp/.shp.zip/.geojson → geopandas.read_file()
- Return dictionary of loaded tables
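The extension dispatch above can be sketched as follows. The helper name `pick_reader` is illustrative, not part of the module, and it returns the reader's dotted name rather than calling it so the sketch runs without polars or geopandas installed; the real step calls `pl.read_csv`, `pl.read_parquet`, or `gpd.read_file` directly.

```python
from pathlib import Path


def pick_reader(path: str) -> str:
    """Choose the reader load_data would use for this file extension."""
    p = Path(path)
    # ".shp.zip" is a two-part extension, so check it before plain suffixes
    if p.name.lower().endswith(".shp.zip"):
        return "geopandas.read_file"
    suffix = p.suffix.lower()
    if suffix in {".csv", ".tsv"}:
        return "polars.read_csv"
    if suffix == ".parquet":
        return "polars.read_parquet"
    if suffix in {".shp", ".geojson"}:
        return "geopandas.read_file"
    raise ValueError(f"Unsupported file format: {path}")
```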
Notes
- All CSV/Parquet files loaded as Polars DataFrames for performance
- Geospatial files loaded as GeoPandas GeoDataFrames
- Path validation helps diagnose configuration errors
## write_data

```python
write_data(
    output_paths: dict[str, str],
    canonical_data: CanonicalData,
    validate_input: bool,
    create_dirs: bool = True,
    enum_codebook_path: str | None = None,
) -> None
```
Write canonical survey tables to output file paths.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `output_paths` | `dict[str, str]` | Dictionary mapping table names to output file paths. | *required* |
| `canonical_data` | `CanonicalData` | `CanonicalData` instance containing DataFrames to write. | *required* |
| `validate_input` | `bool` | Whether to run validation before writing. | *required* |
| `create_dirs` | `bool` | Whether to create parent directories. | `True` |
| `enum_codebook_path` | `str \| None` | Optional path for an `.xlsx` enum codebook. When provided, a workbook is written with one worksheet per `LabeledEnum` found in the models for the written tables. Each sheet contains … | `None` |
Algorithm
- If validate_input=True, validate each table using canonical data models
- For each table in output_paths:
- Retrieve DataFrame from canonical_data
- Create parent directories if needed
- Write data based on file extension:
- .csv → DataFrame.write_csv()
- .parquet → DataFrame.write_parquet()
- .shp/.shp.zip/.geojson → GeoDataFrame.to_file()
- .txt → Path.write_text()
- If enum_codebook_path is set, discover all LabeledEnum types from the models for the written tables and write a codebook workbook.
- Log completion status
Notes
- Validation ensures output conforms to canonical data schemas
- Automatic directory creation prevents path errors
- Supports multiple output formats for flexibility
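The per-table write step can be sketched as below. `prepare_and_pick_writer` is an illustrative name, not the module's API, and it returns the writer's name instead of writing so it runs without polars or geopandas; the real step calls `DataFrame.write_csv`, `DataFrame.write_parquet`, `GeoDataFrame.to_file`, or `Path.write_text`.

```python
from pathlib import Path


def prepare_and_pick_writer(path: str, create_dirs: bool = True) -> str:
    """Optionally create parent directories, then choose the writer
    write_data would use for this file extension."""
    p = Path(path)
    if create_dirs:
        # prevents "No such file or directory" on first-run pipelines
        p.parent.mkdir(parents=True, exist_ok=True)
    # ".shp.zip" is a two-part extension, so check it before plain suffixes
    if p.name.lower().endswith(".shp.zip"):
        return "GeoDataFrame.to_file"
    suffix = p.suffix.lower()
    if suffix == ".csv":
        return "DataFrame.write_csv"
    if suffix == ".parquet":
        return "DataFrame.write_parquet"
    if suffix in {".shp", ".geojson"}:
        return "GeoDataFrame.to_file"
    if suffix == ".txt":
        return "Path.write_text"
    raise ValueError(f"Unsupported output format: {path}")
```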