
Imputation

processing.imputation

Data imputation for missing values in travel diary survey data.

This module provides configurable imputation using established statistical methods. It is designed as a pipeline step that operates on any combination of canonical survey tables (households, persons, days, trips, tours).

Supported methods

  • KNN -- single-column imputation via K-Nearest Neighbors similarity matching. Best for isolated missing fields where similar records exist.
  • Random Forest -- single-column imputation using a supervised RF model that auto-selects classifier vs. regressor based on the column type. Best for complex non-linear relationships or mixed feature types.
  • MICE -- multi-column imputation via Multiple Imputation by Chained Equations. Best for correlated variables (e.g. depart/arrive/duration).

Additional capabilities

  • Diagnostic flags -- optional boolean {column}_imputed columns that track which values were filled in.
  • Quality validation -- optional k-fold cross-validation that masks known values, re-imputes them, and reports accuracy / RMSE metrics.
  • Method comparison -- head-to-head benchmark of KNN, RF, and MICE on every imputed column using the same folds and feature sets.
  • Cross-table features -- join_tables pulls parent-table columns and auto-generates within-household mode features; aggregate_from pivots child rows up to a parent.

Typical pipeline position

steps:
  - name: load_data
  - name: custom_cleaning
  - name: imputation        # ← after cleaning, before linking
  - name: link_trips
  - name: joint_trips
  - name: extract_tours

Supported relationships for joining tables

The following parent → child relationships are supported for join_tables:

Child table      Parent table(s)                Join key(s)
persons          households                     hh_id
days             persons / households           person_id / hh_id
unlinked_trips   days / persons / households    day_id / person_id / hh_id
linked_trips     days / persons / households    day_id / person_id / hh_id
tours            persons / households           person_id / hh_id

Missing-data assumptions

  • KNN assumes similar records (by feature distance) share similar values.
  • MICE assumes Missing At Random (MAR): missingness may depend on observed values but not on the missing value itself.
  • If data is Missing Not At Random (MNAR), results may be biased.

Current limitations

  • No stratified imputation (no group_by option for within-group models).
  • No support for exogenous data sources (PUMS, land use data).
  • High-cardinality one-hot encoding can slow MICE convergence -- move ordinal/count variables to numeric_features to mitigate.

__all__ module-attribute

__all__ = ['imputation']

imputation

imputation(
    households: pl.DataFrame | None = None,
    persons: pl.DataFrame | None = None,
    days: pl.DataFrame | None = None,
    unlinked_trips: pl.DataFrame | None = None,
    linked_trips: pl.DataFrame | None = None,
    tours: pl.DataFrame | None = None,
    impute_columns: dict[str, list[dict[str, Any]]] | None = None,
    create_flags: bool = True,
    random_state: int | None = None,
    validate_imputation: dict[str, Any] | None = None,
) -> dict[str, pl.DataFrame]

Impute missing values using KNN, Random Forest, and/or MICE methods.

Each config block specifies its method (knn, rf, or mice) along with the method-specific parameters. Configs are grouped by method and executed in a fixed order (KNN → RF → MICE) across all tables so that later phases can benefit from values filled in earlier phases.

Handling Missing Values with Enum Labels

Survey data often uses special codes for missing values (e.g. 995 for "Missing Response", 999 for "Prefer not to answer"). Use enum member names (labels) rather than raw numeric values in the config:

missing_values: [MISSING, PNTA]   # enum labels, not 995/999

The module automatically:

  1. Maps the table name to the appropriate codebook module (e.g. households → data_canon.codebook.households).
  2. Finds the enum class whose canonical_field_name matches the target column (e.g. income_broad → IncomeBroad).
  3. Resolves enum member names to their values (e.g. MISSING → 995).
  4. Replaces those values with null before imputation.
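The steps above can be sketched in plain Python. The IncomeBroad enum and its codes here are illustrative stand-ins, not the real codebook definitions:

```python
from enum import IntEnum


class IncomeBroad(IntEnum):
    # Hypothetical codebook enum; real enums live in data_canon.codebook.*
    UNDER_25K = 1
    OVER_25K = 2
    MISSING = 995
    PNTA = 999


def resolve_missing_codes(enum_cls, labels):
    """Resolve enum member names (e.g. 'MISSING') to their numeric codes."""
    return {enum_cls[label].value for label in labels}


def null_out(values, codes):
    """Replace special missing codes with None before imputation."""
    return [None if v in codes else v for v in values]


codes = resolve_missing_codes(IncomeBroad, ["MISSING", "PNTA"])
cleaned = null_out([1, 995, 2, 999], codes)  # [1, None, 2, None]
```

Referring to labels rather than raw codes keeps configs readable and robust if the numeric codes ever change between survey waves.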

For MICE with multiple columns, missing_values can be a dict mapping each column to its own labels, or a single list applied to all columns:

# Per-column
missing_values:
  race: [MISSING]
  ethnicity: [MISSING, PNTA]

# Shared
missing_values: [MISSING, PNTA]   # applied to all columns

Cross-Table Features

By default only features from the same table are used. Adding join_tables to a config block pulls columns from parent tables via left-join on known foreign keys, which can significantly improve quality.

Behaviour:

  1. Columns from the specified parent table(s) are joined onto the child table (e.g. households columns joined onto persons via hh_id).
  2. For each target column a hh_mode_{column} feature is auto-generated — the mode of that column among other household members (exclude-self). This captures within-household correlation (e.g. siblings sharing race/ethnicity).
  3. Auto-generated hh_mode_* columns are appended to categorical_features automatically.
  4. After imputation all joined/aggregated columns are stripped; the output schema is unchanged.

Example

impute_columns:
  persons:
    - method: knn
      column: gender
      n_neighbors: 5
      join_tables: [households]
      categorical_features: [age, employment, income_bin, residence_type]
      #                                       ^^^^^^^^^^  ^^^^^^^^^^^^^^
      #                                       columns from the households table
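The exclude-self household mode (behaviour step 2) can be sketched with the standard library; the function name and list-based layout are illustrative, not the module's internals:

```python
from collections import Counter


def hh_mode_exclude_self(values, idx):
    """Mode of `values` among household members other than row `idx`.

    Returns None when the member lives alone or all other values are missing.
    """
    others = [v for i, v in enumerate(values) if i != idx and v is not None]
    if not others:
        return None
    return Counter(others).most_common(1)[0][0]


# One household: three members, two share ethnicity code 3
household = [3, 3, 7]
features = [hh_mode_exclude_self(household, i) for i in range(len(household))]
# member 0 sees [3, 7], member 1 sees [3, 7], member 2 sees [3, 3]
```

Excluding the member's own row prevents the feature from leaking the very value being imputed.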

Child-to-Parent Aggregation

The aggregate_from config is the reverse of join_tables: it aggregates child rows up to a parent table. Useful when imputing parent-level fields that depend on household composition (e.g. predicting household income from the employment/education mix of its members).

For each child table and each field listed under pivot_count, the module groups child rows by the parent's FK and creates one column per unique value, counting occurrences. Generated columns are named {child_table}_count_{field}_{value} and are automatically added to numeric_features. After imputation, all generated columns are stripped.

Example

impute_columns:
  households:
    - method: mice
      columns: [income_bin]
      aggregate_from:
        persons:
          pivot_count: [employment, education, student]
      categorical_features: [residence_type, residence_rent_own]
      max_iter: 10
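The pivot-count step can be sketched on plain dicts. This is a minimal illustration of the {child_table}_count_{field}_{value} naming convention, assuming hh_id as the foreign key; the real implementation operates on Polars DataFrames:

```python
from collections import Counter, defaultdict


def pivot_count(child_rows, fk, field, child_table="persons"):
    """Count occurrences of each value of `field` per parent key.

    Returns {parent_key: {generated_column_name: count}} using the
    {child_table}_count_{field}_{value} naming convention.
    """
    counts = defaultdict(Counter)
    for row in child_rows:
        if row[field] is not None:
            counts[row[fk]][row[field]] += 1
    return {
        key: {f"{child_table}_count_{field}_{v}": n for v, n in c.items()}
        for key, c in counts.items()
    }


persons = [
    {"hh_id": 1, "employment": "full_time"},
    {"hh_id": 1, "employment": "full_time"},
    {"hh_id": 1, "employment": "student"},
    {"hh_id": 2, "employment": "retired"},
]
wide = pivot_count(persons, "hh_id", "employment")
```

Each generated column is a simple occurrence count, which is why they are added to numeric_features rather than categorical_features.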

Parameters:

Name Type Description Default
households pl.DataFrame | None

Households table (optional).

None
persons pl.DataFrame | None

Persons table (optional).

None
days pl.DataFrame | None

Days table (optional).

None
unlinked_trips pl.DataFrame | None

Unlinked trips table (optional).

None
linked_trips pl.DataFrame | None

Linked trips table (optional).

None
tours pl.DataFrame | None

Tours table (optional).

None
impute_columns dict[str, list[dict[str, Any]]] | None

Dict mapping table names to list of imputation configs. Every config dict must include a method key (knn, rf, or mice). The remaining keys are method-specific:

KNN (method: knn):

  • column: Column name to impute.
  • missing_values: Enum labels to treat as missing.
  • n_neighbors: Number of neighbors (default: 5).
  • neighbor_weights: 'uniform' or 'distance' (default: 'distance').
  • numeric_features: Numeric feature columns.
  • categorical_features: Categorical feature columns.
  • join_tables: Parent tables to left-join for extra features.
  • aggregate_from: Child-to-parent pivot-count config.

Random Forest (method: rf):

  • column: Column name to impute.
  • missing_values: Enum labels to treat as missing.
  • n_estimators: Number of trees (default: 100).
  • max_depth: Maximum tree depth (default: None, unlimited).
  • numeric_features: Numeric feature columns.
  • categorical_features: Categorical feature columns.
  • join_tables: Parent tables to left-join for extra features.
  • aggregate_from: Child-to-parent pivot-count config.

MICE (method: mice):

  • columns: Column names to impute together.
  • missing_values: Dict mapping column → enum labels, or a single list applied to all columns.
  • max_iter: Maximum iterations (default: 10).
  • numeric_features: Numeric feature columns.
  • categorical_features: Categorical feature columns.
  • join_tables: Parent tables to left-join for extra features.
  • aggregate_from: Child-to-parent pivot-count config.

At least one of numeric_features or categorical_features is required in every config block.

None
create_flags bool

Whether to create {column}_imputed boolean flag columns (default: True).

True
random_state int | None

Random seed for reproducibility across all imputation.

None
validate_imputation dict[str, Any] | None

Optional validation config with keys:

  • enabled: Whether to run validation (default: False).
  • n_folds: Number of k-folds (default: 5).
  • sample_pct: Percentage of non-missing values to test (default: 5.0).
  • output_path: Path to save validation or comparison CSV.
  • compare_methods: When True, run all three methods (KNN, RF, MICE) against every column instead of validating only the configured method (default: False).
None

Returns:

Type Description
dict[str, pl.DataFrame]

Dictionary of imputed tables. When validation is enabled, an extra key _validation_summary contains a Polars DataFrame with columns: table, variable, method, type, n_samples, n_folds, accuracy, precision, recall, f1, rmse, mae, r2.

When compare_methods is True, an extra key _method_comparison contains a Polars DataFrame comparing KNN, RF, and MICE for every imputed column.

Example config


impute_columns:
  households:
    - method: knn
      column: income_broad
      missing_values: [MISSING, PNTA]
      n_neighbors: 5
      neighbor_weights: distance
      numeric_features: [num_persons, num_vehicles, num_workers]
  persons:
    - method: knn
      column: gender
      missing_values: [MISSING]
      n_neighbors: 5
      join_tables: [households]
      numeric_features: [age]
      categorical_features: [relationship, employment, income_bin]
    - method: rf
      column: education
      missing_values: [MISSING]
      n_estimators: 200
      max_depth: 15
      numeric_features: [age]
      categorical_features: [employment, occupation]
    - method: mice
      columns: [race, ethnicity]
      missing_values:
        race: [MISSING]
        ethnicity: [MISSING, PNTA]
      join_tables: [households]
      max_iter: 10
      numeric_features: [age]
random_state: 42
create_flags: true
validate_imputation:
  enabled: true
  n_folds: 5
  sample_pct: 5.0

processing.imputation.generic_impute

Generic imputation step using KNN, Random Forest, and MICE methods.

imputation

imputation(
    households: pl.DataFrame | None = None,
    persons: pl.DataFrame | None = None,
    days: pl.DataFrame | None = None,
    unlinked_trips: pl.DataFrame | None = None,
    linked_trips: pl.DataFrame | None = None,
    tours: pl.DataFrame | None = None,
    impute_columns: dict[str, list[dict[str, Any]]] | None = None,
    create_flags: bool = True,
    random_state: int | None = None,
    validate_imputation: dict[str, Any] | None = None,
) -> dict[str, pl.DataFrame]

Impute missing values using KNN, Random Forest, and/or MICE methods. This is the same function documented above under processing.imputation; see that section for full parameter, configuration, and return details.

processing.imputation.knn

KNN-based imputation for missing values.

impute_knn

impute_knn(
    df: pl.DataFrame,
    column: str,
    n_neighbors: int = 5,
    neighbor_weights: Literal["uniform", "distance"] = "distance",
    numeric_features: list[str] | None = None,
    categorical_features: list[str] | None = None,
) -> tuple[pl.DataFrame, dict[str, Any]]

Impute missing values in a single column using K-Nearest Neighbors.

Best for: single columns with isolated missing values where similar records exist in the dataset.

How it works:

  1. Build a feature matrix from numeric_features (used as-is) and categorical_features (one-hot encoded for distance calculation).
  2. Non-contiguous integer codes (e.g. enum values 1, 2, 3, 995, 999) are automatically encoded to dense 0..N codes so they don't distort distance calculations, then decoded back after imputation.
  3. For each row with a missing value, find the K most similar records based on Euclidean distance across all features.
  4. Impute the missing value using the weighted average (or mode for categoricals) of the K neighbours.

neighbor_weights='distance' weights closer neighbours more heavily; neighbor_weights='uniform' treats all K neighbours equally.
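The neighbour step for a categorical target can be sketched without scikit-learn. This is a simplified single-row illustration (no one-hot encoding or dense recoding), and the helper name is hypothetical:

```python
import math
from collections import defaultdict


def knn_impute_one(features, target, missing_idx, n_neighbors=5,
                   neighbor_weights="distance"):
    """Impute target[missing_idx] from the K nearest complete rows.

    `features` is a list of numeric vectors; the categorical target is
    filled with the (distance-)weighted mode of the K neighbours.
    """
    query = features[missing_idx]
    candidates = [
        (math.dist(query, features[i]), target[i])
        for i in range(len(target))
        if i != missing_idx and target[i] is not None
    ]
    candidates.sort(key=lambda t: t[0])
    votes = defaultdict(float)
    for dist, value in candidates[:n_neighbors]:
        # 'distance' weighting favours closer neighbours; 'uniform' does not
        w = 1.0 / (dist + 1e-9) if neighbor_weights == "distance" else 1.0
        votes[value] += w
    return max(votes, key=votes.get)


feats = [[30, 1], [31, 1], [29, 1], [60, 0], [30, 1]]
modes = ["car", "car", "bike", "walk", None]
filled = knn_impute_one(feats, modes, missing_idx=4, n_neighbors=3)  # "car"
```

With distance weighting, the exact-match row at [30, 1] dominates the vote, which is usually the desired behaviour for survey records.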

Example use cases:

  • Missing trip mode when other trip attributes are known.
  • Missing person age when household/demographic info is available.
  • Missing trip distance when other spatial/temporal features exist.

Performance: O(n log n) complexity; scales well to medium-large datasets.

Parameters:

Name Type Description Default
df pl.DataFrame

DataFrame containing the column to impute.

required
column str

Name of the column to impute.

required
n_neighbors int

Number of similar records to use (default: 5).

5
neighbor_weights Literal['uniform', 'distance']

'distance' or 'uniform' (default: 'distance').

'distance'
numeric_features list[str] | None

Numeric/continuous feature columns. Used as-is.

None
categorical_features list[str] | None

Categorical feature columns. One-hot encoded into binary columns for distance calculation.

None

Returns:

Type Description
tuple[pl.DataFrame, dict[str, Any]]

Tuple of (imputed DataFrame, stats dict). The stats dict contains n_missing, n_imputed, and pct_imputed.

processing.imputation.random_forest

Random Forest imputation for missing values.

impute_random_forest

impute_random_forest(
    df: pl.DataFrame,
    column: str,
    n_estimators: int = 100,
    max_depth: int | None = None,
    random_state: int | None = None,
    numeric_features: list[str] | None = None,
    categorical_features: list[str] | None = None,
) -> tuple[pl.DataFrame, dict[str, Any]]

Impute missing values in a single column using Random Forest.

Best for: single columns with complex non-linear relationships or mixed feature types where KNN may struggle with decision boundaries.

How it works:

  1. Split rows into known (have a value) and missing (need imputation).
  2. Train a Random Forest model on the known rows using all features.
  3. Automatically select RandomForestClassifier for categorical targets (integer / string dtypes) or RandomForestRegressor for continuous targets (float dtypes).
  4. Predict missing values using the trained model.
  5. NaN values in features are filled with column medians before training.

Non-contiguous integer codes (e.g. enum values 1, 2, 3, 995, 999) are automatically encoded to dense 0..N codes so they don't distort the model, then decoded back after prediction.
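The classifier-vs-regressor auto-selection (step 3) amounts to a dispatch on the target column's dtype. A sketch of that rule, using illustrative Polars dtype names (the real check inspects actual dtype objects):

```python
def select_rf_model(dtype):
    """Pick the estimator kind from the target column's dtype.

    Integer/string targets -> classification; float targets -> regression.
    Dtype strings here are illustrative; the real check uses Polars dtypes.
    """
    if dtype in {"Int8", "Int16", "Int32", "Int64", "String", "Categorical"}:
        return "RandomForestClassifier"
    if dtype in {"Float32", "Float64"}:
        return "RandomForestRegressor"
    raise ValueError(f"unsupported dtype for RF imputation: {dtype}")
```

Treating integer columns as categorical is a sensible default for survey data, where integers are almost always enum codes rather than true counts.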

Example use cases:

  • Missing education level when employment, occupation, and age are available.
  • Missing income category with many mixed-type predictors.
  • Cases where KNN struggles with non-linear decision boundaries.

Performance: trains on known values only; handles mixed types well but can be memory-intensive with many trees.

Parameters:

Name Type Description Default
df pl.DataFrame

DataFrame containing the column to impute.

required
column str

Name of the column to impute.

required
n_estimators int

Number of trees in the forest (default: 100).

100
max_depth int | None

Maximum tree depth (default: None = unlimited).

None
random_state int | None

Random seed for reproducibility.

None
numeric_features list[str] | None

Numeric/continuous feature columns.

None
categorical_features list[str] | None

Categorical feature columns (one-hot encoded).

None

Returns:

Type Description
tuple[pl.DataFrame, dict[str, Any]]

Tuple of (imputed DataFrame, stats dict). The stats dict contains n_missing, n_imputed, and pct_imputed.

processing.imputation.mice

MICE-based imputation for missing values.

impute_mice

impute_mice(
    df: pl.DataFrame,
    columns: list[str],
    max_iter: int = 10,
    random_state: int | None = None,
    numeric_features: list[str] | None = None,
    categorical_features: list[str] | None = None,
    verbose: bool = True,
) -> tuple[pl.DataFrame, dict[str, Any]]

Impute missing values in multiple correlated columns using MICE.

Best for: multiple correlated columns with missing values (e.g. depart_hour / arrive_hour / duration, or race / ethnicity).

MICE (Multiple Imputation by Chained Equations) imputes several variables together, preserving their joint distribution.

How it works:

  1. Initialise missing values with simple imputation (mean/mode).
  2. For each column with missing values:

    1. Treat it as the target variable.
    2. Use the other columns as predictors in a regression model.
    3. Predict and update missing values.
  3. Repeat iteratively until convergence (max_iter rounds).

Categorical integer columns (e.g. enum codes 1-6) are automatically encoded to dense 0..N codes before imputation and decoded back afterwards. String columns are auto-encoded to integers for the MICE model and decoded to original labels after imputation.
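The dense encode/decode round trip can be sketched as follows; function names are illustrative, not the module's API:

```python
def dense_encode(values):
    """Map non-contiguous codes (e.g. 1, 2, 3, 995) to dense 0..N codes."""
    uniques = sorted({v for v in values if v is not None})
    to_dense = {v: i for i, v in enumerate(uniques)}
    encoded = [None if v is None else to_dense[v] for v in values]
    return encoded, uniques


def dense_decode(encoded, uniques):
    """Invert dense_encode after the model has filled in values."""
    return [None if v is None else uniques[v] for v in encoded]


enc, mapping = dense_encode([1, 3, None, 995])  # [0, 1, None, 2]
roundtrip = dense_decode(enc, mapping)          # [1, 3, None, 995]
```

Without this recoding, a stray 995 sitting next to codes 1-3 would dominate any distance or regression computation.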

Assumes Missing At Random (MAR): missingness may depend on observed values but not on the missing value itself. If data is Missing Not At Random (MNAR), results may be biased.

Example use cases:

  • Time fields (depart_hour, arrive_hour, duration) — highly correlated.
  • Spatial coordinates (origin_lat, origin_lon) — spatially correlated.
  • Socio-demographic variables (income, education, employment) — often correlated.

Performance: iterative, can be slow for many columns or large datasets.

Parameters:

Name Type Description Default
df pl.DataFrame

DataFrame containing the columns to impute.

required
columns list[str]

Column names to impute together.

required
max_iter int

Maximum number of imputation rounds (default: 10).

10
random_state int | None

Random seed for reproducibility.

None
numeric_features list[str] | None

Numeric/continuous feature columns.

None
categorical_features list[str] | None

Categorical feature columns (one-hot encoded).

None
verbose bool

Whether to log progress during imputation.

True

Returns:

Type Description
tuple[pl.DataFrame, dict[str, Any]]

Tuple of (imputed DataFrame, stats dict). The stats dict is keyed by column name, each entry containing n_missing, n_imputed, and pct_imputed.

processing.imputation.comparison

Head-to-head comparison of imputation methods via k-fold cross-validation.

For each imputed column, every supported method (KNN, RF, MICE) is evaluated using the same k-fold splits and the same feature set. The result is a summary DataFrame that makes it easy to pick the best method for each field.

compare_imputation_methods

compare_imputation_methods(
    impute_columns: dict[str, list[dict[str, Any]]],
    tables: dict[str, pl.DataFrame],
    n_folds: int = 5,
    sample_pct: float = 5.0,
    random_state: int | None = None,
    output_path: str | None = None,
) -> pl.DataFrame

Run k-fold validation for every method on every imputed column.

For each unique (table, column) found in impute_columns, KNN, RF, and MICE are each evaluated using the same enrichment and feature set. This produces a comparison table that helps choose the best method per field.

Parameters:

Name Type Description Default
impute_columns dict[str, list[dict[str, Any]]]

The same config dict passed to imputation().

required
tables dict[str, pl.DataFrame]

Dict of canonical DataFrames (already cleaned).

required
n_folds int

Number of cross-validation folds.

5
sample_pct float

Percentage of non-missing values to test (0-100).

5.0
random_state int | None

Random seed for reproducibility.

None
output_path str | None

Optional path to save the comparison CSV.

None

Returns:

Type Description
pl.DataFrame

Polars DataFrame with columns: table, variable, method, type, n_samples, n_folds, accuracy, precision, recall, f1, rmse, mae, r2.

processing.imputation.flags

Diagnostic flag columns for tracking imputed values.

When create_flags=True (the default), the imputation step creates a boolean column for every imputed field::

{column}_imputed   - True if the value was filled in, False otherwise

Examples: mode_imputed, distance_imputed, age_imputed.

Use cases:

  • Quality control: identify records that contain imputed values.
  • Sensitivity analysis: compare results with vs. without imputed records.
  • Downstream modelling: include imputation status as a feature.
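The flag logic reduces to a null-before, non-null-after comparison. A minimal sketch on plain lists (the real functions compare Polars columns):

```python
def imputed_flags(original, imputed):
    """True where a value was filled in: null before, non-null after."""
    return [o is None and i is not None for o, i in zip(original, imputed)]


flags = imputed_flags([5, None, None, 7], [5, 3, None, 7])
# [False, True, False, False] -- only the second value was actually filled
```

Note that a value left null by the imputer (e.g. no usable neighbours) is correctly flagged False.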

create_flag_columns

create_flag_columns(
    df: pl.DataFrame, original_df: pl.DataFrame, columns: list[str]
) -> pl.DataFrame

Create boolean flag columns for multiple imputed columns.

Parameters:

Name Type Description Default
df pl.DataFrame

DataFrame with imputed values

required
original_df pl.DataFrame

Original DataFrame before imputation

required
columns list[str]

List of column names that were imputed

required

Returns:

Type Description
pl.DataFrame

DataFrame with added flag columns named '{column}_imputed'

create_flag_column

create_flag_column(
    df: pl.DataFrame, original_df: pl.DataFrame, column: str
) -> pl.DataFrame

Create a boolean flag column indicating which values were imputed.

Parameters:

Name Type Description Default
df pl.DataFrame

DataFrame with imputed values

required
original_df pl.DataFrame

Original DataFrame before imputation

required
column str

Name of the column that was imputed

required

Returns:

Type Description
pl.DataFrame

DataFrame with added flag column named '{column}_imputed'

processing.imputation.validation

K-fold cross-validation for imputation quality assessment.

Optional validation that assesses how accurate the imputation is by:

  1. Sampling a percentage of non-missing values (user-configurable, e.g. 5%).
  2. Artificially masking those values (setting them to null).
  3. Imputing them using k-fold cross-validation.
  4. Comparing imputed vs. actual values.
  5. Computing and logging quality metrics.

Metrics by data type:

  • Categorical columns (e.g. mode, purpose): Accuracy, Precision, Recall, F1-Score.
  • Continuous columns (e.g. distance, duration): RMSE, MAE, R².
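The mask-and-score loop above can be sketched for a categorical column; mask_and_score and the trivial imputer are hypothetical illustrations, not the module's API:

```python
import random


def mask_and_score(values, impute_fn, sample_pct=5.0, seed=42):
    """Mask a sample of known values, re-impute, and report accuracy."""
    rng = random.Random(seed)
    known = [i for i, v in enumerate(values) if v is not None]
    n_test = max(1, round(len(known) * sample_pct / 100))
    test_idx = set(rng.sample(known, n_test))
    masked = [None if i in test_idx else v for i, v in enumerate(values)]
    predictions = impute_fn(masked)
    hits = sum(predictions[i] == values[i] for i in test_idx)
    return hits / n_test


# A column where the overall mode is always correct, so accuracy is 1.0:
values = ["car"] * 20
acc = mask_and_score(values, lambda col: ["car"] * len(col))  # 1.0
```

The real validation repeats this across k folds and reports the continuous-column metrics (RMSE, MAE, R²) the same way.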

Configuration::

random_state: 42
validate_imputation:
  enabled: true
  n_folds: 5          # number of CV folds (default: 5)
  sample_pct: 5.0     # % of complete values to test (default: 5%)

Example output::

============================================================
Imputation Validation Results
============================================================

Column: mode (categorical, n=250 test samples)
  Accuracy:  0.876
  Precision: 0.883
  Recall:    0.876
  F1-Score:  0.872

Column: distance (continuous, n=250 test samples)
  RMSE: 2.34
  MAE:  1.82
  R²:   0.721
============================================================

Validation uses the same enrichment (joins, aggregations) as the real imputation pipeline, so metrics reflect the full feature set. Note that validation adds computational overhead (k folds cost roughly k× the imputation time); it is recommended during development and testing, and optional in production.

validate_knn_imputation

validate_knn_imputation(
    df: pl.DataFrame,
    column: str,
    n_folds: int,
    sample_pct: float,
    n_neighbors: int,
    neighbor_weights: Literal["uniform", "distance"],
    random_state: int | None = None,
    numeric_features: list[str] | None = None,
    categorical_features: list[str] | None = None,
) -> dict[str, Any]

Validate KNN imputation quality using k-fold cross-validation.

validate_mice_imputation

validate_mice_imputation(
    df: pl.DataFrame,
    columns: list[str],
    n_folds: int,
    sample_pct: float,
    max_iter: int,
    random_state: int | None = None,
    numeric_features: list[str] | None = None,
    categorical_features: list[str] | None = None,
) -> dict[str, dict[str, Any]]

Validate MICE imputation quality using k-fold cross-validation.

validate_rf_imputation

validate_rf_imputation(
    df: pl.DataFrame,
    column: str,
    n_folds: int,
    sample_pct: float,
    n_estimators: int = 100,
    max_depth: int | None = None,
    random_state: int | None = None,
    numeric_features: list[str] | None = None,
    categorical_features: list[str] | None = None,
) -> dict[str, Any]

Validate Random Forest imputation quality using k-fold cross-validation.

log_validation_results

log_validation_results(
    metrics: dict[str, Any] | dict[str, dict[str, Any]],
) -> None

Log validation metrics in a readable format.

Parameters:

Name Type Description Default
metrics dict[str, Any] | dict[str, dict[str, Any]]

Dictionary of metrics (single column or per-column dict)

required