
Imputation

processing.imputation

Data imputation for missing values in travel diary survey data.

This module provides configurable imputation using established statistical methods. It is designed as a pipeline step that operates on any combination of canonical survey tables (households, persons, days, trips, tours).

Supported methods

  • KNN -- single-column imputation via K-Nearest Neighbors similarity matching. Best for isolated missing fields where similar records exist.
  • Random Forest -- single-column imputation using a supervised RF model that auto-selects classifier vs. regressor based on the column type. Best for complex non-linear relationships or mixed feature types.
  • MICE -- multi-column imputation via Multiple Imputation by Chained Equations. Best for correlated variables (e.g. depart/arrive/duration).

Additional capabilities

  • Diagnostic flags -- optional boolean {column}_imputed columns that track which values were filled in.
  • Quality validation -- optional k-fold cross-validation that masks known values, re-imputes them, and reports accuracy / RMSE metrics.
  • Method comparison -- head-to-head benchmark of KNN, RF, and MICE on every imputed column using the same folds and feature sets.
  • Cross-table features -- join_tables pulls parent-table columns and auto-generates within-household mode features; aggregate_from pivots child rows up to a parent.

Typical pipeline position

steps:
  - name: load_data
  - name: custom_cleaning
  - name: imputation        # ← after cleaning, before linking
  - name: link_trips
  - name: joint_trips
  - name: extract_tours

Supported relationships for joining tables

The following parent → child relationships are supported for join_tables:

Child table      Parent table(s)                Join key(s)
persons          households                     hh_id
days             persons / households           person_id / hh_id
unlinked_trips   days / persons / households    day_id / person_id / hh_id
linked_trips     days / persons / households    day_id / person_id / hh_id
tours            persons / households           person_id / hh_id

Missing-data assumptions

  • KNN assumes similar records (by feature distance) share similar values.
  • MICE assumes Missing At Random (MAR): missingness may depend on observed values but not on the missing value itself.
  • If data is Missing Not At Random (MNAR), results may be biased.

Current limitations

  • No stratified imputation (no group_by option for within-group models).
  • No support for exogenous data sources (PUMS, land use data).
  • High-cardinality one-hot encoding can slow MICE convergence -- move ordinal/count variables to numeric_features to mitigate.

__all__ module-attribute

__all__ = ['imputation']

imputation

imputation(
    households: pl.DataFrame | None = None,
    persons: pl.DataFrame | None = None,
    days: pl.DataFrame | None = None,
    unlinked_trips: pl.DataFrame | None = None,
    linked_trips: pl.DataFrame | None = None,
    tours: pl.DataFrame | None = None,
    impute_columns: dict[str, list[dict[str, Any]]] | None = None,
    create_flags: bool = True,
    random_state: int | None = None,
    validate_imputation: dict[str, Any] | None = None,
) -> dict[str, pl.DataFrame]

Impute missing values using KNN, Random Forest, and/or MICE methods.

Each config block specifies its method (knn, rf, or mice) along with the method-specific parameters. Configs are grouped by method and executed in a fixed order (KNN → RF → MICE) across all tables so that later phases can benefit from values filled in earlier phases.

Handling Missing Values with Enum Labels

Survey data often uses special codes for missing values (e.g. 995 for "Missing Response", 999 for "Prefer not to answer"). Use enum member names (labels) rather than raw numeric values in the config:

missing_values: [MISSING, PNTA]   # enum labels, not 995/999

The module automatically:

  1. Maps the table name to the appropriate codebook module (e.g. households → data_canon.codebook.households).
  2. Finds the enum class whose canonical_field_name matches the target column (e.g. income_broad → IncomeBroad).
  3. Resolves enum member names to their values (e.g. MISSING → 995).
  4. Replaces those values with null before imputation.
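The steps above can be sketched in plain Python. The IncomeBroad enum and its codes here are illustrative stand-ins, not the real codebook definitions:

```python
from enum import IntEnum


class IncomeBroad(IntEnum):
    # Hypothetical codebook enum; real enums live in data_canon.codebook.*
    UNDER_25K = 1
    OVER_25K = 2
    MISSING = 995
    PNTA = 999


def resolve_missing_codes(enum_cls, labels):
    """Resolve enum member names (e.g. 'MISSING') to their numeric codes."""
    return {enum_cls[label].value for label in labels}


def null_out(values, codes):
    """Replace special missing codes with None before imputation."""
    return [None if v in codes else v for v in values]


codes = resolve_missing_codes(IncomeBroad, ["MISSING", "PNTA"])
cleaned = null_out([1, 995, 2, 999], codes)  # [1, None, 2, None]
```

Referring to labels rather than raw codes keeps configs readable and robust if the numeric codes ever change between survey waves.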

For MICE with multiple columns, missing_values can be a dict mapping each column to its own labels, or a single list applied to all columns:

# Per-column
missing_values:
  race: [MISSING]
  ethnicity: [MISSING, PNTA]

# Shared
missing_values: [MISSING, PNTA]   # applied to all columns

Cross-Table Features

By default only features from the same table are used. Adding join_tables to a config block pulls columns from parent tables via left-join on known foreign keys, which can significantly improve quality.

Behaviour:

  1. Columns from the specified parent table(s) are joined onto the child table (e.g. households columns joined onto persons via hh_id).
  2. For each target column a hh_mode_{column} feature is auto-generated — the mode of that column among other household members (exclude-self). This captures within-household correlation (e.g. siblings sharing race/ethnicity).
  3. Auto-generated hh_mode_* columns are appended to categorical_features automatically.
  4. After imputation all joined/aggregated columns are stripped; the output schema is unchanged.

Example

impute_columns:
  persons:
    - method: knn
      column: gender
      n_neighbors: 5
      join_tables: [households]
      categorical_features: [age, employment, income_bin, residence_type]
      #                                       ^^^^^^^^^^  ^^^^^^^^^^^^^^
      #                                       columns from the households table
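The exclude-self household mode (behaviour step 2) can be sketched with the standard library; the function name and list-based layout are illustrative, not the module's internals:

```python
from collections import Counter


def hh_mode_exclude_self(values, idx):
    """Mode of `values` among household members other than row `idx`.

    Returns None when the member lives alone or all other values are missing.
    """
    others = [v for i, v in enumerate(values) if i != idx and v is not None]
    if not others:
        return None
    return Counter(others).most_common(1)[0][0]


# One household: three members, two share ethnicity code 3
household = [3, 3, 7]
features = [hh_mode_exclude_self(household, i) for i in range(len(household))]
# member 0 sees [3, 7], member 1 sees [3, 7], member 2 sees [3, 3]
```

Excluding the member's own row prevents the feature from leaking the very value being imputed.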

Child-to-Parent Aggregation

The aggregate_from config is the reverse of join_tables: it aggregates child rows up to a parent table. Useful when imputing parent-level fields that depend on household composition (e.g. predicting household income from the employment/education mix of its members).

For each child table and each field listed under pivot_count, the module groups child rows by the parent's FK and creates one column per unique value, counting occurrences. Generated columns are named {child_table}_count_{field}_{value} and are automatically added to numeric_features. After imputation, all generated columns are stripped.

Example

impute_columns:
  households:
    - method: mice
      columns: [income_bin]
      aggregate_from:
        persons:
          pivot_count: [employment, education, student]
      categorical_features: [residence_type, residence_rent_own]
      max_iter: 10
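The pivot-count step can be sketched on plain dicts. This is a minimal illustration of the {child_table}_count_{field}_{value} naming convention, assuming hh_id as the foreign key; the real implementation operates on Polars DataFrames:

```python
from collections import Counter, defaultdict


def pivot_count(child_rows, fk, field, child_table="persons"):
    """Count occurrences of each value of `field` per parent key.

    Returns {parent_key: {generated_column_name: count}} using the
    {child_table}_count_{field}_{value} naming convention.
    """
    counts = defaultdict(Counter)
    for row in child_rows:
        if row[field] is not None:
            counts[row[fk]][row[field]] += 1
    return {
        key: {f"{child_table}_count_{field}_{v}": n for v, n in c.items()}
        for key, c in counts.items()
    }


persons = [
    {"hh_id": 1, "employment": "full_time"},
    {"hh_id": 1, "employment": "full_time"},
    {"hh_id": 1, "employment": "student"},
    {"hh_id": 2, "employment": "retired"},
]
wide = pivot_count(persons, "hh_id", "employment")
```

Each generated column is a simple occurrence count, which is why they are added to numeric_features rather than categorical_features.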

Parameters:

Name Type Description Default
households pl.DataFrame | None

Households table (optional).

None
persons pl.DataFrame | None

Persons table (optional).

None
days pl.DataFrame | None

Days table (optional).

None
unlinked_trips pl.DataFrame | None

Unlinked trips table (optional).

None
linked_trips pl.DataFrame | None

Linked trips table (optional).

None
tours pl.DataFrame | None

Tours table (optional).

None
impute_columns dict[str, list[dict[str, Any]]] | None

Dict mapping table names to list of imputation configs. Every config dict must include a method key (knn, rf, or mice). The remaining keys are method-specific:

KNN (method: knn):

  • column: Column name to impute.
  • missing_values: Enum labels to treat as missing.
  • n_neighbors: Number of neighbors (default: 5).
  • neighbor_weights: 'uniform' or 'distance' (default: 'distance').
  • numeric_features: Numeric feature columns.
  • categorical_features: Categorical feature columns.
  • join_tables: Parent tables to left-join for extra features.
  • aggregate_from: Child-to-parent pivot-count config.

Random Forest (method: rf):

  • column: Column name to impute.
  • missing_values: Enum labels to treat as missing.
  • n_estimators: Number of trees (default: 100).
  • max_depth: Maximum tree depth (default: None, unlimited).
  • numeric_features: Numeric feature columns.
  • categorical_features: Categorical feature columns.
  • join_tables: Parent tables to left-join for extra features.
  • aggregate_from: Child-to-parent pivot-count config.

MICE (method: mice):

  • columns: Column names to impute together.
  • missing_values: Dict mapping column → enum labels, or a single list applied to all columns.
  • max_iter: Maximum iterations (default: 10).
  • numeric_features: Numeric feature columns.
  • categorical_features: Categorical feature columns.
  • join_tables: Parent tables to left-join for extra features.
  • aggregate_from: Child-to-parent pivot-count config.

At least one of numeric_features or categorical_features is required in every config block.

None
create_flags bool

Whether to create {column}_imputed boolean flag columns (default: True).

True
random_state int | None

Random seed for reproducibility across all imputation.

None
validate_imputation dict[str, Any] | None

Optional validation config with keys:

  • enabled: Whether to run validation (default: False).
  • n_folds: Number of k-folds (default: 5).
  • sample_pct: Percentage of non-missing values to test (default: 5.0).
  • output_path: Path to save validation or comparison CSV.
  • compare_methods: When True, run all three methods (KNN, RF, MICE) against every column instead of validating only the configured method (default: False).
None

Returns:

Type Description
dict[str, pl.DataFrame]

Dictionary of imputed tables. When validation is enabled, an extra key _validation_summary contains a Polars DataFrame with columns: table, variable, method, type, n_samples, n_folds, accuracy, precision, recall, f1, rmse, mae, r2.

When compare_methods is True, an extra key _method_comparison contains a Polars DataFrame comparing KNN, RF, and MICE for every imputed column.

Example config


impute_columns:
  households:
    - method: knn
      column: income_broad
      missing_values: [MISSING, PNTA]
      n_neighbors: 5
      neighbor_weights: distance
      numeric_features: [num_persons, num_vehicles, num_workers]
  persons:
    - method: knn
      column: gender
      missing_values: [MISSING]
      n_neighbors: 5
      join_tables: [households]
      numeric_features: [age]
      categorical_features: [relationship, employment, income_bin]
    - method: rf
      column: education
      missing_values: [MISSING]
      n_estimators: 200
      max_depth: 15
      numeric_features: [age]
      categorical_features: [employment, occupation]
    - method: mice
      columns: [race, ethnicity]
      missing_values:
        race: [MISSING]
        ethnicity: [MISSING, PNTA]
      join_tables: [households]
      max_iter: 10
      numeric_features: [age]
random_state: 42
create_flags: true
validate_imputation:
  enabled: true
  n_folds: 5
  sample_pct: 5.0

processing.imputation.generic_impute

Generic imputation step using KNN, Random Forest, and MICE methods.

imputation

imputation(
    households: pl.DataFrame | None = None,
    persons: pl.DataFrame | None = None,
    days: pl.DataFrame | None = None,
    unlinked_trips: pl.DataFrame | None = None,
    linked_trips: pl.DataFrame | None = None,
    tours: pl.DataFrame | None = None,
    impute_columns: dict[str, list[dict[str, Any]]] | None = None,
    create_flags: bool = True,
    random_state: int | None = None,
    validate_imputation: dict[str, Any] | None = None,
) -> dict[str, pl.DataFrame]

Impute missing values using KNN, Random Forest, and/or MICE methods. This is the same function documented above under processing.imputation; see that section for full parameter, configuration, and return details.

processing.imputation.knn

KNN-based imputation for missing values.

impute_knn

impute_knn(
    df: pl.DataFrame,
    column: str,
    n_neighbors: int = 5,
    neighbor_weights: Literal["uniform", "distance"] = "distance",
    numeric_features: list[str] | None = None,
    categorical_features: list[str] | None = None,
) -> tuple[pl.DataFrame, dict[str, Any]]

Impute missing values in a single column using K-Nearest Neighbors.

Best for: single columns with isolated missing values where similar records exist in the dataset.

How it works:

  1. Build a feature matrix from numeric_features (used as-is) and categorical_features (one-hot encoded for distance calculation).
  2. Non-contiguous integer codes (e.g. enum values 1, 2, 3, 995, 999) are automatically encoded to dense 0..N codes so they don't distort distance calculations, then decoded back after imputation.
  3. For each row with a missing value, find the K most similar records based on Euclidean distance across all features.
  4. Impute the missing value using the weighted average (or mode for categoricals) of the K neighbours.

neighbor_weights='distance' weights closer neighbours more heavily; neighbor_weights='uniform' treats all K neighbours equally.
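The neighbour step for a categorical target can be sketched without scikit-learn. This is a simplified single-row illustration (no one-hot encoding or dense recoding), and the helper name is hypothetical:

```python
import math
from collections import defaultdict


def knn_impute_one(features, target, missing_idx, n_neighbors=5,
                   neighbor_weights="distance"):
    """Impute target[missing_idx] from the K nearest complete rows.

    `features` is a list of numeric vectors; the categorical target is
    filled with the (distance-)weighted mode of the K neighbours.
    """
    query = features[missing_idx]
    candidates = [
        (math.dist(query, features[i]), target[i])
        for i in range(len(target))
        if i != missing_idx and target[i] is not None
    ]
    candidates.sort(key=lambda t: t[0])
    votes = defaultdict(float)
    for dist, value in candidates[:n_neighbors]:
        # 'distance' weighting favours closer neighbours; 'uniform' does not
        w = 1.0 / (dist + 1e-9) if neighbor_weights == "distance" else 1.0
        votes[value] += w
    return max(votes, key=votes.get)


feats = [[30, 1], [31, 1], [29, 1], [60, 0], [30, 1]]
modes = ["car", "car", "bike", "walk", None]
filled = knn_impute_one(feats, modes, missing_idx=4, n_neighbors=3)  # "car"
```

With distance weighting, the exact-match row at [30, 1] dominates the vote, which is usually the desired behaviour for survey records.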

Example use cases:

  • Missing trip mode when other trip attributes are known.
  • Missing person age when household/demographic info is available.
  • Missing trip distance when other spatial/temporal features exist.

Performance: O(n log n) complexity; scales well to medium-large datasets.

Parameters:

Name Type Description Default
df pl.DataFrame

DataFrame containing the column to impute.

required
column str

Name of the column to impute.

required
n_neighbors int

Number of similar records to use (default: 5).

5
neighbor_weights Literal['uniform', 'distance']

'distance' or 'uniform' (default: 'distance').

'distance'
numeric_features list[str] | None

Numeric/continuous feature columns. Used as-is.

None
categorical_features list[str] | None

Categorical feature columns. One-hot encoded into binary columns for distance calculation.

None

Returns:

Type Description
tuple[pl.DataFrame, dict[str, Any]]

Tuple of (imputed DataFrame, stats dict). The stats dict contains n_missing, n_imputed, and pct_imputed.

processing.imputation.random_forest

Random Forest imputation for missing values.

impute_random_forest

impute_random_forest(
    df: pl.DataFrame,
    column: str,
    n_estimators: int = 100,
    max_depth: int | None = None,
    random_state: int | None = None,
    numeric_features: list[str] | None = None,
    categorical_features: list[str] | None = None,
) -> tuple[pl.DataFrame, dict[str, Any]]

Impute missing values in a single column using Random Forest.

Best for: single columns with complex non-linear relationships or mixed feature types where KNN may struggle with decision boundaries.

How it works:

  1. Split rows into known (have a value) and missing (need imputation).
  2. Train a Random Forest model on the known rows using all features.
  3. Automatically select RandomForestClassifier for categorical targets (integer / string dtypes) or RandomForestRegressor for continuous targets (float dtypes).
  4. Predict missing values using the trained model.
  5. NaN values in features are filled with column medians before training.

Non-contiguous integer codes (e.g. enum values 1, 2, 3, 995, 999) are automatically encoded to dense 0..N codes so they don't distort the model, then decoded back after prediction.
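The classifier-vs-regressor auto-selection (step 3) amounts to a dispatch on the target column's dtype. A sketch of that rule, using illustrative Polars dtype names (the real check inspects actual dtype objects):

```python
def select_rf_model(dtype):
    """Pick the estimator kind from the target column's dtype.

    Integer/string targets -> classification; float targets -> regression.
    Dtype strings here are illustrative; the real check uses Polars dtypes.
    """
    if dtype in {"Int8", "Int16", "Int32", "Int64", "String", "Categorical"}:
        return "RandomForestClassifier"
    if dtype in {"Float32", "Float64"}:
        return "RandomForestRegressor"
    raise ValueError(f"unsupported dtype for RF imputation: {dtype}")
```

Treating integer columns as categorical is a sensible default for survey data, where integers are almost always enum codes rather than true counts.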

Example use cases:

  • Missing education level when employment, occupation, and age are available.
  • Missing income category with many mixed-type predictors.
  • Cases where KNN struggles with non-linear decision boundaries.

Performance: trains on known values only; handles mixed types well but can be memory-intensive with many trees.

Parameters:

Name Type Description Default
df pl.DataFrame

DataFrame containing the column to impute.

required
column str

Name of the column to impute.

required
n_estimators int

Number of trees in the forest (default: 100).

100
max_depth int | None

Maximum tree depth (default: None = unlimited).

None
random_state int | None

Random seed for reproducibility.

None
numeric_features list[str] | None

Numeric/continuous feature columns.

None
categorical_features list[str] | None

Categorical feature columns (one-hot encoded).

None

Returns:

Type Description
tuple[pl.DataFrame, dict[str, Any]]

Tuple of (imputed DataFrame, stats dict). The stats dict contains n_missing, n_imputed, and pct_imputed.

processing.imputation.mice

MICE-based imputation for missing values.

impute_mice

impute_mice(
    df: pl.DataFrame,
    columns: list[str],
    max_iter: int = 10,
    random_state: int | None = None,
    numeric_features: list[str] | None = None,
    categorical_features: list[str] | None = None,
    verbose: bool = True,
) -> tuple[pl.DataFrame, dict[str, Any]]

Impute missing values in multiple correlated columns using MICE.

Best for: multiple correlated columns with missing values (e.g. depart_hour / arrive_hour / duration, or race / ethnicity).

MICE (Multiple Imputation by Chained Equations) imputes several variables together, preserving their joint distribution.

How it works:

  1. Initialise missing values with simple imputation (mean/mode).
  2. For each column with missing values:

    1. Treat it as the target variable.
    2. Use the other columns as predictors in a regression model.
    3. Predict and update missing values.
  3. Repeat iteratively until convergence (max_iter rounds).

Categorical integer columns (e.g. enum codes 1-6) are automatically encoded to dense 0..N codes before imputation and decoded back afterwards. String columns are auto-encoded to integers for the MICE model and decoded to original labels after imputation.
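The dense encode/decode round trip can be sketched as follows; function names are illustrative, not the module's API:

```python
def dense_encode(values):
    """Map non-contiguous codes (e.g. 1, 2, 3, 995) to dense 0..N codes."""
    uniques = sorted({v for v in values if v is not None})
    to_dense = {v: i for i, v in enumerate(uniques)}
    encoded = [None if v is None else to_dense[v] for v in values]
    return encoded, uniques


def dense_decode(encoded, uniques):
    """Invert dense_encode after the model has filled in values."""
    return [None if v is None else uniques[v] for v in encoded]


enc, mapping = dense_encode([1, 3, None, 995])  # [0, 1, None, 2]
roundtrip = dense_decode(enc, mapping)          # [1, 3, None, 995]
```

Without this recoding, a stray 995 sitting next to codes 1-3 would dominate any distance or regression computation.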

Assumes Missing At Random (MAR): missingness may depend on observed values but not on the missing value itself. If data is Missing Not At Random (MNAR), results may be biased.

Example use cases:

  • Time fields (depart_hour, arrive_hour, duration) — highly correlated.
  • Spatial coordinates (origin_lat, origin_lon) — spatially correlated.
  • Socio-demographic variables (income, education, employment) — often correlated.

Performance: iterative, can be slow for many columns or large datasets.

Parameters:

Name Type Description Default
df pl.DataFrame

DataFrame containing the columns to impute.

required
columns list[str]

Column names to impute together.

required
max_iter int

Maximum number of imputation rounds (default: 10).

10
random_state int | None

Random seed for reproducibility.

None
numeric_features list[str] | None

Numeric/continuous feature columns.

None
categorical_features list[str] | None

Categorical feature columns (one-hot encoded).

None
verbose bool

Whether to log progress during imputation.

True

Returns:

Type Description
tuple[pl.DataFrame, dict[str, Any]]

Tuple of (imputed DataFrame, stats dict). The stats dict is keyed by column name, each entry containing n_missing, n_imputed, and pct_imputed.

processing.imputation.comparison

Head-to-head comparison of imputation methods via k-fold cross-validation.

For each imputed column, every supported method (KNN, RF, MICE) is evaluated using the same k-fold splits and the same feature set. The result is a summary DataFrame that makes it easy to pick the best method for each field.

compare_imputation_methods

compare_imputation_methods(
    impute_columns: dict[str, list[dict[str, Any]]],
    tables: dict[str, pl.DataFrame],
    n_folds: int = 5,
    sample_pct: float = 5.0,
    random_state: int | None = None,
    output_path: str | None = None,
) -> pl.DataFrame

Run k-fold validation for every method on every imputed column.

For each unique (table, column) found in impute_columns, KNN, RF, and MICE are each evaluated using the same enrichment and feature set. This produces a comparison table that helps choose the best method per field.

Parameters:

Name Type Description Default
impute_columns dict[str, list[dict[str, Any]]]

The same config dict passed to imputation().

required
tables dict[str, pl.DataFrame]

Dict of canonical DataFrames (already cleaned).

required
n_folds int

Number of cross-validation folds.

5
sample_pct float

Percentage of non-missing values to test (0-100).

5.0
random_state int | None

Random seed for reproducibility.

None
output_path str | None

Optional path to save the comparison CSV.

None

Returns:

Type Description
pl.DataFrame

Polars DataFrame with columns: table, variable, method, type, n_samples, n_folds, accuracy, precision, recall, f1, rmse, mae, r2.

processing.imputation.flags

Diagnostic flag columns for tracking imputed values.

When create_flags=True (the default), the imputation step creates a boolean column for every imputed field::

{column}_imputed   - True if the value was filled in, False otherwise

Examples: mode_imputed, distance_imputed, age_imputed.

Use cases:

  • Quality control: identify records that contain imputed values.
  • Sensitivity analysis: compare results with vs. without imputed records.
  • Downstream modelling: include imputation status as a feature.
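The flag logic reduces to a null-before, non-null-after comparison. A minimal sketch on plain lists (the real functions compare Polars columns):

```python
def imputed_flags(original, imputed):
    """True where a value was filled in: null before, non-null after."""
    return [o is None and i is not None for o, i in zip(original, imputed)]


flags = imputed_flags([5, None, None, 7], [5, 3, None, 7])
# [False, True, False, False] -- only the second value was actually filled
```

Note that a value left null by the imputer (e.g. no usable neighbours) is correctly flagged False.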

create_flag_columns

create_flag_columns(
    df: pl.DataFrame, original_df: pl.DataFrame, columns: list[str]
) -> pl.DataFrame

Create boolean flag columns for multiple imputed columns.

Parameters:

Name Type Description Default
df pl.DataFrame

DataFrame with imputed values

required
original_df pl.DataFrame

Original DataFrame before imputation

required
columns list[str]

List of column names that were imputed

required

Returns:

Type Description
pl.DataFrame

DataFrame with added flag columns named '{column}_imputed'

create_flag_column

create_flag_column(
    df: pl.DataFrame, original_df: pl.DataFrame, column: str
) -> pl.DataFrame

Create a boolean flag column indicating which values were imputed.

Parameters:

Name Type Description Default
df pl.DataFrame

DataFrame with imputed values

required
original_df pl.DataFrame

Original DataFrame before imputation

required
column str

Name of the column that was imputed

required

Returns:

Type Description
pl.DataFrame

DataFrame with added flag column named '{column}_imputed'

processing.imputation.validation

K-fold cross-validation for imputation quality assessment.

Optional validation that assesses how accurate the imputation is by:

  1. Sampling a percentage of non-missing values (user-configurable, e.g. 5%).
  2. Artificially masking those values (setting them to null).
  3. Imputing them using k-fold cross-validation.
  4. Comparing imputed vs. actual values.
  5. Computing and logging quality metrics.

Metrics by data type:

  • Categorical columns (e.g. mode, purpose): Accuracy, Precision, Recall, F1-Score.
  • Continuous columns (e.g. distance, duration): RMSE, MAE, R².
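The mask-and-score loop above can be sketched for a categorical column; mask_and_score and the trivial imputer are hypothetical illustrations, not the module's API:

```python
import random


def mask_and_score(values, impute_fn, sample_pct=5.0, seed=42):
    """Mask a sample of known values, re-impute, and report accuracy."""
    rng = random.Random(seed)
    known = [i for i, v in enumerate(values) if v is not None]
    n_test = max(1, round(len(known) * sample_pct / 100))
    test_idx = set(rng.sample(known, n_test))
    masked = [None if i in test_idx else v for i, v in enumerate(values)]
    predictions = impute_fn(masked)
    hits = sum(predictions[i] == values[i] for i in test_idx)
    return hits / n_test


# A column where the overall mode is always correct, so accuracy is 1.0:
values = ["car"] * 20
acc = mask_and_score(values, lambda col: ["car"] * len(col))  # 1.0
```

The real validation repeats this across k folds and reports the continuous-column metrics (RMSE, MAE, R²) the same way.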

Configuration::

random_state: 42
validate_imputation:
  enabled: true
  n_folds: 5          # number of CV folds (default: 5)
  sample_pct: 5.0     # % of complete values to test (default: 5%)

Example output::

============================================================
Imputation Validation Results
============================================================

Column: mode (categorical, n=250 test samples)
  Accuracy:  0.876
  Precision: 0.883
  Recall:    0.876
  F1-Score:  0.872

Column: distance (continuous, n=250 test samples)
  RMSE: 2.34
  MAE:  1.82
  R²:   0.721
============================================================

Validation uses the same enrichment (joins, aggregations) as the real imputation pipeline, so metrics reflect the full feature set. Note that validation adds computational overhead (k folds cost roughly k× the imputation time); it is recommended during development and testing, and optional in production.

validate_knn_imputation

validate_knn_imputation(
    df: pl.DataFrame,
    column: str,
    n_folds: int,
    sample_pct: float,
    n_neighbors: int,
    neighbor_weights: Literal["uniform", "distance"],
    random_state: int | None = None,
    numeric_features: list[str] | None = None,
    categorical_features: list[str] | None = None,
) -> dict[str, Any]

Validate KNN imputation quality using k-fold cross-validation.

validate_mice_imputation

validate_mice_imputation(
    df: pl.DataFrame,
    columns: list[str],
    n_folds: int,
    sample_pct: float,
    max_iter: int,
    random_state: int | None = None,
    numeric_features: list[str] | None = None,
    categorical_features: list[str] | None = None,
) -> dict[str, dict[str, Any]]

Validate MICE imputation quality using k-fold cross-validation.

validate_rf_imputation

validate_rf_imputation(
    df: pl.DataFrame,
    column: str,
    n_folds: int,
    sample_pct: float,
    n_estimators: int = 100,
    max_depth: int | None = None,
    random_state: int | None = None,
    numeric_features: list[str] | None = None,
    categorical_features: list[str] | None = None,
) -> dict[str, Any]

Validate Random Forest imputation quality using k-fold cross-validation.

log_validation_results

log_validation_results(
    metrics: dict[str, Any] | dict[str, dict[str, Any]],
) -> None

Log validation metrics in a readable format.

Parameters:

Name Type Description Default
metrics dict[str, Any] | dict[str, dict[str, Any]]

Dictionary of metrics (single column or per-column dict)

required