Imputation
processing.imputation
Data imputation for missing values in travel diary survey data.
This module provides configurable imputation using established statistical methods. It is designed as a pipeline step that operates on any combination of canonical survey tables (households, persons, days, trips, tours).
Supported methods
- KNN -- single-column imputation via K-Nearest Neighbors similarity matching. Best for isolated missing fields where similar records exist.
- Random Forest -- single-column imputation using a supervised RF model that auto-selects classifier vs. regressor based on the column type. Best for complex non-linear relationships or mixed feature types.
- MICE -- multi-column imputation via Multiple Imputation by Chained Equations. Best for correlated variables (e.g. depart/arrive/duration).
Additional capabilities
- Diagnostic flags -- optional boolean `{column}_imputed` columns that track which values were filled in.
- Quality validation -- optional k-fold cross-validation that masks known values, re-imputes them, and reports accuracy / RMSE metrics.
- Method comparison -- head-to-head benchmark of KNN, RF, and MICE on every imputed column using the same folds and feature sets.
- Cross-table features -- `join_tables` pulls parent-table columns and auto-generates within-household mode features; `aggregate_from` pivots child rows up to a parent.
Typical pipeline position
steps:
- name: load_data
- name: custom_cleaning
- name: imputation # ← after cleaning, before linking
- name: link_trips
- name: joint_trips
- name: extract_tours
Supported relationships for joining tables
These parent → child relationships are supported for `join_tables`:
| Child table | Parent table | Join key |
|---|---|---|
| persons | households | hh_id |
| days | persons / households | person_id / hh_id |
| unlinked_trips | days / persons / households | day_id / person_id / hh_id |
| linked_trips | days / persons / households | day_id / person_id / hh_id |
| tours | persons / households | person_id / hh_id |
Missing-data assumptions
- KNN assumes similar records (by feature distance) share similar values.
- MICE assumes Missing At Random (MAR): missingness may depend on observed values but not on the missing value itself.
- If data is Missing Not At Random (MNAR), results may be biased.
Current limitations
- No stratified imputation (no `group_by` option for within-group models).
- No support for exogenous data sources (PUMS, land use data).
- High-cardinality one-hot encoding can slow MICE convergence -- move ordinal/count variables to `numeric_features` to mitigate.
References
- van Buuren, S. & Groothuis-Oudshoorn, K. (2011). mice: Multivariate Imputation by Chained Equations in R. Journal of Statistical Software, 45(3), 1-67.
- Troyanskaya, O. et al. (2001). Missing value estimation methods for DNA microarrays. Bioinformatics, 17(6), 520-525.
__all__
module-attribute
__all__ = ['imputation']
imputation
imputation(
households: pl.DataFrame | None = None,
persons: pl.DataFrame | None = None,
days: pl.DataFrame | None = None,
unlinked_trips: pl.DataFrame | None = None,
linked_trips: pl.DataFrame | None = None,
tours: pl.DataFrame | None = None,
impute_columns: dict[str, list[dict[str, Any]]] | None = None,
create_flags: bool = True,
random_state: int | None = None,
validate_imputation: dict[str, Any] | None = None,
) -> dict[str, pl.DataFrame]
Impute missing values using KNN, Random Forest, and/or MICE methods.
Each config block specifies its method (`knn`, `rf`, or `mice`)
along with the method-specific parameters. Configs are grouped by method
and executed in a fixed order (KNN → RF → MICE) across all tables so that
later phases can benefit from values filled in earlier phases.
Handling Missing Values with Enum Labels
Survey data often uses special codes for missing values (e.g. 995 for "Missing Response", 999 for "Prefer not to answer"). Use enum member names (labels) rather than raw numeric values in the config:
missing_values: [MISSING, PNTA] # enum labels, not 995/999
The module automatically:
- Maps the table name to the appropriate codebook module (e.g. `households` → `data_canon.codebook.households`).
- Finds the enum class whose `canonical_field_name` matches the target column (e.g. `income_broad` → `IncomeBroad`).
- Resolves enum member names to their values (e.g. `MISSING` → 995).
- Replaces those values with null before imputation.
For MICE with multiple columns, missing_values can be a dict
mapping each column to its own labels, or a single list applied to
all columns:
# Per-column
missing_values:
race: [MISSING]
ethnicity: [MISSING, PNTA]
# Shared
missing_values: [MISSING, PNTA] # applied to all columns
Cross-Table Features
By default only features from the same table are used. Adding
join_tables to a config block pulls columns from parent tables
via left-join on known foreign keys, which can significantly improve
quality.
Behaviour:
- Columns from the specified parent table(s) are joined onto the child table (e.g. `persons` ← `households` via `hh_id`).
- For each target column a `hh_mode_{column}` feature is auto-generated -- the mode of that column among other household members (exclude-self). This captures within-household correlation (e.g. siblings sharing race/ethnicity).
- Auto-generated `hh_mode_*` columns are appended to `categorical_features` automatically.
- After imputation all joined/aggregated columns are stripped; the output schema is unchanged.
Example
.. code-block:: yaml

    impute_columns:
      persons:
        - method: knn
          column: gender
          n_neighbors: 5
          join_tables: [households]
          categorical_features: [age, employment, income_bin, residence_type]
          #                                       ^^^^^^^^^^  ^^^^^^^^^^^^^^
          #                                  columns from the households table
Child-to-Parent Aggregation
The `aggregate_from` config option is the reverse of `join_tables`: it aggregates child rows up to a parent table. Useful when imputing parent-level fields that depend on
household composition (e.g. predicting household income from the
employment/education mix of its members).
For each child table and each field listed under `pivot_count`, the module groups child rows by the parent's FK and creates one column per unique value, counting occurrences. Generated columns are named `{child_table}_count_{field}_{value}` and are automatically added to `numeric_features`. After imputation, all generated columns are stripped.
Example
.. code-block:: yaml

    impute_columns:
      households:
        - method: mice
          columns: [income_bin]
          aggregate_from:
            persons:
              pivot_count: [employment, education, student]
          categorical_features: [residence_type, residence_rent_own]
          max_iter: 10
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `households` | `pl.DataFrame \| None` | Households table (optional). | `None` |
| `persons` | `pl.DataFrame \| None` | Persons table (optional). | `None` |
| `days` | `pl.DataFrame \| None` | Days table (optional). | `None` |
| `unlinked_trips` | `pl.DataFrame \| None` | Unlinked trips table (optional). | `None` |
| `linked_trips` | `pl.DataFrame \| None` | Linked trips table (optional). | `None` |
| `tours` | `pl.DataFrame \| None` | Tours table (optional). | `None` |
| `impute_columns` | `dict[str, list[dict[str, Any]]] \| None` | Dict mapping table names to lists of imputation configs. Each config specifies its method -- KNN (`knn`), Random Forest (`rf`), or MICE (`mice`) -- plus method-specific parameters. | `None` |
| `create_flags` | `bool` | Whether to create `{column}_imputed` flag columns. | `True` |
| `random_state` | `int \| None` | Random seed for reproducibility across all imputation. | `None` |
| `validate_imputation` | `dict[str, Any] \| None` | Optional validation config with keys `enabled`, `n_folds`, `sample_pct`. | `None` |
Returns:
| Type | Description |
|---|---|
| `dict[str, pl.DataFrame]` | Dictionary of imputed tables. When validation is enabled, an extra key reports per-column metrics: table, variable, method, type, n_samples, n_folds, accuracy, precision, recall, f1, rmse, mae, r2. When method comparison is enabled, a further key benchmarks KNN, RF, and MICE for every imputed column. |
Example config
.. code-block:: yaml
impute_columns:
households:
- method: knn
column: income_broad
missing_values: [MISSING, PNTA]
n_neighbors: 5
neighbor_weights: distance
numeric_features: [num_persons, num_vehicles, num_workers]
persons:
- method: knn
column: gender
missing_values: [MISSING]
n_neighbors: 5
join_tables: [households]
numeric_features: [age]
categorical_features: [relationship, employment, income_bin]
- method: rf
column: education
missing_values: [MISSING]
n_estimators: 200
max_depth: 15
numeric_features: [age]
categorical_features: [employment, occupation]
- method: mice
columns: [race, ethnicity]
missing_values:
race: [MISSING]
ethnicity: [MISSING, PNTA]
join_tables: [households]
max_iter: 10
numeric_features: [age]
random_state: 42
create_flags: true
validate_imputation:
enabled: true
n_folds: 5
sample_pct: 5.0
processing.imputation.generic_impute
Generic imputation step using KNN, Random Forest, and MICE methods.
processing.imputation.knn
KNN-based imputation for missing values.
impute_knn
impute_knn(
df: pl.DataFrame,
column: str,
n_neighbors: int = 5,
neighbor_weights: Literal["uniform", "distance"] = "distance",
numeric_features: list[str] | None = None,
categorical_features: list[str] | None = None,
) -> tuple[pl.DataFrame, dict[str, Any]]
Impute missing values in a single column using K-Nearest Neighbors.
Best for: single columns with isolated missing values where similar records exist in the dataset.
How it works:
- Build a feature matrix from `numeric_features` (used as-is) and `categorical_features` (one-hot encoded for distance calculation).
- Non-contiguous integer codes (e.g. enum values 1, 2, 3, 995, 999) are automatically encoded to dense 0..N codes so they don't distort distance calculations, then decoded back after imputation.
- For each row with a missing value, find the K most similar records based on Euclidean distance across all features.
- Impute the missing value using the weighted average (or mode for categoricals) of the K neighbours.

`neighbor_weights='distance'` weights closer neighbours more heavily; `neighbor_weights='uniform'` treats all K neighbours equally.
Example use cases:
- Missing trip mode when other trip attributes are known.
- Missing person age when household/demographic info is available.
- Missing trip distance when other spatial/temporal features exist.
Performance: O(n log n) complexity; scales well to medium-large datasets.
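The distance-weighted neighbour average can be sketched in plain Python for a single feature and target (the module builds a full feature matrix; this is only the core idea):

```python
# Distance-weighted KNN imputation, one feature and one target column.
def knn_impute(rows: list[dict], feature: str, target: str, k: int = 2) -> list[dict]:
    known = [r for r in rows if r[target] is not None]
    for r in rows:
        if r[target] is not None:
            continue
        # K most similar records by (1-D) Euclidean distance.
        nbrs = sorted(known, key=lambda q: abs(q[feature] - r[feature]))[:k]
        # neighbor_weights='distance': closer neighbours count more.
        weights = [1.0 / (abs(q[feature] - r[feature]) + 1e-9) for q in nbrs]
        r[target] = sum(w * q[target] for w, q in zip(weights, nbrs)) / sum(weights)
    return rows

rows = [
    {"age": 30, "income": 50.0},
    {"age": 32, "income": 54.0},
    {"age": 31, "income": None},
]
out = knn_impute(rows, "age", "income", k=2)
```

Here the missing income at age 31 is filled from its two equidistant neighbours, so the weights are equal and the result is their plain average.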
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `df` | `pl.DataFrame` | DataFrame containing the column to impute. | required |
| `column` | `str` | Name of the column to impute. | required |
| `n_neighbors` | `int` | Number of similar records to use. | `5` |
| `neighbor_weights` | `Literal['uniform', 'distance']` | `'distance'` weights closer neighbours more heavily; `'uniform'` treats all K neighbours equally. | `'distance'` |
| `numeric_features` | `list[str] \| None` | Numeric/continuous feature columns. Used as-is. | `None` |
| `categorical_features` | `list[str] \| None` | Categorical feature columns. One-hot encoded into binary columns for distance calculation. | `None` |
Returns:
| Type | Description |
|---|---|
| `tuple[pl.DataFrame, dict[str, Any]]` | Tuple of (imputed DataFrame, stats dict). |
processing.imputation.random_forest
Random Forest imputation for missing values.
impute_random_forest
impute_random_forest(
df: pl.DataFrame,
column: str,
n_estimators: int = 100,
max_depth: int | None = None,
random_state: int | None = None,
numeric_features: list[str] | None = None,
categorical_features: list[str] | None = None,
) -> tuple[pl.DataFrame, dict[str, Any]]
Impute missing values in a single column using Random Forest.
Best for: single columns with complex non-linear relationships or mixed feature types where KNN may struggle with decision boundaries.
How it works:
- Split rows into known (have a value) and missing (need imputation).
- Train a Random Forest model on the known rows using all features.
- Automatically select `RandomForestClassifier` for categorical targets (integer / string dtypes) or `RandomForestRegressor` for continuous targets (float dtypes).
- Predict missing values using the trained model.
- NaN values in features are filled with column medians before training.

Non-contiguous integer codes (e.g. enum values 1, 2, 3, 995, 999) are automatically encoded to dense 0..N codes so they don't distort the model, then decoded back after prediction.
Example use cases:
- Missing education level when employment, occupation, and age are available.
- Missing income category with many mixed-type predictors.
- Cases where KNN struggles with non-linear decision boundaries.
Performance: trains on known values only; handles mixed types well but can be memory-intensive with many trees.
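The classifier-vs-regressor selection can be sketched with scikit-learn (a simplified stand-in; `pick_model` and its dtype test are illustrative, not the module's actual code):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

def pick_model(target: np.ndarray, n_estimators: int = 100, random_state: int = 0):
    # Float dtype -> regression target; integer/string codes -> classification.
    if np.issubdtype(target.dtype, np.floating):
        return RandomForestRegressor(n_estimators=n_estimators, random_state=random_state)
    return RandomForestClassifier(n_estimators=n_estimators, random_state=random_state)

clf = pick_model(np.array([1, 2, 2, 3]))      # integer enum codes
reg = pick_model(np.array([1.5, 2.0, 3.25]))  # continuous values
```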
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `df` | `pl.DataFrame` | DataFrame containing the column to impute. | required |
| `column` | `str` | Name of the column to impute. | required |
| `n_estimators` | `int` | Number of trees in the forest. | `100` |
| `max_depth` | `int \| None` | Maximum tree depth (`None` = unlimited). | `None` |
| `random_state` | `int \| None` | Random seed for reproducibility. | `None` |
| `numeric_features` | `list[str] \| None` | Numeric/continuous feature columns. | `None` |
| `categorical_features` | `list[str] \| None` | Categorical feature columns (one-hot encoded). | `None` |
|
Returns:
| Type | Description |
|---|---|
| `tuple[pl.DataFrame, dict[str, Any]]` | Tuple of (imputed DataFrame, stats dict). |
processing.imputation.mice
MICE-based imputation for missing values.
impute_mice
impute_mice(
df: pl.DataFrame,
columns: list[str],
max_iter: int = 10,
random_state: int | None = None,
numeric_features: list[str] | None = None,
categorical_features: list[str] | None = None,
verbose: bool = True,
) -> tuple[pl.DataFrame, dict[str, Any]]
Impute missing values in multiple correlated columns using MICE.
Best for: multiple correlated columns with missing values (e.g. depart_hour / arrive_hour / duration, or race / ethnicity).
MICE (Multiple Imputation by Chained Equations) imputes several variables together, preserving their joint distribution.
How it works:
- Initialise missing values with simple imputation (mean/mode).
- For each column with missing values:
    - Treat it as the target variable.
    - Use the other columns as predictors in a regression model.
    - Predict and update missing values.
- Repeat iteratively until convergence (up to `max_iter` rounds).
Categorical integer columns (e.g. enum codes 1-6) are automatically encoded to dense 0..N codes before imputation and decoded back afterwards. String columns are auto-encoded to integers for the MICE model and decoded to original labels after imputation.
Assumes Missing At Random (MAR): missingness may depend on observed values but not on the missing value itself. If data is Missing Not At Random (MNAR), results may be biased.
Example use cases:
- Time fields (depart_hour, arrive_hour, duration) — highly correlated.
- Spatial coordinates (origin_lat, origin_lon) — spatially correlated.
- Socio-demographic variables (income, education, employment) — often correlated.
Performance: iterative, can be slow for many columns or large datasets.
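The chained-equations loop can be sketched with scikit-learn's `IterativeImputer` (a generic MICE-style imputer, not this module's implementation; the toy columns mimic the depart/arrive example):

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Two correlated columns (arrive is roughly depart + 1); one missing value.
X = np.array([
    [7.0, 8.0],
    [8.0, 9.0],
    [8.5, 9.5],
    [9.0, np.nan],
])
imputer = IterativeImputer(max_iter=10, random_state=0)
filled = imputer.fit_transform(X)
```

The missing arrive time is regressed on the observed depart times, so the fill lands near the pattern in the complete rows rather than at the column mean.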
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `df` | `pl.DataFrame` | DataFrame containing the columns to impute. | required |
| `columns` | `list[str]` | Column names to impute together. | required |
| `max_iter` | `int` | Maximum number of imputation rounds. | `10` |
| `random_state` | `int \| None` | Random seed for reproducibility. | `None` |
| `numeric_features` | `list[str] \| None` | Numeric/continuous feature columns. | `None` |
| `categorical_features` | `list[str] \| None` | Categorical feature columns (one-hot encoded). | `None` |
| `verbose` | `bool` | Whether to log progress during imputation. | `True` |
Returns:
| Type | Description |
|---|---|
| `tuple[pl.DataFrame, dict[str, Any]]` | Tuple of (imputed DataFrame, stats dict). The stats dict is keyed by column name. |
processing.imputation.comparison
Head-to-head comparison of imputation methods via k-fold cross-validation.
For each imputed column, every supported method (KNN, RF, MICE) is evaluated using the same k-fold splits and the same feature set. The result is a summary DataFrame that makes it easy to pick the best method for each field.
compare_imputation_methods
compare_imputation_methods(
impute_columns: dict[str, list[dict[str, Any]]],
tables: dict[str, pl.DataFrame],
n_folds: int = 5,
sample_pct: float = 5.0,
random_state: int | None = None,
output_path: str | None = None,
) -> pl.DataFrame
Run k-fold validation for every method on every imputed column.
For each unique (table, column) found in impute_columns, KNN, RF, and MICE are each evaluated using the same enrichment and feature set. This produces a comparison table that helps choose the best method per field.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `impute_columns` | `dict[str, list[dict[str, Any]]]` | The same config dict passed to `imputation`. | required |
| `tables` | `dict[str, pl.DataFrame]` | Dict of canonical DataFrames (already cleaned). | required |
| `n_folds` | `int` | Number of cross-validation folds. | `5` |
| `sample_pct` | `float` | Percentage of non-missing values to test (0-100). | `5.0` |
| `random_state` | `int \| None` | Random seed for reproducibility. | `None` |
| `output_path` | `str \| None` | Optional path to save the comparison CSV. | `None` |
Returns:
| Type | Description |
|---|---|
| `pl.DataFrame` | Polars DataFrame with columns: table, variable, method, type, n_samples, n_folds, accuracy, precision, recall, f1, rmse, mae, r2. |
processing.imputation.flags
Diagnostic flag columns for tracking imputed values.
When create_flags=True (the default), the imputation step creates a
boolean column for every imputed field::
{column}_imputed - True if the value was filled in, False otherwise
Examples: mode_imputed, distance_imputed, age_imputed.
Use cases:
- Quality control: identify records that contain imputed values.
- Sensitivity analysis: compare results with vs. without imputed records.
- Downstream modelling: include imputation status as a feature.
create_flag_columns
create_flag_columns(
df: pl.DataFrame, original_df: pl.DataFrame, columns: list[str]
) -> pl.DataFrame
Create boolean flag columns for multiple imputed columns.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `df` | `pl.DataFrame` | DataFrame with imputed values. | required |
| `original_df` | `pl.DataFrame` | Original DataFrame before imputation. | required |
| `columns` | `list[str]` | List of column names that were imputed. | required |
Returns:
| Type | Description |
|---|---|
| `pl.DataFrame` | DataFrame with added flag columns named `{column}_imputed`. |
create_flag_column
create_flag_column(
df: pl.DataFrame, original_df: pl.DataFrame, column: str
) -> pl.DataFrame
Create a boolean flag column indicating which values were imputed.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `df` | `pl.DataFrame` | DataFrame with imputed values. | required |
| `original_df` | `pl.DataFrame` | Original DataFrame before imputation. | required |
| `column` | `str` | Name of the column that was imputed. | required |
Returns:
| Type | Description |
|---|---|
| `pl.DataFrame` | DataFrame with added flag column named `{column}_imputed`. |
processing.imputation.validation
K-fold cross-validation for imputation quality assessment.
Optional validation that assesses how accurate the imputation is by:
- Sampling a percentage of non-missing values (user-configurable, e.g. 5%).
- Artificially masking those values (setting them to null).
- Imputing them using k-fold cross-validation.
- Comparing imputed vs. actual values.
- Computing and logging quality metrics.
Metrics by data type:
- Categorical columns (e.g. mode, purpose): Accuracy, Precision, Recall, F1-Score.
- Continuous columns (e.g. distance, duration): RMSE, MAE, R².
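The mask-and-score loop above can be sketched in plain Python; the mode "imputer" here is a trivial stand-in for KNN/RF/MICE, used only to show the masking and scoring mechanics:

```python
import random

def validate(values: list, sample_pct: float = 50.0, seed: int = 42) -> float:
    # Hide a sample of known values, re-impute them, compare to truth.
    rng = random.Random(seed)
    known = [i for i, v in enumerate(values) if v is not None]
    n_test = max(1, int(len(known) * sample_pct / 100))
    test_idx = set(rng.sample(known, n_test))
    masked = [None if i in test_idx else v for i, v in enumerate(values)]
    observed = [v for v in masked if v is not None]
    mode = max(set(observed), key=observed.count)  # the stand-in "imputer"
    correct = sum(1 for i in test_idx if values[i] == mode)
    return correct / n_test  # accuracy on the masked sample

acc = validate([1, 1, 1, 2, 1, 1, None, 1])
```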
Configuration::
random_state: 42
validate_imputation:
enabled: true
n_folds: 5 # number of CV folds (default: 5)
sample_pct: 5.0 # % of complete values to test (default: 5%)
Example output::
============================================================
Imputation Validation Results
============================================================
Column: mode (categorical, n=250 test samples)
Accuracy: 0.876
Precision: 0.883
Recall: 0.876
F1-Score: 0.872
Column: distance (continuous, n=250 test samples)
RMSE: 2.34
MAE: 1.82
R²: 0.721
============================================================
Validation uses the same enrichment (joins, aggregations) as the real imputation pipeline, so metrics reflect the full feature set. Note that validation adds computational overhead (k-fold ≈ k × the imputation time); it is recommended for development/testing and optional in production.
validate_knn_imputation
validate_knn_imputation(
df: pl.DataFrame,
column: str,
n_folds: int,
sample_pct: float,
n_neighbors: int,
neighbor_weights: Literal["uniform", "distance"],
random_state: int | None = None,
numeric_features: list[str] | None = None,
categorical_features: list[str] | None = None,
) -> dict[str, Any]
Validate KNN imputation quality using k-fold cross-validation.
validate_mice_imputation
validate_mice_imputation(
df: pl.DataFrame,
columns: list[str],
n_folds: int,
sample_pct: float,
max_iter: int,
random_state: int | None = None,
numeric_features: list[str] | None = None,
categorical_features: list[str] | None = None,
) -> dict[str, dict[str, Any]]
Validate MICE imputation quality using k-fold cross-validation.
validate_rf_imputation
validate_rf_imputation(
df: pl.DataFrame,
column: str,
n_folds: int,
sample_pct: float,
n_estimators: int = 100,
max_depth: int | None = None,
random_state: int | None = None,
numeric_features: list[str] | None = None,
categorical_features: list[str] | None = None,
) -> dict[str, Any]
Validate Random Forest imputation quality using k-fold cross-validation.
log_validation_results
log_validation_results(
metrics: dict[str, Any] | dict[str, dict[str, Any]],
) -> None
Log validation metrics in a readable format.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `metrics` | `dict[str, Any] \| dict[str, dict[str, Any]]` | Dictionary of metrics (single column or per-column dict). | required |