

K-Nearest Neighbors (KNN) imputation is another method to handle missing data. It uses the ‘k’ closest instances (rows) to each instance that contains any missing values to fill in those values.
In sklearn, you can use the `KNNImputer` class for this. Here is an example:
- Import Necessary Libraries
```python
import numpy as np
from sklearn.impute import KNNImputer
```
- Create Data
```python
# Assume we have a dataset with missing values
data = [[1, 2, np.nan], [3, 4, 3], [np.nan, 6, 5], [8, 8, 7]]
```
- Create KNNImputer Object
```python
# We can specify the number of neighbors we want to use
imputer = KNNImputer(n_neighbors=2)
```
- Fit and Transform Data
```python
imputed_data = imputer.fit_transform(data)
```
The `fit_transform` method replaces each missing value with the mean of that feature's values among the ‘k’ nearest neighbors, so `imputed_data` should now be a dataset with no missing values.
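As a quick sanity check (a minimal sketch using the `imputed_data` array from above), you can confirm that no missing values remain:

```python
# The imputed array should contain no NaN values
print(imputed_data)
assert not np.isnan(imputed_data).any()
```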
Again, consider the nature of your data and problem before deciding how to handle missing data. KNN imputation takes more computational resources than simple imputation methods, but it can provide better results when the data has patterns that the simple methods cannot capture.
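For comparison, here is a minimal sketch of a simple baseline using sklearn's `SimpleImputer`, which fills each missing value with its column mean and ignores any similarity between rows:

```python
from sklearn.impute import SimpleImputer

# Baseline: replace every missing value with the column mean
simple_imputed = SimpleImputer(strategy='mean').fit_transform(data)
```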
Fine-tuning k using cross-validation
Cross-validation can indeed be used to fine-tune the number-of-neighbors parameter (`k`) in KNN imputation. Here’s how you can do it:
- Import Necessary Libraries
```python
import numpy as np
from sklearn.impute import KNNImputer
from sklearn.model_selection import cross_val_score, KFold
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
```
- Create Data
```python
# Assume we have a dataset with missing values
# and a target variable y (for the purpose of cross-validation)
# (a real dataset needs more rows than this toy example for 5-fold CV to run)
X = [[1, 2, np.nan], [3, 4, 3], [np.nan, 6, 5], [8, 8, 7]]
y = [0, 1, 0, 1]
```
- Create Cross-Validation and Model: for this, we need an evaluation criterion. This can be MSE, Mean Absolute Error, or a downstream classifier. Note that MSE is sensitive to outliers. So, here, let’s use a `LogisticRegression` classifier:
```python
cv = KFold(n_splits=5, random_state=1, shuffle=True)
model = LogisticRegression()
```
- Create Pipeline and Cross-Validate Different Values of K
```python
results = []
for k in range(1, 6):
    # Create pipeline with KNNImputer and Logistic Regression model
    pipeline = Pipeline(steps=[('i', KNNImputer(n_neighbors=k)), ('m', model)])
    # Evaluate pipeline
    scores = cross_val_score(pipeline, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
    # Append scores to results
    results.append(scores)
```
In this code, we’re trying out different values of k
(from 1 to 5) for the KNNImputer
, and evaluating the model performance using cross-validation. The results will be stored in the results
list.
- Analyze and Select the Best Value of K
After running the cross-validation, you can analyze the results and select the value of `k` that gave the best performance.
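For instance, a minimal sketch (using the `results` list from the loop above) that picks the `k` with the highest mean cross-validated accuracy:

```python
# Mean accuracy for each candidate k (k ran from 1 to 5)
mean_scores = [scores.mean() for scores in results]
best_k = int(np.argmax(mean_scores)) + 1  # +1 because k started at 1
print(f"Best k by mean CV accuracy: {best_k}")
```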
Using train-test-split to select k
Note that KNN can be very slow, so using cross-validation to select k may be expensive. Sometimes we therefore want to use a single train-test split to select k instead.
To select the optimal `k` using a train-test split, you can follow these steps:
- Import Necessary Libraries
```python
import numpy as np
from sklearn.impute import KNNImputer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression
```
- Create Data
```python
# Assume we have a dataset with missing values
# and a target variable y
X = [[1, 2, np.nan], [3, 4, 3], [np.nan, 6, 5], [8, 8, 7]]
y = [0, 1, 0, 1]
```
- Split Data into Train and Test Sets
```python
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
```
- Choose Evaluation Criterion: As explained before, we need an evaluation criterion. This can be MSE, Mean Absolute Error, or a classifier such as Logistic Regression, as before. But when speed is a concern, we may as well just use the Mean Absolute Error (MAE) between held-out known values and their imputed replacements as the evaluation metric. In Python, we need to import:
```python
from sklearn.metrics import mean_absolute_error
```
- Iterate Over Different Values of K, Fit, Predict and Evaluate
```python
best_score = np.inf
best_k = 0

# The entries that are truly missing have no ground truth to compare against,
# so we hide a few *known* test values and score the imputer on how well it
# recovers them. (With this tiny toy test set the mask may be empty; assume a
# realistically sized dataset in practice.)
X_test = np.asarray(X_test, dtype=float)
rng = np.random.RandomState(1)
mask = ~np.isnan(X_test) & (rng.rand(*X_test.shape) < 0.3)
X_test_masked = X_test.copy()
X_test_masked[mask] = np.nan

for k in range(1, 6):
    # Create imputer with k neighbors
    imputer = KNNImputer(n_neighbors=k)
    # Fit on the train data
    imputer.fit(X_train)
    # Impute the masked test data
    X_test_imputed = imputer.transform(X_test_masked)
    # Calculate MAE between the hidden true values and their imputed values
    mae = mean_absolute_error(X_test[mask], X_test_imputed[mask])
    # Compare and store the best score and corresponding k
    if mae < best_score:
        best_score = mae
        best_k = k
```
In this code, we’re iterating over different values of k
(from 1 to 5) for the KNNImputer
. Here, we are calculating the Mean Absolute Error between the original test data and the imputed test data, and selecting the k
value that gives the smallest MAE.
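Once the loop has finished, a short follow-up sketch (assuming `best_k` and `X` from above) is to refit the imputer with the selected `k` and impute the full dataset:

```python
# Refit with the selected number of neighbors and impute the whole dataset
final_imputer = KNNImputer(n_neighbors=best_k)
X_imputed = final_imputer.fit_transform(X)
```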