K-Nearest Neighbors (KNN) imputation in sklearn

K-Nearest Neighbors (KNN) imputation is another method for handling missing data. For each row that contains missing values, it finds the ‘k’ most similar rows (its nearest neighbors) and uses their values to fill in the gaps.

In sklearn, you can use the KNNImputer class for this. Here is an example:

  1. Import Necessary Libraries
import numpy as np
from sklearn.impute import KNNImputer
  2. Create Data
# Assume we have a dataset with missing values
data = [[1, 2, np.nan], [3, 4, 3], [np.nan, 6, 5], [8, 8, 7]]
  3. Create KNNImputer Object
# We can specify the number of neighbors we want to use
imputer = KNNImputer(n_neighbors=2)
  4. Fit and Transform Data
imputed_data = imputer.fit_transform(data)

The fit_transform method replaces each missing value with the mean of that feature over the ‘k’ nearest neighbors, which are found, by default, using a Euclidean distance that ignores missing entries.

Now imputed_data is a NumPy array with no missing values.
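For instance, printing the result shows how each gap was filled. The values below are what the toy data above should produce with the default distance metric (nan_euclidean) and uniform neighbor weights:

print(imputed_data)
# Expected output (approximately):
# [[1.  2.  4. ]
#  [3.  4.  3. ]
#  [5.5 6.  5. ]
#  [8.  8.  7. ]]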

Again, consider the nature of your data and problem before deciding how to handle missing data. KNN imputation takes more computational resources than simple imputation methods, but it can provide better results when the data has patterns that the simple methods cannot capture.
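For comparison, here is a minimal sketch of simple (column-mean) imputation of the same toy data with sklearn's SimpleImputer. Every gap in a column receives the same fill value, regardless of which rows are similar:

from sklearn.impute import SimpleImputer

# Column-mean imputation: each missing entry is replaced by its column's mean
simple_imputer = SimpleImputer(strategy='mean')
simple_imputed = simple_imputer.fit_transform(data)
# Unlike KNNImputer, the fill value does not depend on which rows resemble each other.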

Fine-tuning k using cross-validation

Cross-validation can indeed be used to fine-tune the number of neighbors, k, in KNN imputation. Here’s how you can do it:

  1. Import Necessary Libraries
import numpy as np
from sklearn.impute import KNNImputer
from sklearn.model_selection import cross_val_score, KFold
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
  2. Create Data
# Assume we have a dataset with missing values
# and a target variable y (needed so cross-validation has something to score).
# The four toy rows are repeated so that 5-fold cross-validation has enough samples.
X = [[1, 2, np.nan], [3, 4, 3], [np.nan, 6, 5], [8, 8, 7]] * 5
y = [0, 1, 0, 1] * 5
  3. Create Cross-Validation and Model: for this, we need an evaluation criterion. It can be a regression error such as MSE or Mean Absolute Error, or the score of a downstream classifier. Note that MSE is sensitive to outliers, so here let’s use a LogisticRegression classifier
cv = KFold(n_splits=5, random_state=1, shuffle=True)
model = LogisticRegression()

  4. Create Pipeline and Cross-Validate Different Values of K
results = []
for k in range(1, 6):
    # Create pipeline with KNNImputer and Logistic Regression model
    pipeline = Pipeline(steps=[('i', KNNImputer(n_neighbors=k)), ('m', model)])
    # Evaluate pipeline
    scores = cross_val_score(pipeline, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
    # Append scores to results
    results.append(scores)

In this code, we’re trying out different values of k (from 1 to 5) for the KNNImputer, and evaluating the model performance using cross-validation. The results will be stored in the results list.

  5. Analyze and Select the Best Value of K

After running the cross-validation, you can analyze the results and select the value of k that gave the best performance, for example by comparing the mean accuracy across folds for each k, as sketched below.
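A minimal sketch of that comparison, assuming the results list produced by the loop above (results[i] holds the fold accuracies for k = i + 1):

# Summarize the cross-validated accuracy for each candidate k
for k, scores in zip(range(1, 6), results):
    print('k=%d: mean accuracy %.3f (std %.3f)' % (k, np.mean(scores), np.std(scores)))

# Select the k with the highest mean accuracy
best_k = 1 + int(np.argmax([np.mean(scores) for scores in results]))
print('Best k:', best_k)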

Using train-test-split to select k

Note that KNN imputation can be very slow, so using cross-validation to select k may be expensive. In that case, we can use a single train-test split to select k instead.

To select the optimal k using a train-test split, you can follow these steps:

  1. Import Necessary Libraries
import numpy as np
from sklearn.impute import KNNImputer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression
  2. Create Data
# Assume we have a dataset with missing values
# and a target variable y.
# The toy rows are repeated so the test split contains more than one row.
X = [[1, 2, np.nan], [3, 4, 3], [np.nan, 6, 5], [8, 8, 7]] * 5
y = [0, 1, 0, 1] * 5
  3. Split Data into Train and Test Sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
  4. Choose the Evaluation Metric: as explained before, we need an evaluation criterion. It can be a downstream classifier (Logistic Regression, as before), but when speed is a concern we can score the imputation directly with the Mean Absolute Error (MAE). Since the true values behind the originally missing entries are unknown, the idea is to hide a few observed entries in the test set, impute them, and measure the MAE between the imputed estimates and the values we hid. In Python, we need to import

    from sklearn.metrics import mean_absolute_error
  5. Iterate Over Different Values of K, Fit, Impute and Evaluate
# Hide a few observed entries in the test set so we have ground truth to score against
rng = np.random.default_rng(1)
X_test = np.asarray(X_test, dtype=float)
observed = np.argwhere(~np.isnan(X_test))
hidden = observed[rng.choice(len(observed), size=len(observed) // 3, replace=False)]
X_test_masked = X_test.copy()
X_test_masked[hidden[:, 0], hidden[:, 1]] = np.nan

best_score = np.inf
best_k = 0
for k in range(1, 6):
    # Create imputer with k neighbors
    imputer = KNNImputer(n_neighbors=k)

    # Fit on the train data only
    imputer.fit(X_train)

    # Impute the masked test data
    X_test_imputed = imputer.transform(X_test_masked)

    # Calculate MAE between the hidden true values and their imputed estimates
    mae = mean_absolute_error(X_test[hidden[:, 0], hidden[:, 1]],
                              X_test_imputed[hidden[:, 0], hidden[:, 1]])

    # Compare and store the best score and corresponding k
    if mae < best_score:
        best_score = mae
        best_k = k

In this code, we iterate over different values of k (from 1 to 5) for the KNNImputer. For each k, we impute the hidden test entries and calculate the Mean Absolute Error between the imputed estimates and the true values we hid, selecting the k value that gives the smallest MAE.
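Once best_k is chosen, a natural final step (a sketch under the same assumptions as above) is to refit the imputer on the full dataset with that value:

# Refit with the selected number of neighbors and impute the whole dataset
final_imputer = KNNImputer(n_neighbors=best_k)
X_imputed = final_imputer.fit_transform(X)
print('Selected k:', best_k, 'with held-out MAE:', best_score)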

