A comic guide to mean/median/mode imputation & Python code

Handling missing data is a common preprocessing task in machine learning. In scikit-learn, you can fill in missing values with the SimpleImputer class, or you can take a different approach, such as dropping the rows or columns that contain missing values. Here’s a basic example of how to use SimpleImputer:

(Note that in Python, a missing entry is usually represented by np.nan.)

from sklearn.impute import SimpleImputer
import numpy as np

# Sample data with missing values represented as np.nan
X = np.array([[1, 2, np.nan], [3, np.nan, 5], [7, 8, 9]])

# Create the imputer object with a strategy of replacing missing values with the mean of the column
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')

# Fit the imputer to the data and transform it
X_imputed = imputer.fit_transform(X)

print(X_imputed)
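
To make the fit-and-transform step more concrete, here is a small sketch that separates the two calls. It reuses the sample array from the example above; the X_new row is a hypothetical new observation added only for illustration. After fitting, the column means the imputer learned should be available in its statistics_ attribute, and transform reuses those same means on any data passed in later.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Same sample data as in the example above
X = np.array([[1, 2, np.nan], [3, np.nan, 5], [7, 8, 9]])

imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
imputer.fit(X)

# Column means, computed from the non-missing entries only
col_means = imputer.statistics_
print(col_means)

# Fill the gaps in the original data with those means
X_imputed = imputer.transform(X)
print(X_imputed)

# Hypothetical new row (not part of the original example):
# its gaps are filled with the means learned from X
X_new = np.array([[np.nan, 4, np.nan]])
X_new_imputed = imputer.transform(X_new)
print(X_new_imputed)
```

Splitting fit from transform in this way is what lets you learn the fill values on training data and apply them unchanged to test data.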

The strategy parameter in the SimpleImputer class can take the following values:

  1. ‘mean’: This will replace the missing values using the mean along each column. This can only be used with numeric data.
  2. ‘median’: This will replace the missing values using the median along each column. This can only be used with numeric data.
  3. ‘most_frequent’: This will replace the missing values using the most frequent value along each column. If several values are tied for most frequent, the smallest one is used. This can be used with strings or numeric data.
  4. ‘constant’: This will replace the missing values with fill_value. This strategy can be used with strings or numeric data.
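
The four strategies listed above can be compared side by side with a short sketch. The single-column array below is hypothetical data chosen so that each strategy gives a different fill value; the names imputers and filled are illustrative only.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical column: 1 appears twice, 10 is an outlier, last entry is missing
X = np.array([[1.0], [1.0], [2.0], [10.0], [np.nan]])

imputers = {
    'mean': SimpleImputer(strategy='mean'),
    'median': SimpleImputer(strategy='median'),
    'most_frequent': SimpleImputer(strategy='most_frequent'),
    'constant': SimpleImputer(strategy='constant', fill_value=0),
}

# Record the value each strategy puts in place of the missing entry
filled = {}
for name, imp in imputers.items():
    filled[name] = float(imp.fit_transform(X)[-1, 0])
    print(name, filled[name])
```

On this data, the mean is pulled upward by the outlier, while the median and the most frequent value stay close to the bulk of the observations.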

Example of using ‘constant’:

imputer = SimpleImputer(missing_values=np.nan, strategy='constant', fill_value=0)

In this example, once the imputer is fitted and applied (for instance with fit_transform), all missing values in the dataset will be replaced by 0.
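
Since ‘most_frequent’ and ‘constant’ also accept strings, here is a minimal sketch of both on text data. It assumes the column is stored as a NumPy object array with np.nan marking the gap; the colors array and the 'missing' placeholder are hypothetical choices for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical categorical column stored as Python objects, with one gap
colors = np.array([['red'], ['blue'], [np.nan], ['red']], dtype=object)

# Fill the gap with the most frequent string in the column
mode_imputer = SimpleImputer(missing_values=np.nan, strategy='most_frequent')
colors_mode = mode_imputer.fit_transform(colors)
print(colors_mode)

# Fill the gap with a fixed placeholder string instead
const_imputer = SimpleImputer(missing_values=np.nan, strategy='constant',
                              fill_value='missing')
colors_const = const_imputer.fit_transform(colors)
print(colors_const)
```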

