Sun. Jul 20th, 2025

A comic guide to mean/median/mode imputation & Python codes

Handling missing data is a common preprocessing task in machine learning. In scikit-learn, you can fill in missing values with the imputation techniques provided by the SimpleImputer class; alternatively, you can simply drop the rows or columns that contain missing values. Here’s a basic example of how to use SimpleImputer:

(Note that in Python, we usually use np.nan to denote a missing entry.)

from sklearn.impute import SimpleImputer
import numpy as np

# Sample data with missing values represented as np.nan
X = np.array([[1, 2, np.nan], [3, np.nan, 5], [7, 8, 9]])

# Create the imputer object with a strategy of replacing missing values with the mean of the column
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')

# Fit the imputer to the data and transform it
X_imputed = imputer.fit_transform(X)

print(X_imputed)
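
For the sample array above, the column means computed from the observed entries are 11/3 (about 3.67), 5 and 7, so those are the values that fill the gaps. The sketch below checks this and also shows one common usage pattern: learning the means once with fit and reusing them with transform. The names X_train and X_new are made up for illustration and are not part of the original example.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Same sample data as above, treated here as "training" data
X_train = np.array([[1, 2, np.nan], [3, np.nan, 5], [7, 8, 9]])

imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
imputer.fit(X_train)

# Column means learned from the observed (non-missing) entries
print(imputer.statistics_)

# Fill the gaps in the training data with those means
X_train_imputed = imputer.transform(X_train)
print(X_train_imputed)

# Hypothetical new data: filled with the means learned from X_train,
# not with statistics of X_new itself
X_new = np.array([[np.nan, 4, np.nan]])
X_new_imputed = imputer.transform(X_new)
print(X_new_imputed)
```

Reusing the fitted imputer this way keeps the fill values consistent between the data the model was trained on and any data it sees later.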

The strategy parameter in the SimpleImputer class can take the following values:

  1. 'mean' (the default): Replaces missing values with the mean of each column. This can only be used with numeric data.
  2. 'median': Replaces missing values with the median of each column. This can only be used with numeric data.
  3. 'most_frequent': Replaces missing values with the most frequent value (the mode) of each column. If several values are tied, the smallest one is used. This can be used with strings or numeric data.
  4. 'constant': Replaces missing values with fill_value. This can be used with strings or numeric data.
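
A minimal sketch of how the choice of strategy changes the result, using made-up data: one numeric column with an outlier, where the mean and the median differ a lot, and one column of strings, where only the most-frequent and constant strategies apply. The arrays and variable names are illustrative assumptions, not part of the original example.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Numeric column with an outlier (100) and one missing entry
values = np.array([[1.0], [2.0], [100.0], [np.nan]])

mean_filled = SimpleImputer(strategy='mean').fit_transform(values)
median_filled = SimpleImputer(strategy='median').fit_transform(values)
print(mean_filled[-1, 0])    # mean of 1, 2, 100: pulled up by the outlier
print(median_filled[-1, 0])  # median of 1, 2, 100: the middle value

# String column: dtype=object lets np.nan sit next to the strings
colors = np.array([['red'], ['blue'], ['red'], [np.nan]], dtype=object)
mode_filled = SimpleImputer(strategy='most_frequent').fit_transform(colors)
print(mode_filled.ravel())   # the missing entry becomes the mode, 'red'
```

The median is often the safer choice when a column contains outliers, since a single extreme value can drag the mean far from the typical entry.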

Example of using 'constant':

imputer = SimpleImputer(missing_values=np.nan, strategy='constant', fill_value=0)

In this example, every missing value in the dataset is replaced by 0 once the imputer is applied, for instance with imputer.fit_transform(X).
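
To see the constant strategy in action, the sketch below applies it to the same sample array used earlier (redefined here so the snippet runs on its own). For comparison, it also shows the simplest form of the other option mentioned at the start, dropping rows or columns that contain missing values. The dropping is done here with plain NumPy masks, which is only one of several ways to do it.

```python
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[1, 2, np.nan], [3, np.nan, 5], [7, 8, 9]])

# Replace every missing entry with the constant 0
imputer = SimpleImputer(missing_values=np.nan, strategy='constant', fill_value=0)
X_filled = imputer.fit_transform(X)
print(X_filled)

# Alternative: drop instead of fill
missing = np.isnan(X)                      # True where an entry is missing
X_drop_rows = X[~missing.any(axis=1)]      # keep rows with no missing value
X_drop_cols = X[:, ~missing.any(axis=0)]   # keep columns with no missing value
print(X_drop_rows)
print(X_drop_cols)
```

On this tiny array, dropping rows keeps only the last row and dropping columns keeps only the first column, which shows how much data dropping can discard compared with imputation.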

