







Handling missing data is a common preprocessing task in machine learning. In scikit-learn, you can handle missing data by using the imputation techniques provided by the SimpleImputer class, or by employing other strategies such as dropping rows/columns with missing values. Here’s a basic example of how to use SimpleImputer (note that in Python, we usually use np.nan to denote a missing entry):
from sklearn.impute import SimpleImputer
import numpy as np
# Sample data with missing values represented as np.nan
X = np.array([[1, 2, np.nan], [3, np.nan, 5], [7, 8, 9]])
# Create the imputer object with a strategy of replacing missing values with the mean of the column
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
# Fit the imputer to the data and transform it
X_imputed = imputer.fit_transform(X)
print(X_imputed)
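To see where the filled-in numbers come from, the sketch below repeats the example and compares the fitted imputer's statistics_ attribute (the per-column fill values it learned) with column means computed by hand using np.nanmean. The variable names are just illustrative.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Same sample data as above
X = np.array([[1, 2, np.nan], [3, np.nan, 5], [7, 8, 9]])

imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
X_imputed = imputer.fit_transform(X)

# statistics_ holds the fill value learned for each column during fit
print(imputer.statistics_)

# Column means computed by hand, ignoring the missing entries
manual_means = np.nanmean(X, axis=0)
print(manual_means)

# Each np.nan has been replaced by the mean of its own column
print(X_imputed)
```

The two printed vectors should agree: each missing entry is filled with the mean of the observed values in the same column, not with a mean over the whole array.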
The strategy parameter in the SimpleImputer class can take the following values:
- ‘mean’: This will replace the missing values using the mean along each column. This can only be used with numeric data.
- ‘median’: This will replace the missing values using the median along each column. This can only be used with numeric data.
- ‘most_frequent’: This will replace the missing values using the most frequent value along each column. This can be used with strings or numeric data.
- ‘constant’: This will replace the missing values with fill_value. This strategy can be used with strings or numeric data.
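The strategies listed above can be compared side by side. The following is a minimal sketch on made-up data (X_num and X_str are hypothetical arrays chosen so that the three numeric strategies give different fill values); it also shows ‘most_frequent’ applied to a column of strings stored with dtype=object.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical numeric data: one missing entry in each column
X_num = np.array([[1, 10],
                  [1, np.nan],
                  [2, 10],
                  [np.nan, 20],
                  [12, 40]], dtype=float)

# Fit one imputer per strategy and record the fill value chosen per column
fill_values = {}
for strategy in ("mean", "median", "most_frequent"):
    imputer = SimpleImputer(missing_values=np.nan, strategy=strategy)
    imputer.fit(X_num)
    fill_values[strategy] = imputer.statistics_
    print(strategy, fill_values[strategy])

# 'most_frequent' also works on string data (object dtype)
X_str = np.array([["red"], ["blue"], [np.nan], ["red"]], dtype=object)
str_imputer = SimpleImputer(missing_values=np.nan, strategy="most_frequent")
X_str_imputed = str_imputer.fit_transform(X_str)
print(X_str_imputed)
```

On skewed columns like these, the mean is pulled toward the large values while the median and most frequent value are not, which is one reason to choose the strategy per dataset rather than defaulting to ‘mean’.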
Example of using ‘constant’:
imputer = SimpleImputer(missing_values=np.nan, strategy='constant', fill_value=0)
In this example, once the imputer is applied to the data with fit_transform, all missing values in the dataset will be replaced by 0.
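As a rough sketch, here is the ‘constant’ imputer applied to the same sample data used earlier, followed by the alternative mentioned at the start of the post: dropping the rows that contain missing values. The row-dropping part uses a plain NumPy boolean mask, which is one possible way to do it and not a scikit-learn feature.

```python
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[1, 2, np.nan], [3, np.nan, 5], [7, 8, 9]])

# Replace every missing entry with the constant 0
imputer = SimpleImputer(missing_values=np.nan, strategy='constant', fill_value=0)
X_filled = imputer.fit_transform(X)
print(X_filled)

# Alternative: keep only the rows that have no missing entries
complete_rows = ~np.isnan(X).any(axis=1)
X_dropped = X[complete_rows]
print(X_dropped)
```

Imputing keeps all three rows, while dropping leaves only the single complete row, so dropping can discard a lot of data when missing values are spread across many rows.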