Encoding categorical data in Python

Most machine learning algorithms expect numeric input, so categorical data has to be converted into numbers before it can be processed effectively. Here are four common methods for encoding categorical data:

1. Label Encoding

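Label encoding converts categorical values into numerical values. Each unique category is assigned an integer value. Scikit-learn's LabelEncoder assigns the integers in sorted order of the categories and is designed for target labels rather than input features.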

Example:

from sklearn.preprocessing import LabelEncoder

data = ['cat', 'dog', 'mouse']
label_encoder = LabelEncoder()
encoded_data = label_encoder.fit_transform(data)
print(encoded_data)  # Output: [0 1 2]
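
As a small follow-up sketch (the longer list of animals is made up for illustration), the fitted encoder keeps its mapping in the classes_ attribute and can reverse the encoding with inverse_transform. It also shows that the integers follow the sorted order of the categories, not the order in which they appear:

```python
from sklearn.preprocessing import LabelEncoder

# Illustrative data: categories repeat and are not in alphabetical order.
data = ['mouse', 'cat', 'dog', 'cat']
label_encoder = LabelEncoder()
encoded_data = label_encoder.fit_transform(data)

# classes_ holds the categories in sorted order; the position is the code.
classes = list(label_encoder.classes_)

# inverse_transform maps the integers back to the original categories.
decoded = list(label_encoder.inverse_transform(encoded_data))

print(classes)
print(list(encoded_data))
print(decoded)
```

Keeping the fitted encoder around is what makes the integers interpretable later, for example when turning model predictions back into category names.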

2. One-Hot Encoding

One-hot encoding creates a new binary column for each category. This method is useful for nominal data where there is no intrinsic order among the categories.

Example:

import pandas as pd

data = pd.DataFrame({'animal': ['cat', 'dog', 'mouse']})
# dtype=int gives 0/1 columns; pandas 2.0+ returns True/False by default
one_hot_encoded_data = pd.get_dummies(data, columns=['animal'], dtype=int)
print(one_hot_encoded_data)
# Output:
#    animal_cat  animal_dog  animal_mouse
# 0           1           0             0
# 1           0           1             0
# 2           0           0             1
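
When the same encoding has to be applied to data that arrives later, a fitted encoder object can be more convenient than get_dummies. The following is a minimal sketch using scikit-learn's OneHotEncoder; the 'bird' row is a made-up example of a category that was not seen during fitting, and handle_unknown='ignore' is one possible way to deal with it (the row is encoded as all zeros):

```python
from sklearn.preprocessing import OneHotEncoder

train = [['cat'], ['dog'], ['mouse']]
new = [['dog'], ['bird']]  # 'bird' never appeared during fit (illustrative)

# handle_unknown='ignore' encodes unseen categories as a row of zeros
# instead of raising an error.
encoder = OneHotEncoder(handle_unknown='ignore')
encoder.fit(train)

# transform returns a sparse matrix by default; toarray() makes it dense.
encoded_new = encoder.transform(new).toarray()

print(encoder.categories_)
print(encoded_new)
```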

3. Ordinal Encoding

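Ordinal encoding is similar to label encoding but is suited to ordinal data, where the order of the categories is meaningful. The order is passed to the encoder explicitly so that the assigned integers follow it.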

Example:

from sklearn.preprocessing import OrdinalEncoder

data = [['low'], ['medium'], ['high']]
ordinal_encoder = OrdinalEncoder(categories=[['low', 'medium', 'high']])
encoded_data = ordinal_encoder.fit_transform(data)
print(encoded_data)
# Output:
# [[0.]
#  [1.]
#  [2.]]
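
To see why the explicit category order matters, here is a short comparison sketch (the variable names are chosen for illustration). Without the categories argument, OrdinalEncoder sorts the categories alphabetically, so 'high' would receive the smallest code:

```python
from sklearn.preprocessing import OrdinalEncoder

data = [['low'], ['medium'], ['high']]

# Without categories=..., the categories are sorted alphabetically:
# 'high' < 'low' < 'medium', which does not match the intended order.
default_codes = OrdinalEncoder().fit_transform(data).ravel().tolist()

# With an explicit order, the codes follow low < medium < high.
explicit_encoder = OrdinalEncoder(categories=[['low', 'medium', 'high']])
explicit_codes = explicit_encoder.fit_transform(data).ravel().tolist()

print(default_codes)   # [1.0, 2.0, 0.0]
print(explicit_codes)  # [0.0, 1.0, 2.0]
```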

4. Binary Encoding

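Binary encoding first converts categories to integers, then writes those integers in binary, with one column per bit. For n categories this needs roughly log2(n) columns instead of the n columns produced by one-hot encoding, which reduces the dimensionality when there are many categories.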

Example:

In Python, binary encoding is available in the category_encoders package, which can be installed with:

pip install category_encoders

After installing it:

import pandas as pd
import category_encoders as ce

data = pd.DataFrame({'animal': ['cat', 'dog', 'mouse']})
binary_encoder = ce.BinaryEncoder(cols=['animal'])
encoded_data = binary_encoder.fit_transform(data)
print(encoded_data)
# Output:
#    animal_0  animal_1
# 0         0         1
# 1         1         0
# 2         1         1
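
To make the mechanism concrete, here is a simplified pure-Python sketch of the two steps described above. The binary_encode function is a hypothetical helper written for illustration, not the category_encoders implementation; it assumes integer codes start at 1 and are assigned in order of first appearance:

```python
def binary_encode(values):
    """Map each category to a list of bits, most significant bit first."""
    # Step 1: assign integers 1..n in order of first appearance.
    codes = {}
    for value in values:
        if value not in codes:
            codes[value] = len(codes) + 1

    # Number of bits needed to represent the largest code.
    width = len(codes).bit_length()

    # Step 2: write each code as a fixed-width binary string,
    # then split it into one 0/1 value per column.
    return [[int(bit) for bit in format(codes[value], f"0{width}b")]
            for value in values]


animals = ['cat', 'dog', 'mouse', 'dog']
encoded = binary_encode(animals)
print(encoded)

# With 100 distinct categories, count the columns that are needed.
many = [f"category_{i}" for i in range(100)]
n_columns = len(binary_encode(many)[0])
print(n_columns)
```

In this sketch, 100 distinct categories fit into 7 columns, where one-hot encoding would need 100.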

