Categorical data must be converted into a numeric format before most machine learning algorithms can process it effectively. Here are four common encoding methods:
1. Label Encoding
Label encoding converts categorical values into numerical values. Each unique category is assigned an integer value.
Example:
from sklearn.preprocessing import LabelEncoder
data = ['cat', 'dog', 'mouse']
label_encoder = LabelEncoder()
encoded_data = label_encoder.fit_transform(data)
print(encoded_data)  # Output: [0 1 2]
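To make the integer assignment concrete, here is a minimal pure-Python sketch of the same idea. It assumes, as scikit-learn's LabelEncoder does, that categories are numbered in sorted order. The helper names fit_labels and transform_labels are hypothetical and exist only for illustration.

```python
def fit_labels(values):
    """Map each unique category to an integer, in sorted order."""
    return {category: index for index, category in enumerate(sorted(set(values)))}

def transform_labels(values, mapping):
    """Replace every category with its integer code."""
    return [mapping[value] for value in values]

data = ['dog', 'cat', 'mouse', 'cat']
mapping = fit_labels(data)
encoded = transform_labels(data, mapping)

# Invert the mapping to recover the original categories.
inverse = {index: category for category, index in mapping.items()}
decoded = [inverse[code] for code in encoded]

print(mapping)  # {'cat': 0, 'dog': 1, 'mouse': 2}
print(encoded)  # [1, 0, 2, 0]
print(decoded)  # ['dog', 'cat', 'mouse', 'cat']
```

Note that the integers carry an implied order (here mouse > dog > cat) that a model may treat as meaningful, which is worth keeping in mind when the categories have no natural ranking.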
2. One-Hot Encoding
One-hot encoding creates a new binary column for each category. This method is useful for nominal data where there is no intrinsic order among the categories.
Example:
import pandas as pd
data = pd.DataFrame({'animal': ['cat', 'dog', 'mouse']})
# dtype=int yields 0/1 columns; recent pandas versions return True/False by default
one_hot_encoded_data = pd.get_dummies(data, columns=['animal'], dtype=int)
print(one_hot_encoded_data)
# Output:
#    animal_cat  animal_dog  animal_mouse
# 0           1           0             0
# 1           0           1             0
# 2           0           0             1
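The mechanics behind the table above can be sketched without any library. This is an illustrative sketch, not how pandas implements it: the function name one_hot is hypothetical, and the column order is assumed to be the sorted category list.

```python
def one_hot(values):
    """Return the sorted category list and one 0/1 row per input value."""
    categories = sorted(set(values))
    rows = [[1 if value == category else 0 for category in categories]
            for value in values]
    return categories, rows

categories, rows = one_hot(['cat', 'dog', 'mouse', 'dog'])
print(categories)  # ['cat', 'dog', 'mouse']
for row in rows:
    print(row)
# [1, 0, 0]
# [0, 1, 0]
# [0, 0, 1]
# [0, 1, 0]
```

Each row contains exactly one 1, and the number of columns equals the number of distinct categories, so the encoded table grows with the cardinality of the feature.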
3. Ordinal Encoding
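Ordinal encoding is similar to label encoding but is intended for ordinal data, where the order of the categories is meaningful. The order is specified explicitly so that the assigned integers reflect it.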
Example:
from sklearn.preprocessing import OrdinalEncoder
data = [['low'], ['medium'], ['high']]
ordinal_encoder = OrdinalEncoder(categories=[['low', 'medium', 'high']])
encoded_data = ordinal_encoder.fit_transform(data)
print(encoded_data)
# Output:
# [[0.]
#  [1.]
#  [2.]]
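The categories argument in the example above matters: when the order is not given, OrdinalEncoder infers the categories from the data and sorts them, which for strings means alphabetical order. The following pure-Python sketch, using hypothetical variable names, shows how the two mappings differ.

```python
levels = ['low', 'medium', 'high']

# Sorting alphabetically loses the intended order.
alphabetical = {level: code for code, level in enumerate(sorted(levels))}
print(alphabetical)  # {'high': 0, 'low': 1, 'medium': 2}

# Listing the levels explicitly keeps low < medium < high.
explicit = {level: code for code, level in enumerate(levels)}
print(explicit)  # {'low': 0, 'medium': 1, 'high': 2}

ratings = ['medium', 'low', 'high', 'low']
print([explicit[rating] for rating in ratings])  # [1, 0, 2, 0]
```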
4. Binary Encoding
Binary encoding first converts categories to integers, then writes those integers in binary, with one column per bit. Because n categories need only about log2(n) columns instead of n, this can reduce the dimensionality compared to one-hot encoding.
Example:
In Python, binary encoding is available in the category_encoders package, which can be installed with:
pip install category_encoders
After that, the encoder can be used as follows:
import category_encoders as ce
import pandas as pd
data = pd.DataFrame({'animal': ['cat', 'dog', 'mouse']})
binary_encoder = ce.BinaryEncoder(cols=['animal'])
encoded_data = binary_encoder.fit_transform(data)
print(encoded_data)
# Output:
#    animal_0  animal_1
# 0         0         1
# 1         1         0
# 2         1         1
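The two steps described above (category to integer, integer to binary digits) can be sketched in plain Python. This is a rough illustration under stated assumptions, not the category_encoders implementation: the helpers binary_width and binary_encode are hypothetical, and integer codes are assumed to start at 1 in sorted category order, chosen so the result matches the output shown above. The exact column layout produced by the library may differ between versions.

```python
def binary_width(n_categories):
    """Number of bits needed to write the largest 1-based code."""
    return max(1, n_categories.bit_length())

def binary_encode(values):
    """Map each category to the binary digits of its 1-based integer code."""
    categories = sorted(set(values))
    width = binary_width(len(categories))
    return {
        category: [int(bit) for bit in format(code, f'0{width}b')]
        for code, category in enumerate(categories, start=1)
    }

table = binary_encode(['cat', 'dog', 'mouse'])
print(table)  # {'cat': [0, 1], 'dog': [1, 0], 'mouse': [1, 1]}

# Column counts for 100 categories: one-hot needs 100, binary needs 7.
print(binary_width(100))  # 7
```

The saving is small for three categories (two columns instead of three) but becomes substantial for high-cardinality features.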