How to Encode Categorical Variables with Scikit-learn

Let’s learn to transform your categorical variables into numerical variables with Scikit-Learn.

Preparation

Make sure Pandas and Scikit-Learn are installed in your environment. If not, install them via pip using the following command:

pip install pandas scikit-learn

Then, import the packages into your environment:

import pandas as pd
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder, OneHotEncoder

With the packages ready, let’s create sample data for this article.

sample_data = pd.DataFrame({
    'Size': ['Large', 'Medium', 'Large', 'Small', 'Medium', 'Small', 'Large'],
    'Color': ['Yellow', 'Red', 'Blue', 'Red', 'Yellow', 'Red', 'Blue'],
    'target': [1, 0, 1, 1, 0, 0, 1]
})
print(sample_data)

The output:

     Size   Color  target
0   Large  Yellow       1
1  Medium     Red       0
2   Large    Blue       1
3   Small     Red       1
4  Medium  Yellow       0
5   Small     Red       0
6   Large    Blue       1

Encode Categorical Variables with Scikit-Learn

Categorical encoding is the process of transforming a categorical variable into a format that a machine learning algorithm can accept. In practice, this usually means converting categories into numerical values, since many machine learning algorithms accept only numerical input.

Let’s try the simplest encoding technique, which is the label encoder.

label_encoder = LabelEncoder()
label_encoded_data = label_encoder.fit_transform(sample_data['Size'])
label_encoded_series = pd.Series(label_encoded_data, name='Size_Label_Encoded')
print(label_encoded_series)

The output:

0    0
1    1
2    0
3    2
4    1
5    2
6    0
Name: Size_Label_Encoded, dtype: int32

The label encoder is a straightforward encoding technique: it maps each category to a unique integer (assigned in alphabetical order, so Large becomes 0, Medium 1, and Small 2). However, these integers can introduce an unintended ordinal relationship, since the algorithm may treat 2 as "greater than" 0 even when no such order exists. Note also that scikit-learn intends LabelEncoder for target labels rather than input features; for features, OrdinalEncoder is the recommended equivalent.
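To see exactly which integer was assigned to which category, you can inspect the fitted encoder's classes_ attribute and recover the original labels with inverse_transform. A small sketch (the sample values here mirror the article's data):

```python
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
encoded = le.fit_transform(['Large', 'Medium', 'Small', 'Large'])

# classes_ lists the categories in the order they were assigned integers
# (alphabetical), so Large -> 0, Medium -> 1, Small -> 2
print(le.classes_)                    # ['Large' 'Medium' 'Small']

# inverse_transform maps the integers back to the original labels
print(le.inverse_transform(encoded))  # ['Large' 'Medium' 'Small' 'Large']
```

This round trip is handy for sanity-checking an encoding before feeding it to a model.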

To avoid that, we can use the one-hot encoder to transform each category into binary columns.

one_hot_encoder = OneHotEncoder(sparse_output=False)  # return a dense array instead of a sparse matrix
one_hot_encoded_data = one_hot_encoder.fit_transform(sample_data[['Color']])
one_hot_encoded_df = pd.DataFrame(one_hot_encoded_data, columns=one_hot_encoder.get_feature_names_out(['Color']))
print(one_hot_encoded_df)

The output:

   Color_Blue  Color_Red  Color_Yellow
0         0.0        0.0           1.0
1         0.0        1.0           0.0
2         1.0        0.0           0.0
3         0.0        1.0           0.0
4         0.0        0.0           1.0
5         0.0        1.0           0.0
6         1.0        0.0           0.0

The one-hot encoder is the most popular categorical encoding technique because the resulting binary columns carry no artificial order between categories. However, it can produce a large number of dimensions when a variable has many categories.

If your categorical variable contains an ordinal relationship, an ordinal encoder can preserve it during categorical encoding.

ordinal_encoder = OrdinalEncoder(categories=[['Small', 'Medium', 'Large']])
ordinal_encoded_data = ordinal_encoder.fit_transform(sample_data[['Size']])
ordinal_encoded_series = pd.Series(ordinal_encoded_data.flatten(), name='Size_Ordinal_Encoded')
print(ordinal_encoded_series)

The output:

0    2.0
1    1.0
2    2.0
3    0.0
4    1.0
5    0.0
6    2.0
Name: Size_Ordinal_Encoded, dtype: float64

The order of the list you pass to the categories parameter determines the integer assigned to each category: Small becomes 0.0, Medium 1.0, and Large 2.0, preserving the intended ranking.
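You can verify the mapping on the fitted encoder via its categories_ attribute, and then transform new data with the same ordering. A small sketch (the variable names are illustrative):

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Supply the intended order explicitly: Small < Medium < Large
enc = OrdinalEncoder(categories=[['Small', 'Medium', 'Large']])
enc.fit(pd.DataFrame({'Size': ['Small', 'Large']}))

# categories_ reflects the order we supplied, not alphabetical order
print(enc.categories_)  # [array(['Small', 'Medium', 'Large'], dtype=object)]

# New data is encoded with the same mapping: Medium->1.0, Large->2.0, Small->0.0
print(enc.transform(pd.DataFrame({'Size': ['Medium', 'Large', 'Small']})))
```

Note that 'Medium' encodes correctly even though it never appeared in the fitting data, because the full category list was supplied up front.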

Learn to choose the appropriate categorical encoder for your data, as that choice can determine whether your machine learning project performs well.
