Often in machine learning, we want to convert categorical variables into some type of numeric format that can be readily used by algorithms.
There are two common ways to convert categorical variables into numeric variables:
1. Label Encoding: Assign each categorical value an integer value based on alphabetical order.
2. One Hot Encoding: Create new variables that take on values 0 and 1 to represent the original categorical values.
For example, suppose we have the following dataset with two variables and we would like to convert the Team variable from a categorical variable into a numeric one:
The following examples show how to use both label encoding and one hot encoding to do so.
Example: Using Label Encoding
Using label encoding, we would convert each unique value in the Team column into an integer value based on alphabetical order:
In this example, we can see:
- Each “A” value has been converted to 0.
- Each “B” value has been converted to 1.
- Each “C” value has been converted to 2.
We have successfully converted the Team column from a categorical variable into a numeric variable.
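The mapping above can be sketched in a few lines of plain Python. This is a minimal illustration, not a library implementation: the `teams` list and the `label_encode` helper are hypothetical, since the original dataset is not reproduced here.

```python
# Hypothetical Team values; the original dataset is not reproduced here.
teams = ["A", "A", "B", "B", "C", "C"]

def label_encode(values):
    """Map each unique value to an integer based on sorted (alphabetical) order."""
    mapping = {value: code for code, value in enumerate(sorted(set(values)))}
    return [mapping[v] for v in values], mapping

encoded, mapping = label_encode(teams)
print(mapping)   # {'A': 0, 'B': 1, 'C': 2}
print(encoded)   # [0, 0, 1, 1, 2, 2]
```

In practice, a library routine such as scikit-learn's LabelEncoder is normally used instead of a hand-written helper.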
Example: Using One Hot Encoding
Using one hot encoding, we would convert the Team column into new variables that contain only 0 and 1 values:
When using this approach, we create one new column for each unique value in the original categorical variable.
For example, the categorical variable Team had three unique values so we created three new columns in the dataset that all contain 0 or 1 values.
Here’s how to interpret the values in the new columns:
- The value in the new Team_A column is 1 if the original value in the Team column was A. Otherwise, the value is 0.
- The value in the new Team_B column is 1 if the original value in the Team column was B. Otherwise, the value is 0.
- The value in the new Team_C column is 1 if the original value in the Team column was C. Otherwise, the value is 0.
We have successfully converted the Team column from a categorical variable into three numeric variables – sometimes referred to as “dummy” variables.
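As a rough sketch of the same idea in plain Python, the snippet below builds one 0/1 column per unique value. The `teams` list and the `one_hot_encode` helper are hypothetical and exist only for illustration.

```python
# Hypothetical Team values; the original dataset is not reproduced here.
teams = ["A", "B", "C", "A", "B"]

def one_hot_encode(values, prefix):
    """Return a dict of column name -> list of 0/1 values, one column per unique value."""
    categories = sorted(set(values))
    return {
        f"{prefix}_{cat}": [1 if v == cat else 0 for v in values]
        for cat in categories
    }

columns = one_hot_encode(teams, prefix="Team")
for name, column in columns.items():
    print(name, column)
# Team_A [1, 0, 0, 1, 0]
# Team_B [0, 1, 0, 0, 1]
# Team_C [0, 0, 1, 0, 0]
```

Libraries such as pandas (get_dummies) provide this transformation directly, so a helper like this is mainly useful for seeing what happens under the hood.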
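Note: When using these “dummy” variables in a regression model or other machine learning algorithm, be sure to avoid the dummy variable trap. This occurs when every dummy column is included alongside an intercept, which makes the columns perfectly collinear. A common fix is to drop one of the dummy columns and treat it as the baseline category.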
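The dummy variable trap is commonly understood as the redundancy that arises because the dummy columns always sum to 1 in every row. The sketch below, using hypothetical column values, shows that redundancy and one common workaround: dropping a single column as the baseline.

```python
# Hypothetical one hot encoded Team columns for five rows.
dummies = {
    "Team_A": [1, 0, 0, 1, 0],
    "Team_B": [0, 1, 0, 0, 1],
    "Team_C": [0, 0, 1, 0, 0],
}

# Every row sums to exactly 1, so any one column is determined by the others.
row_sums = [sum(row) for row in zip(*dummies.values())]
print(row_sums)  # [1, 1, 1, 1, 1]

# Drop one column (here the alphabetically first) and treat it as the baseline.
baseline = sorted(dummies)[0]
reduced = {name: col for name, col in dummies.items() if name != baseline}
print(list(reduced))  # ['Team_B', 'Team_C']

# No information is lost: the dropped column is 1 exactly when all others are 0.
recovered = [1 - sum(row) for row in zip(*reduced.values())]
print(recovered == dummies[baseline])  # True
```

With pandas, the same effect is usually achieved by passing drop_first=True to get_dummies.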
When to Use Label Encoding vs. One Hot Encoding
In most scenarios, one hot encoding is the preferred way to convert a categorical variable into a numeric variable because label encoding makes it seem that there is a ranking between values.
For example, recall that label encoding converted the Team values A, B, and C into 0, 1, and 2.
The label encoded data makes it seem like team C is somehow greater or larger than teams B and A since it has a higher numeric value.
This isn’t an issue if the original categorical variable actually is an ordinal variable with a natural ordering or ranking, but in many scenarios this isn’t the case.
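One caveat for ordinal variables: assigning integers alphabetically does not necessarily follow the natural ranking. The sketch below uses a hypothetical size variable (not part of the original dataset) to contrast alphabetical label encoding with an explicit mapping that preserves the intended order.

```python
# Hypothetical ordinal variable; not part of the original dataset.
sizes = ["small", "large", "medium", "small"]

# Alphabetical label encoding: large=0, medium=1, small=2 (natural order is lost).
alphabetical = {v: i for i, v in enumerate(sorted(set(sizes)))}

# Explicit mapping that follows the natural ranking small < medium < large.
ordinal = {"small": 0, "medium": 1, "large": 2}

print([alphabetical[s] for s in sizes])  # [2, 0, 1, 2]
print([ordinal[s] for s in sizes])       # [0, 2, 1, 0]
```

When the ranking matters, spelling out the mapping by hand (or passing the category order to the encoder being used) is the safer choice.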
However, one drawback of one hot encoding is that it requires you to make as many new variables as there are unique values in the original categorical variable.
This means that if your categorical variable has 100 unique values, you’ll have to create 100 new variables when using one hot encoding.
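To make the cost concrete, here is a small sketch that compares the number of columns each approach produces for a hypothetical variable with 100 unique values. The variable and its values are made up for illustration.

```python
# Hypothetical categorical column with 100 unique values, one row per value.
values = [f"category_{i}" for i in range(100)]

n_unique = len(set(values))
label_columns = 1            # label encoding keeps a single integer column
one_hot_columns = n_unique   # one hot encoding adds one column per unique value

print(label_columns, one_hot_columns)  # 1 100

# Total cells stored for this variable under each approach.
print(len(values) * label_columns, len(values) * one_hot_columns)  # 100 10000
```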
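Which approach you prefer depends on how many unique values the variable has and whether those values have a natural order: label encoding keeps a single column, while one hot encoding avoids implying a ranking at the cost of extra columns.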