Often in machine learning, we want to convert categorical variables into some type of numeric format that can be readily used by algorithms.

There are two common ways to convert categorical variables into numeric variables:

**1. Label Encoding:** Assign each categorical value an integer value based on alphabetical order.

**2. One Hot Encoding:** Create new variables that take on values 0 and 1 to represent the original categorical values.

For example, suppose we have the following dataset with two variables and we would like to convert the **Team** variable from a categorical variable into a numeric one:

The following examples show how to use both **label encoding** and **one hot encoding** to do so.

**Example: Using Label Encoding**

Using **label encoding**, we would convert each unique value in the **Team** column into an integer value based on alphabetical order:

In this example, we can see:

- Each “A” value has been converted to **0**.
- Each “B” value has been converted to **1**.
- Each “C” value has been converted to **2**.

We have successfully converted the **Team** column from a categorical variable into a numeric variable.
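The mapping described above can be sketched in a few lines of plain Python. The `teams` list is hypothetical sample data (only the values A, B, and C come from the example), and the names `label_map` and `team_encoded` are illustrative.

```python
# Hypothetical Team column; only the values A, B, C come from the example.
teams = ["A", "A", "B", "B", "C", "C"]

# Sort the unique values so the integer codes follow alphabetical order.
categories = sorted(set(teams))
label_map = {category: code for code, category in enumerate(categories)}

# Replace each original value with its integer code.
team_encoded = [label_map[team] for team in teams]

print(label_map)     # {'A': 0, 'B': 1, 'C': 2}
print(team_encoded)  # [0, 0, 1, 1, 2, 2]
```

In practice, library helpers such as scikit-learn's `LabelEncoder` assign codes to sorted unique values in a similar way.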

**Example: Using One Hot Encoding**

Using **one hot encoding**, we would convert the **Team** column into new variables that contain only 0 and 1 values:

When using this approach, we create one new column for each unique value in the original categorical variable.

For example, the categorical variable **Team** had **three unique values** so we created **three new columns** in the dataset that all contain 0 or 1 values.

Here’s how to interpret the values in the new columns:

- The value in the new **Team_A** column is 1 if the original value in the **Team** column was A. Otherwise, the value is 0.
- The value in the new **Team_B** column is 1 if the original value in the **Team** column was B. Otherwise, the value is 0.
- The value in the new **Team_C** column is 1 if the original value in the **Team** column was C. Otherwise, the value is 0.

We have successfully converted the **Team** column from a categorical variable into three numeric variables – sometimes referred to as “dummy” variables.
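As a minimal sketch of this step, the following plain Python builds one 0/1 column per unique value. The `teams` list is hypothetical sample data, and `one_hot` is an illustrative name. The column names follow the `Team_A`, `Team_B`, `Team_C` pattern used above.

```python
# Hypothetical Team column; only the values A, B, C come from the example.
teams = ["A", "A", "B", "B", "C", "C"]
categories = sorted(set(teams))

# One new column per unique value: 1 where the row matches, 0 otherwise.
one_hot = {
    f"Team_{category}": [1 if team == category else 0 for team in teams]
    for category in categories
}

for name, values in one_hot.items():
    print(name, values)
# Team_A [1, 1, 0, 0, 0, 0]
# Team_B [0, 0, 1, 1, 0, 0]
# Team_C [0, 0, 0, 0, 1, 1]
```

With a pandas DataFrame, `pd.get_dummies` produces columns named in the same `prefix_value` style.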

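**Note**: When using these “dummy” variables in a regression model or other machine learning algorithm, be sure to avoid the dummy variable trap. Because exactly one of the new columns equals 1 in every row, the columns always sum to 1, so including all of them alongside an intercept makes them perfectly collinear. The usual fix is to drop one of the columns and treat its category as the baseline.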
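One common way to sidestep the trap is to keep only k − 1 dummy columns for k unique values. The sketch below assumes the first category in alphabetical order is the baseline; that choice, the sample data, and the names `baseline`, `kept`, and `dummies` are all illustrative.

```python
# Hypothetical Team column; only the values A, B, C come from the example.
teams = ["A", "A", "B", "B", "C", "C"]
categories = sorted(set(teams))

# Assumption: treat the first category ("A") as the baseline and drop its column.
baseline, kept = categories[0], categories[1:]

dummies = {
    f"Team_{category}": [int(team == category) for team in teams]
    for category in kept
}

print(baseline)  # A
print(dummies)   # {'Team_B': [0, 0, 1, 1, 0, 0], 'Team_C': [0, 0, 0, 0, 1, 1]}
```

Rows where every remaining column is 0 belong to the baseline category, so no information is lost.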

**When to Use Label Encoding vs. One Hot Encoding**

In most scenarios, **one hot encoding** is the preferred way to convert a categorical variable into a numeric variable because **label encoding** makes it seem that there is a ranking between values.

For example, consider when we used label encoding to convert team into a numeric variable:

The label encoded data makes it seem like team C is somehow greater or larger than teams B and A since it has a higher numeric value.

This isn’t an issue if the original categorical variable actually is an ordinal variable with a natural ordering or ranking, but in many scenarios this isn’t the case.
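For an ordinal variable, alphabetical order may not match the natural ranking, so an explicit mapping is often safer. The sketch below uses a hypothetical `sizes` variable (not part of the Team example) to show the difference. The ranking small < medium < large is an assumption made for illustration.

```python
# Hypothetical ordinal variable, not part of the Team example.
sizes = ["small", "large", "medium", "small"]

# Alphabetical codes do not respect the natural ranking.
alphabetical_map = {value: code for code, value in enumerate(sorted(set(sizes)))}
print(alphabetical_map)  # {'large': 0, 'medium': 1, 'small': 2}

# Assumed ranking: small < medium < large.
ordinal_map = {"small": 0, "medium": 1, "large": 2}
sizes_encoded = [ordinal_map[size] for size in sizes]
print(sizes_encoded)     # [0, 2, 1, 0]
```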

However, one drawback of **one hot encoding** is that it requires you to make as many new variables as there are unique values in the original categorical variable.

This means that if your categorical variable has 100 unique values, you’ll have to create 100 new variables when using one hot encoding.
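To make the growth concrete, here is a small sketch with 100 hypothetical category names. Each row of the one hot encoding ends up with 100 columns, while label encoding keeps a single integer per row.

```python
# Hypothetical variable with 100 unique values (names are made up).
values = [f"category_{i}" for i in range(100)]
categories = sorted(set(values))

# One hot encoding: each row becomes a list with one entry per unique value.
one_hot_rows = [[int(value == category) for category in categories] for value in values]

# Label encoding: each row is a single integer code.
label_map = {category: code for code, category in enumerate(categories)}
label_codes = [label_map[value] for value in values]

print(len(one_hot_rows[0]))  # 100
print(len(label_codes))      # 100
```

Here `len(one_hot_rows[0])` counts columns per row, while `len(label_codes)` counts rows. The label encoded data still occupies one column.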

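As a rule of thumb, one hot encoding suits categorical variables that have no natural ordering and a manageable number of unique values, while label encoding suits ordinal variables or cases where creating one new column per unique value is impractical for the size of your dataset.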
