How to Scale and Normalize Data with Scikit-learn’s Preprocessing Tools

Let’s learn how to use Scikit-Learn to scale and normalize your data.

Preparation

We need Pandas and Scikit-Learn installed in the environment. If they are not installed yet, you can install them via pip with the following command:

pip install pandas scikit-learn

Then, import the packages we need:

import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler, MinMaxScaler

Once ready, let’s create sample data for the whole example.

sample_data = {'Feature 1': [10, 20, 30, 40, 50], 'Feature 2': [18, 29, 31, 47, 68]}
df = pd.DataFrame(sample_data)
print(df)

The output:

   Feature 1  Feature 2
0         10         18
1         20         29
2         30         31
3         40         47
4         50         68

Next, let's try scaling this data with Scikit-Learn.

Data Scaling with Scikit-Learn Preprocessing

Data scaling transforms the data into a specific range (for example, between 0 and 1). Min-max scaling is a linear transformation, so the shape of each feature's distribution is preserved; only the range changes.

With our sample data, we will use MinMaxScaler to perform data scaling.

# Min-Max Scaling
min_max_scaler = MinMaxScaler(feature_range=(0, 1))
df_min_max_scaled = pd.DataFrame(min_max_scaler.fit_transform(df), columns=df.columns)
print(df_min_max_scaled)

The output:

   Feature 1  Feature 2
0       0.00       0.00
1       0.25       0.22
2       0.50       0.26
3       0.75       0.58
4       1.00       1.00

Tweak the feature_range parameter to your intended range.
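
As a small illustration of changing feature_range, the sketch below rescales the same sample data to the range -1 to 1. The variable names scaler_sym and df_sym are made up for this example.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Same sample data as in the Preparation section
df = pd.DataFrame({'Feature 1': [10, 20, 30, 40, 50],
                   'Feature 2': [18, 29, 31, 47, 68]})

# Rescale every column to [-1, 1] instead of the default [0, 1]
scaler_sym = MinMaxScaler(feature_range=(-1, 1))
df_sym = pd.DataFrame(scaler_sym.fit_transform(df), columns=df.columns)
print(df_sym)
```

Each column's minimum now maps to -1 and its maximum to 1, with the other values spaced proportionally in between.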

Data scaling is useful when your features are on different scales and you want to keep the shape of each feature's distribution. It's often applied before machine learning algorithms that are sensitive to feature scale, such as distance-based methods like k-nearest neighbors.
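
To see why the distribution shape is unaffected, it can help to apply the min-max formula by hand. The following is a minimal sketch that assumes the default range of 0 to 1 and compares a manual calculation with MinMaxScaler; the names manual and sk are only for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

df = pd.DataFrame({'Feature 1': [10, 20, 30, 40, 50],
                   'Feature 2': [18, 29, 31, 47, 68]})

# Min-max formula applied column by column: (x - min) / (max - min)
manual = (df - df.min()) / (df.max() - df.min())

# The same transformation done by Scikit-Learn
sk = MinMaxScaler().fit_transform(df)
print(np.allclose(manual.to_numpy(), sk))      # True

# The transformation is linear, so a shape statistic such as
# skewness is the same before and after scaling
print(np.allclose(df.skew(), manual.skew()))   # True
```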

Data Normalization with Scikit-Learn Preprocessing

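Data normalization, used here in the sense of standardization, rescales each feature so that its mean is zero and its standard deviation is one. This shifts and rescales the values but does not by itself make non-Gaussian data follow a normal distribution.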

# Standard Scaling
scaler = StandardScaler()
df_standard_scaled = pd.DataFrame(scaler.fit_transform(df), columns=df.columns)
print(df_standard_scaled)

The output:

   Feature 1  Feature 2
0  -1.414214  -1.185711
1  -0.707107  -0.552564
2   0.000000  -0.437447
3   0.707107   0.483494
4   1.414214   1.692228
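
The values above can be reproduced by hand, which also shows a detail that is easy to miss: StandardScaler divides by the population standard deviation (ddof=0), while pandas' std() defaults to the sample standard deviation (ddof=1). The sketch below is only a check on the sample data; the names manual_z, sk_z, and pandas_default_z are made up for this example.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({'Feature 1': [10, 20, 30, 40, 50],
                   'Feature 2': [18, 29, 31, 47, 68]})

# z-score by hand, using the population standard deviation (ddof=0)
manual_z = (df - df.mean()) / df.std(ddof=0)
sk_z = StandardScaler().fit_transform(df)
print(np.allclose(manual_z.to_numpy(), sk_z))          # True

# pandas' default std() uses ddof=1, so the result differs from StandardScaler
pandas_default_z = (df - df.mean()) / df.std()
print(np.allclose(pandas_default_z.to_numpy(), sk_z))  # False
```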

Data normalization is important when your statistical technique or algorithm expects features centered at zero with unit variance, for example SVMs with an RBF kernel or linear models with L1 or L2 regularization.
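
A common pattern, sketched here under the assumption that you have separate training data and new data, is to fit the scaler once and reuse it with transform so that new rows are standardized with the training mean and standard deviation. The new_df rows below are hypothetical values invented for this illustration.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Treat the sample data as the training set
train_df = pd.DataFrame({'Feature 1': [10, 20, 30, 40, 50],
                         'Feature 2': [18, 29, 31, 47, 68]})

# Hypothetical unseen rows, invented for this example
new_df = pd.DataFrame({'Feature 1': [30, 60],
                       'Feature 2': [38.6, 20]})

scaler = StandardScaler()
scaler.fit(train_df)  # learn the mean and standard deviation from the training data only

# Apply the already-fitted scaler to the new rows
new_scaled = pd.DataFrame(scaler.transform(new_df), columns=new_df.columns)
print(new_scaled)
```

Values in the new data that fall outside the training range, such as 60 in Feature 1, simply map to larger z-scores; the scaler is not refitted.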

Knowing how and when to transform your data is an important part of a working data science project.
