How to Normalize Data in Python


Often in statistics and machine learning, we normalize variables such that the range of the values is between 0 and 1.

The most common reason to normalize variables is when we conduct some type of multivariate analysis (i.e. we want to understand the relationship between several predictor variables and a response variable) and we want each variable to contribute equally to the analysis.

When variables are measured at different scales, they often do not contribute equally to the analysis. For example, if the values of one variable range from 0 to 100,000 and the values of another variable range from 0 to 100, the variable with the larger range will be given a larger weight in the analysis.
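To make this concrete, the short sketch below compares the Euclidean distance between two observations before and after rescaling. The variable names (income, score), the two observations, and the assumed minimum and maximum values are all invented for illustration; only the ranges (0 to 100,000 and 0 to 100) come from the example above.

```python
import numpy as np

# two hypothetical observations: [income (0 to 100,000), score (0 to 100)]
a = np.array([50000.0, 10.0])
b = np.array([51000.0, 90.0])

# assumed minimum and maximum of each variable
mins = np.array([0.0, 0.0])
maxs = np.array([100000.0, 100.0])

# distance on the raw values
raw_dist = np.linalg.norm(a - b)

# rescale each variable to the 0-1 range, then measure the distance again
a_norm = (a - mins) / (maxs - mins)
b_norm = (b - mins) / (maxs - mins)
norm_dist = np.linalg.norm(a_norm - b_norm)

print(raw_dist)
print(norm_dist)
```

On the raw values the distance is roughly 1003, almost all of it from the income difference of 1,000. After rescaling it is roughly 0.80, almost all of it from the score difference.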

By normalizing the variables, we put them on a common scale so that no variable dominates the analysis simply because of its units or range.

To normalize the values to be between 0 and 1, we can use the following formula:

x_norm = (x_i - x_min) / (x_max - x_min)

where:

  • x_norm: The ith normalized value in the dataset
  • x_i: The ith value in the dataset
  • x_max: The maximum value in the dataset
  • x_min: The minimum value in the dataset
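As a quick check of the formula, here is a minimal sketch in plain Python. The function name min_max_normalize and the input values are made up for illustration, and the sketch assumes the values are not all identical (otherwise the denominator would be 0).

```python
def min_max_normalize(values):
    """Rescale a list of numbers to the 0-1 range (hypothetical helper)."""
    x_min = min(values)
    x_max = max(values)
    # apply (x - min) / (max - min) to every value
    return [(x - x_min) / (x_max - x_min) for x in values]

normalized = min_max_normalize([10, 20, 40])
print(normalized)
```

The smallest value (10) maps to 0, the largest (40) maps to 1, and 20 maps to (20 - 10) / (40 - 10), or one third.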

The following examples show how to normalize one or more variables in Python.

Example 1: Normalize a NumPy Array

The following code shows how to normalize all values in a NumPy array:

import numpy as np 

#create NumPy array
data = np.array([[13, 16, 19, 22, 23, 38, 47, 56, 58, 63, 65, 70, 71]])

#normalize all values in array
data_norm = (data - data.min()) / (data.max() - data.min())

#view normalized values
data_norm

array([[0.        , 0.05172414, 0.10344828, 0.15517241, 0.17241379,
        0.43103448, 0.5862069 , 0.74137931, 0.77586207, 0.86206897,
        0.89655172, 0.98275862, 1.        ]])

Each value in the normalized array is now between 0 and 1.
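Note that data.min() and data.max() are computed over the entire array. If each column of a two-dimensional array holds a different variable, one possible variation is to compute the minimum and maximum per column with axis=0. The sketch below uses a made-up array named data2d.

```python
import numpy as np

# hypothetical 2-D array: each column is a separate variable
data2d = np.array([[1.0, 200.0],
                   [2.0, 400.0],
                   [3.0, 800.0]])

# column-wise minimum and maximum (axis=0 works down the rows)
col_min = data2d.min(axis=0)
col_max = data2d.max(axis=0)

# broadcasting applies each column's min and max to that column
data2d_norm = (data2d - col_min) / (col_max - col_min)
print(data2d_norm)
```

Here each column is rescaled to the 0 to 1 range independently of the other.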

Example 2: Normalize All Variables in Pandas DataFrame

The following code shows how to normalize all variables in a pandas DataFrame:

import pandas as pd

#create DataFrame
df = pd.DataFrame({'points': [25, 12, 15, 14, 19, 23, 25, 29],
                   'assists': [5, 7, 7, 9, 12, 9, 9, 4],
                   'rebounds': [11, 8, 10, 6, 6, 5, 9, 12]})

#normalize values in every column
df_norm = (df - df.min()) / (df.max() - df.min())

#view normalized DataFrame
df_norm

     points  assists  rebounds
0  0.764706    0.125  0.857143
1  0.000000    0.375  0.428571
2  0.176471    0.375  0.714286
3  0.117647    0.625  0.142857
4  0.411765    1.000  0.142857
5  0.647059    0.625  0.000000
6  0.764706    0.625  0.571429
7  1.000000    0.000  1.000000

Each value in every column is now between 0 and 1.
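One edge case to keep in mind: if a column holds the same value in every row, its maximum minus its minimum is 0 and the formula returns NaN for that column. The sketch below shows one possible way to handle this using a made-up DataFrame named df_const. Setting constant columns to 0 is an assumed convention, not the only reasonable choice.

```python
import pandas as pd

# hypothetical DataFrame; 'games' has the same value in every row
df_const = pd.DataFrame({'points': [25, 12, 29],
                         'games': [10, 10, 10]})

# normalize every column; the constant column becomes NaN (0 divided by 0)
df_const_norm = (df_const - df_const.min()) / (df_const.max() - df_const.min())

# find the constant columns and set them to 0 (assumed convention)
constant_cols = df_const.columns[df_const.max() == df_const.min()]
df_const_norm[constant_cols] = 0.0

print(df_const_norm)
```

Only the constant columns are overwritten, so any missing values elsewhere in the data are left as they are.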

Example 3: Normalize Specific Variables in Pandas DataFrame

The following code shows how to normalize specific variables in a pandas DataFrame:

import pandas as pd

#create DataFrame
df = pd.DataFrame({'points': [25, 12, 15, 14, 19, 23, 25, 29],
                   'assists': [5, 7, 7, 9, 12, 9, 9, 4],
                   'rebounds': [11, 8, 10, 6, 6, 5, 9, 12]})

#define columns to normalize
x = df.iloc[:,0:2]

#normalize values in first two columns only 
df[x.columns] = (x - x.min()) / (x.max() - x.min())

#view normalized DataFrame 
df

     points  assists  rebounds
0  0.764706    0.125        11
1  0.000000    0.375         8
2  0.176471    0.375        10
3  0.117647    0.625         6
4  0.411765    1.000         6
5  0.647059    0.625         5
6  0.764706    0.625         9
7  1.000000    0.000        12

Notice that just the values in the first two columns are normalized.
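Columns can also be selected by name rather than by position, which may be easier to read when the column order is not fixed. The following is a minimal sketch using the same data; the names df2 and cols are introduced here for illustration.

```python
import pandas as pd

# same data as above, under a new name
df2 = pd.DataFrame({'points': [25, 12, 15, 14, 19, 23, 25, 29],
                    'assists': [5, 7, 7, 9, 12, 9, 9, 4],
                    'rebounds': [11, 8, 10, 6, 6, 5, 9, 12]})

# names of the columns to normalize
cols = ['points', 'assists']

# normalize only those columns; 'rebounds' is left unchanged
df2[cols] = (df2[cols] - df2[cols].min()) / (df2[cols].max() - df2[cols].min())

print(df2)
```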

