A **Box-Cox transformation** is a commonly used method for transforming non-normally distributed data into data that is closer to normally distributed.

The basic idea behind this method is to find some value for λ such that the transformed data is as close to normally distributed as possible, using the following formula:

- y(λ) = (y^{λ} – 1) / λ if λ ≠ 0
- y(λ) = log(y) if λ = 0
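To make the formula concrete, here is a minimal sketch that applies it by hand for a fixed λ and compares the result with SciPy. The helper name `boxcox_manual` and the sample values are made up for illustration; the branch is chosen by whether λ is zero, and the input is assumed to be strictly positive.

```python
import numpy as np
from scipy.stats import boxcox


def boxcox_manual(y, lam):
    """Apply the Box-Cox formula for a fixed lambda (hypothetical helper)."""
    y = np.asarray(y, dtype=float)
    if np.any(y <= 0):
        raise ValueError("Box-Cox requires strictly positive values")
    if lam == 0:
        # lambda = 0 branch: natural log
        return np.log(y)
    # lambda != 0 branch: (y^lambda - 1) / lambda
    return (y**lam - 1) / lam


# illustrative values, not from the example dataset
values = np.array([0.5, 1.0, 2.0, 4.0])

manual_half = boxcox_manual(values, 0.5)
manual_zero = boxcox_manual(values, 0.0)

# when lmbda is supplied, boxcox() returns only the transformed array
scipy_half = boxcox(values, lmbda=0.5)
scipy_zero = boxcox(values, lmbda=0.0)

print(np.allclose(manual_half, scipy_half))
print(np.allclose(manual_zero, scipy_zero))
```

Passing `lmbda` explicitly skips the search for an optimal value, which is handy when the same λ has to be reused on new data.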

We can perform a Box-Cox transformation in Python by using the scipy.stats.boxcox() function, which requires all input values to be strictly positive.

The following example shows how to use this function in practice.

**Example: Box-Cox Transformation in Python**

Suppose we generate a random set of 1,000 values that come from an exponential distribution:

    #load necessary packages
    import numpy as np
    from scipy.stats import boxcox
    import seaborn as sns

    #make this example reproducible
    np.random.seed(0)

    #generate dataset
    data = np.random.exponential(size=1000)

    #plot the distribution of data values
    #(kdeplot replaces the deprecated distplot(hist=False, kde=True))
    sns.kdeplot(data)

We can see that the distribution does not appear to be normal.

We can use the **boxcox()** function to find an optimal value of lambda that produces a more normal distribution:

    #perform Box-Cox transformation on original data
    transformed_data, best_lambda = boxcox(data)

    #plot the distribution of the transformed data values
    #(kdeplot replaces the deprecated distplot(hist=False, kde=True))
    sns.kdeplot(transformed_data)

We can see that the transformed data is much closer to a normal distribution.
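A plot is only a visual check. As a rough numeric complement, the sketch below compares the sample skewness before and after the transformation, on the assumption that a value near 0 indicates a more symmetric, more normal-looking distribution. It regenerates the same dataset so that it runs on its own.

```python
import numpy as np
from scipy.stats import boxcox, skew

# recreate the example dataset
np.random.seed(0)
data = np.random.exponential(size=1000)

# transform with the lambda that boxcox() selects
transformed_data, best_lambda = boxcox(data)

# sample skewness: 0 for a perfectly symmetric sample
skew_before = skew(data)
skew_after = skew(transformed_data)

print("skewness before:", skew_before)
print("skewness after: ", skew_after)
```

Skewness is only one aspect of normality, so a Q-Q plot or a formal test is still worth running on the transformed values.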

We can also find the exact lambda value used to perform the Box-Cox transformation:

    #display optimal lambda value
    print(best_lambda)

    0.2420131978174143

The optimal lambda was found to be roughly **0.242**.
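According to the SciPy documentation, when no lambda is supplied, boxcox() picks the value that maximizes the Box-Cox log-likelihood. The sketch below illustrates this by evaluating `scipy.stats.boxcox_llf` at the selected lambda and at two nearby values. The offsets of ±0.1 are arbitrary choices for illustration.

```python
import numpy as np
from scipy.stats import boxcox, boxcox_llf

# recreate the example dataset
np.random.seed(0)
data = np.random.exponential(size=1000)

# lambda selected by boxcox()
_, best_lambda = boxcox(data)

# evaluate the log-likelihood at the optimum and at two nearby lambdas
candidates = [best_lambda - 0.1, best_lambda, best_lambda + 0.1]
llf_values = [boxcox_llf(lam, data) for lam in candidates]

for lam, llf in zip(candidates, llf_values):
    print(f"lambda={lam:.3f}  log-likelihood={llf:.3f}")
```

The middle value should be the largest of the three, which is consistent with the selected lambda sitting at a maximum of the log-likelihood.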

Thus, each data value was transformed using the following equation:

New = (old^{0.242} – 1) / 0.242

We can confirm this by looking at the values from the original data compared to the transformed data:

    #view first five values of original dataset
    data[0:5]

    array([0.79587451, 1.25593076, 0.92322315, 0.78720115, 0.55104849])

    #view first five values of transformed dataset
    transformed_data[0:5]

    array([-0.22212062,  0.23427768, -0.07911706, -0.23247555, -0.55495228])

The first value in the original dataset was **0.79587**. Thus, we applied the following formula to transform this value:

New = (0.79587^{0.242} – 1) / 0.242 = **-0.222**

We can confirm that the first value in the transformed dataset is indeed **-0.222**.
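The same check can be run on every value at once instead of only the first. This sketch applies the formula to the whole array with the fitted lambda, compares the result with the output of boxcox(), and then uses `scipy.special.inv_boxcox` to map the transformed values back to the original scale. The variable names `manual` and `recovered` are illustrative.

```python
import numpy as np
from scipy.stats import boxcox
from scipy.special import inv_boxcox

# recreate the example dataset and transformation
np.random.seed(0)
data = np.random.exponential(size=1000)
transformed_data, best_lambda = boxcox(data)

# apply the formula by hand to every value (lambda is nonzero here)
manual = (data**best_lambda - 1) / best_lambda

# undo the transformation to recover the original values
recovered = inv_boxcox(transformed_data, best_lambda)

print(np.allclose(manual, transformed_data))
print(np.allclose(recovered, data))
```

The inverse is useful when a model is fit on the transformed scale and its predictions need to be reported in the original units.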
