How to Perform a Box-Cox Transformation in R (With Examples)


box-cox transformation is a commonly used method for transforming a non-normally distributed dataset into a more normally distributed one.

The basic idea behind this method is to find some value for λ such that the transformed data is as close to normally distributed as possible, using the following formula:

  • y(λ) = yλ – 1 / y λ if y ≠ 0
  • y(λ) = log(y) if y = 0

We can perform a box-cox transformation in R by using the boxcox() function from the MASS() library. The following example shows how to use this function in practice.

Refer to this paper from the University of Connecticut for a nice summary of the development of the Box-Cox transformation.

Example: Box-Cox Transformation in R

The following code shows how to fit a linear regression model to a dataset, then use the boxcox() function to find an optimal lambda to transform the response variable and fit a new model. 

library(MASS)

#create data
y=c(1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 6, 7, 8)
x=c(7, 7, 8, 3, 2, 4, 4, 6, 6, 7, 5, 3, 3, 5, 8)

#fit linear regression model
model <- lm(y~x)

#find optimal lambda for Box-Cox transformation 
bc <- boxcox(y ~ x)
(lambda <- bc$x[which.max(bc$y)])

[1] -0.4242424

#fit new linear regression model using the Box-Cox transformation
new_model <- lm(((y^lambda-1)/lambda) ~ x)

The optimal lambda was found to be -0.4242424. Thus, the new regression model replaced the original response variable y with the variable y = (y-0.4242424 – 1) / -0.4242424.

The following code shows how to create two Q-Q plots in R to visualize the differences in residuals between the two regression models:

#define plotting area
op <- par(pty = "s", mfrow = c(1, 2))

#Q-Q plot for original model
qqnorm(model$residuals)
qqline(model$residuals)

#Q-Q plot for Box-Cox transformed model
qqnorm(new_model$residuals)
qqline(new_model$residuals)

#display both Q-Q plots
par(op)

Box-cox transformed Q-Q plot in R

As a rule of thumb, if the data points fall along a straight diagonal line in a Q-Q plot then the dataset likely follows a normal distribution.

Notice how the box-cox transformed model produces a Q-Q plot with a much straighter line than the original regression model.

This is an indication that the residuals of the box-cox transformed model are much more normally distributed, which satisfies one of the assumptions of linear regression.

Additional Resources

How to Transform Data in R (Log, Square Root, Cube Root)
How to Create & Interpret a Q-Q Plot in R
How to Perform a Shapiro-Wilk Test for Normality in R

Leave a Reply

Your email address will not be published. Required fields are marked *