How to Perform a Box-Cox Transformation in R (With Examples)


A Box-Cox transformation is a commonly used method for transforming a non-normally distributed dataset into a more normally distributed one.

The basic idea behind this method is to find some value for λ such that the transformed data is as close to normally distributed as possible, using the following formula:

  • y(λ) = (y^λ - 1) / λ  if λ ≠ 0
  • y(λ) = log(y)  if λ = 0

In both cases the original values must be positive (y > 0).
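
To make the formula concrete, here is a minimal sketch of the transformation written as a plain R function. The name box_cox() is a hypothetical helper chosen for illustration and is not part of the MASS package; the sketch assumes the piecewise case depends on λ and that every value of y is positive.

```r
# Box-Cox transform of a positive vector y for a given lambda.
# box_cox() is a hypothetical helper name, not a function from MASS.
box_cox <- function(y, lambda) {
  stopifnot(all(y > 0))              # the transformation requires y > 0
  if (abs(lambda) < 1e-8) {
    log(y)                           # limiting case as lambda approaches 0
  } else {
    (y^lambda - 1) / lambda
  }
}

y <- c(1, 2, 3, 6, 8)

box_cox(y, 1)     # lambda = 1 only shifts the data (y - 1)
box_cox(y, 0.5)   # square-root-like transformation
box_cox(y, 0)     # natural log
```

Using log(y) at λ = 0 is not an arbitrary choice: as λ gets closer to 0, (y^λ - 1) / λ approaches log(y), so the two cases join smoothly.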

We can perform a Box-Cox transformation in R by using the boxcox() function from the MASS package. The following example shows how to use this function in practice.

Refer to this paper from the University of Connecticut for a nice summary of the development of the Box-Cox transformation.

Example: Box-Cox Transformation in R

The following code shows how to fit a linear regression model to a dataset, then use the boxcox() function to find an optimal lambda to transform the response variable and fit a new model. 

library(MASS)

#create data
y <- c(1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 6, 7, 8)
x <- c(7, 7, 8, 3, 2, 4, 4, 6, 6, 7, 5, 3, 3, 5, 8)

#fit linear regression model
model <- lm(y ~ x)

#find optimal lambda for Box-Cox transformation 
bc <- boxcox(y ~ x)
(lambda <- bc$x[which.max(bc$y)])

[1] -0.4242424

#fit new linear regression model using the Box-Cox transformation
new_model <- lm(((y^lambda-1)/lambda) ~ x)

The optimal lambda was found to be -0.4242424. Thus, the new regression model replaces the original response variable y with the transformed variable (y^(-0.4242424) - 1) / (-0.4242424).
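
The single best lambda is only a point estimate, so it can help to see how wide the range of plausible values is. The sketch below is one way to do that, assuming the usual likelihood-based rule: keep every lambda whose log-likelihood is within qchisq(0.95, 1) / 2 of the maximum. The variable names cutoff and ci are illustrative, and no output is shown because the exact interval depends on the grid of lambda values.

```r
library(MASS)

y <- c(1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 6, 7, 8)
x <- c(7, 7, 8, 3, 2, 4, 4, 6, 6, 7, 5, 3, 3, 5, 8)

# plotit = FALSE suppresses the plot; interp = TRUE keeps the
# interpolated grid of lambda values that the plotted version uses
bc <- boxcox(y ~ x, plotit = FALSE, interp = TRUE)
lambda <- bc$x[which.max(bc$y)]

# lambda values whose log-likelihood lies within qchisq(0.95, 1) / 2
# of the maximum form an approximate 95% confidence interval
cutoff <- max(bc$y) - qchisq(0.95, df = 1) / 2
ci <- range(bc$x[bc$y > cutoff])

lambda
ci
```

If the interval contains a convenient value such as 0 (log) or -0.5 (reciprocal square root), some analysts prefer that value over the exact optimum because the transformed variable is easier to interpret. Note that the interval is limited to the lambda grid searched by boxcox(), which defaults to the range -2 to 2.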

The following code shows how to create two Q-Q plots in R to visualize the differences in residuals between the two regression models:

#define plotting area
op <- par(pty = "s", mfrow = c(1, 2))

#Q-Q plot for original model
qqnorm(model$residuals)
qqline(model$residuals)

#Q-Q plot for Box-Cox transformed model
qqnorm(new_model$residuals)
qqline(new_model$residuals)

#restore the original plotting parameters
par(op)

[Figure: Q-Q plots of the residuals for the original model and the Box-Cox transformed model]

As a rule of thumb, if the data points fall along a straight diagonal line in a Q-Q plot then the dataset likely follows a normal distribution.

Notice how the Box-Cox transformed model produces a Q-Q plot in which the points follow the line much more closely than in the original regression model.

This is an indication that the residuals of the Box-Cox transformed model are closer to normally distributed, and normality of the residuals is one of the assumptions of linear regression.
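
A formal test can complement the visual check. The following sketch runs a Shapiro-Wilk test on the residuals of both models with shapiro.test() from base R. It reuses the lambda value reported in the example rather than recomputing it, and the names sw_original and sw_transformed are illustrative.

```r
y <- c(1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 6, 7, 8)
x <- c(7, 7, 8, 3, 2, 4, 4, 6, 6, 7, 5, 3, 3, 5, 8)

# lambda value reported in the example (assumed here, not recomputed)
lambda <- -0.4242424

model <- lm(y ~ x)
new_model <- lm(((y^lambda - 1) / lambda) ~ x)

# Shapiro-Wilk test on the residuals of each model
sw_original <- shapiro.test(residuals(model))
sw_transformed <- shapiro.test(residuals(new_model))

sw_original
sw_transformed
```

A small p-value suggests that the residuals depart from normality. With only 15 observations the test has limited power, so it is best read alongside the Q-Q plots rather than in place of them.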


