# How to Perform a Box-Cox Transformation in R (With Examples)

A Box-Cox transformation is a commonly used method for transforming a non-normally distributed dataset into a more normally distributed one.

The basic idea behind this method is to find some value for λ such that the transformed data is as close to normally distributed as possible, using the following formula:

• y(λ) = (y^λ - 1) / λ  if λ ≠ 0
• y(λ) = log(y)  if λ = 0
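
The two cases of the formula can be written as one small R function. This is only a sketch: `box_cox()` is a hypothetical helper name, not a function from MASS, and it assumes the input values are strictly positive, which the transformation requires.

```r
# Box-Cox transform of a positive numeric vector y for a given lambda.
# box_cox() is an illustrative helper, not part of any package.
box_cox <- function(y, lambda) {
  stopifnot(is.numeric(y), all(y > 0))
  if (lambda == 0) {
    log(y)                    # limiting case as lambda approaches 0
  } else {
    (y^lambda - 1) / lambda   # general case
  }
}

box_cox(c(1, 2, 4), lambda = 0)    # identical to log(c(1, 2, 4))
box_cox(c(1, 2, 4), lambda = 0.5)  # equals 2 * (sqrt(y) - 1)
box_cox(c(1, 2, 4), lambda = 1)    # equals y - 1
```

Note that λ = 1 only shifts the data by one, so a value of λ near 1 suggests that no transformation is needed.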

We can perform a Box-Cox transformation in R by using the boxcox() function from the MASS package. The following example shows how to use this function in practice.

Refer to this paper from the University of Connecticut for a nice summary of the development of the Box-Cox transformation.

### Example: Box-Cox Transformation in R

The following code shows how to fit a linear regression model to a dataset, then use the boxcox() function to find an optimal lambda to transform the response variable and fit a new model.

```
library(MASS)

#create data
y <- c(1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 6, 7, 8)
x <- c(7, 7, 8, 3, 2, 4, 4, 6, 6, 7, 5, 3, 3, 5, 8)

#fit linear regression model
model <- lm(y~x)

#find optimal lambda for Box-Cox transformation
bc <- boxcox(y ~ x)
(lambda <- bc$x[which.max(bc$y)])

#[1] -0.4242424

#fit new linear regression model using the Box-Cox transformation
new_model <- lm(((y^lambda-1)/lambda) ~ x)
```

The optimal lambda was found to be -0.4242424. Thus, the new regression model replaced the original response variable y with the transformed variable (y^(-0.4242424) - 1) / (-0.4242424).
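
The value -0.4242424 is a point on the grid that boxcox() searches, not an exact maximizer: by default the function evaluates the profile log-likelihood at lambda = seq(-2, 2, 1/10) and, when plotting, spline-interpolates those values to 100 points. The sketch below passes a denser grid through the `lambda` argument. The step of 0.01 and the names `bc_fine` and `lambda_fine` are illustrative assumptions, not part of the example above.

```r
library(MASS)

# same data as in the example
y <- c(1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 6, 7, 8)
x <- c(7, 7, 8, 3, 2, 4, 4, 6, 6, 7, 5, 3, 3, 5, 8)

# evaluate the profile log-likelihood on a finer grid, without drawing the plot
bc_fine <- boxcox(y ~ x, lambda = seq(-2, 2, by = 0.01), plotit = FALSE)

# grid value of lambda with the highest log-likelihood
lambda_fine <- bc_fine$x[which.max(bc_fine$y)]
lambda_fine
```

The result should land close to the value found above. Because the log-likelihood is usually fairly flat near its peak, small differences in lambda rarely change the fitted model much.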

The following code shows how to create two Q-Q plots in R to visualize the differences in residuals between the two regression models:

```
#define plotting area
op <- par(pty = "s", mfrow = c(1, 2))

#Q-Q plot for original model
qqnorm(model$residuals)
qqline(model$residuals)

#Q-Q plot for Box-Cox transformed model
qqnorm(new_model$residuals)
qqline(new_model$residuals)

#restore the original plotting parameters
par(op)
```

As a rule of thumb, if the points in a Q-Q plot fall along a straight diagonal line, then the data likely follow a normal distribution.

Notice how the Box-Cox transformed model produces a Q-Q plot whose points follow the straight line much more closely than those of the original regression model.

This indicates that the residuals of the Box-Cox transformed model are closer to normally distributed, which is more consistent with the normality-of-residuals assumption of linear regression.