How to Transform Data in R (Log, Square Root, Cube Root)


Many statistical tests make the assumption that the residuals of a response variable are normally distributed.

However, often the residuals are not normally distributed. One way to address this issue is to transform the response variable using one of the three transformations:

1. Log Transformation: Transform the response variable from y to log(y).

2. Square Root Transformation: Transform the response variable from y to y.

3. Cube Root Transformation: Transform the response variable from y to y1/3.

By performing these transformations, the response variable typically becomes closer to normally distributed. The following examples show how to perform these transformations in R.

Log Transformation in R

The following code shows how to perform a log transformation on a response variable:

#create data frame
df <- data.frame(y=c(1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 6, 7, 8),
                 x1=c(7, 7, 8, 3, 2, 4, 4, 6, 6, 7, 5, 3, 3, 5, 8),
                 x2=c(3, 3, 6, 6, 8, 9, 9, 8, 8, 7, 4, 3, 3, 2, 7))

#perform log transformation
log_y <- log10(df$y)

The following code shows how to create histograms to view the distribution of before and after performing a log transformation:

#create histogram for original distribution
hist(df$y, col='steelblue', main='Original')

#create histogram for log-transformed distribution 
hist(log_y, col='coral2', main='Log Transformed')

Log transformation of data in R

Notice how the log-transformed distribution is much more normal compared to the original distribution. It’s still not a perfect “bell shape” but it’s closer to a normal distribution that the original distribution.

In fact, if we perform a Shapiro-Wilk test on each distribution we’ll find that the original distribution fails the normality assumption while the log-transformed distribution does not (at α = .05):

#perform Shapiro-Wilk Test on original data
shapiro.test(df$y)

	Shapiro-Wilk normality test

data:  df$y
W = 0.77225, p-value = 0.001655

#perform Shapiro-Wilk Test on log-transformed data 
shapiro.test(log_y)

	Shapiro-Wilk normality test

data:  log_y
W = 0.89089, p-value = 0.06917

Square Root Transformation in R

The following code shows how to perform a square root transformation on a response variable:

#create data frame
df <- data.frame(y=c(1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 6, 7, 8),
                 x1=c(7, 7, 8, 3, 2, 4, 4, 6, 6, 7, 5, 3, 3, 5, 8),
                 x2=c(3, 3, 6, 6, 8, 9, 9, 8, 8, 7, 4, 3, 3, 2, 7))

#perform square root transformation
sqrt_y <- sqrt(df$y)

The following code shows how to create histograms to view the distribution of before and after performing a square root transformation:

#create histogram for original distribution
hist(df$y, col='steelblue', main='Original')

#create histogram for square root-transformed distribution 
hist(sqrt_y, col='coral2', main='Square Root Transformed')

Square root transformation in R

Notice how the square root-transformed distribution is much more normally distributed compared to the original distribution.

Cube Root Transformation in R

The following code shows how to perform a cube root transformation on a response variable:

#create data frame
df <- data.frame(y=c(1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 6, 7, 8),
                 x1=c(7, 7, 8, 3, 2, 4, 4, 6, 6, 7, 5, 3, 3, 5, 8),
                 x2=c(3, 3, 6, 6, 8, 9, 9, 8, 8, 7, 4, 3, 3, 2, 7))

#perform square root transformation
cube_y <- df$y^(1/3)

The following code shows how to create histograms to view the distribution of before and after performing a square root transformation:

#create histogram for original distribution
hist(df$y, col='steelblue', main='Original')

#create histogram for square root-transformed distribution 
hist(cube_y, col='coral2', main='Cube Root Transformed')

Cube root transformation in R

Depending on your dataset, one of these transformations may produce a new dataset that is more normally distributed than the others.

Leave a Reply

Your email address will not be published.