How to Check Linear Regression Assumptions in R

Linear regression is a statistical method we can use to understand the relationship between two variables, x and y.

When interpreting the results of a regression model, we must first make sure that four assumptions are met:

1. Linear relationship: There exists a linear relationship between the independent variable, x, and the dependent variable, y.

2. Independence: The residuals are independent. In particular, there is no correlation between consecutive residuals in time series data.

3. Homoscedasticity: The residuals have constant variance at every level of x.

4. Normality: The residuals of the model are normally distributed.

If one or more of these assumptions are violated, then the results of the regression model could be unreliable.

In this tutorial, we explain how to check each of these assumptions in R, using the built-in mtcars dataset as an example, with mpg as the response variable and disp as the predictor variable.

Assumption 1: Linear Relationship

The first assumption of linear regression is that there is a linear relationship between the independent variable, x, and the dependent variable, y.

The easiest way to detect if this assumption is met is to create a scatter plot of x vs. y.

We can use the following syntax to do so:

#create scatter plot of disp vs. mpg
plot(mtcars$disp, mtcars$mpg)

This produces the following scatter plot:

[Scatter plot of mtcars$disp vs. mtcars$mpg, used to check the linearity assumption]

Since the points in the scatter plot fall roughly along a straight line, the linear relationship assumption is met.
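As a quick supplementary check (not part of the plot above), we can also compute the correlation coefficient between the two variables; a value close to -1 or 1 is consistent with a strong linear relationship:

#compute the correlation between disp and mpg
cor(mtcars$disp, mtcars$mpg)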

Assumption 2: Independence

The next assumption of linear regression is that the residuals are independent.

The simplest way to check this assumption is to look at a residual time series plot (a plot of residuals vs. time) along with a plot of the residual autocorrelations.

Ideally, most of the residual autocorrelations should fall within the 95% confidence bands around zero, which are located at about ±2/√n, where n is the sample size.
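As a minimal sketch, we can view the residual autocorrelations in base R (fitting the same model used later in this tutorial) like so:

#fit regression model
model <- lm(mpg ~ disp, data=mtcars)

#plot the autocorrelations of the residuals with 95% confidence bands
acf(resid(model))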

You can also use the Durbin-Watson test to check this assumption.
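For example, here is a minimal sketch of the Durbin-Watson test, assuming the lmtest package is installed:

#requires the lmtest package (install.packages("lmtest") if needed)
library(lmtest)

#perform Durbin-Watson test on the fitted model
dwtest(model)

A test statistic near 2 indicates no autocorrelation in the residuals, while values well below 2 suggest positive autocorrelation.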

Assumption 3: Homoscedasticity

The next assumption of linear regression is that the residuals have constant variance at every level of x.

When this is not the case, the residuals are said to suffer from heteroscedasticity.

The simplest way to detect heteroscedasticity is by creating a fitted value vs. residual plot. 

You can use the following syntax to do so in R:

#fit regression model
model <- lm(mpg ~ disp, data=mtcars)

#create fitted values vs residuals plot
plot(model, 1)

This produces the following plot:

[Plot of residuals vs. fitted values for the model]

The x-axis shows the fitted values and the y-axis shows the residuals.

We can see that the spread of the residuals does seem to increase slightly at higher fitted values, but not enough for us to be concerned that heteroscedasticity is present.
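If we want a formal test to pair with the visual check, one option is the Breusch-Pagan test; a minimal sketch, assuming the lmtest package is installed:

#requires the lmtest package (install.packages("lmtest") if needed)
library(lmtest)

#perform Breusch-Pagan test for heteroscedasticity
bptest(model)

A large p-value means we fail to reject the null hypothesis of constant variance.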

Assumption 4: Normality

The next assumption of linear regression is that the residuals are normally distributed. 

The most common way to check this assumption is to create a Q-Q plot.

If the points on the plot roughly form a straight diagonal line, then the normality assumption is met.

We can use the following syntax to create this plot in R:

#fit regression model
model <- lm(mpg ~ disp, data=mtcars)

#create Q-Q plot
plot(model, 2)

This produces the following Q-Q plot:

[Normal Q-Q plot of the model residuals]

From the plot we can see that the points fall roughly along a straight line, with minor deviations along each tail.

Based on this plot, we would assume that the normality assumption is not violated.
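A formal check we can pair with the Q-Q plot is the Shapiro-Wilk test on the residuals, which is available in base R:

#perform Shapiro-Wilk test on the model residuals
shapiro.test(resid(model))

A large p-value means we fail to reject the null hypothesis that the residuals are normally distributed.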

Additional Resources

The following tutorials explain how to perform other common tasks in R:

How to Perform Spline Regression in R
How to Perform OLS Regression in R
How to Perform Power Regression in R
