A Simple Guide to Multiple Linear Regression in R

A guide to multiple linear regression in R

This guide walks through an example of how to conduct multiple linear regression in R, including:

  • Examining the data before fitting the model
  • Fitting the model
  • Checking the assumptions of the model
  • Interpreting the output of the model
  • Assessing the goodness of fit of the model
  • Using the model to make predictions

Let’s jump in!

Setup

For this example we will use the built-in R dataset mtcars, which contains information about various attributes for 32 different cars:

#view first six lines of mtcars
head(mtcars)

#                   mpg cyl disp  hp drat    wt  qsec vs am gear carb
#Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
#Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
#Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
#Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
#Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
#Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

In this example we will build a multiple linear regression model that uses mpg as the response variable and disphp, and drat as the predictor variables.

#create new data frame that contains only the variables we would like to use to
#fit the regression model
data <- mtcars[ , c("mpg", "disp", "hp", "drat")]

#view first six rows of new data frame
head(data)

#                   mpg disp  hp drat
#Mazda RX4         21.0  160 110 3.90
#Mazda RX4 Wag     21.0  160 110 3.90
#Datsun 710        22.8  108  93 3.85
#Hornet 4 Drive    21.4  258 110 3.08
#Hornet Sportabout 18.7  360 175 3.15
#Valiant           18.1  225 105 2.76

Examining the Data

Before we fit the model, we can examine the data to gain a better understanding of it and also visually assess whether or not multiple linear regression could be a good model to fit to this data.

In particular, we need to check if the predictor variables have a linear association with the response variable, which would indicate that a multiple linear regression model may be suitable.

To do so, we can use the pairs() function to create a scatterplot of every possible pair of variables:

pairs(data, pch = 18, col = "steelblue")

Pairs plot in R

From this pairs plot we can see the following:

  • mpg and disp appear to have a strong negative linear correlation
  • mpg and hp appear to have a strong positive linear correlation
  • mpg and drat appear to have a modest negative linear correlation

Note that we could also use the ggpairs() function from the GGally library to create a similar plot that contains the actual linear correlation coefficients for each pair of variables:

#install and load the GGally library
install.packages("GGally")
library(GGally)

#generate the pairs plot
ggpairs(data)

Pairs plot in R

Each of the predictor variables appears to have a noticeable linear correlation with the response variable mpg, so we’ll proceed to fit the linear regression model to the data.

Fitting the Model

The basic syntax to fit a multiple linear regression model in R is as follows:

lm(response_variable ~ predictor_variable1 + predictor_variable2 + ..., data = data)

Using our data, we can fit the model using the following code:

model <- lm(mpg ~ disp + hp + drat, data = data)

Checking Assumptions of the Model

Before we proceed to check the output of the model, we need to first check that the model assumptions are met. Namely, we need to verify the following:

1. The distribution of model residuals should be approximately normal.

We can check if this assumption is met by creating a simple histogram of residuals:

hist(residuals(model), col = "steelblue")

Histogram of residuals in R

Although the distribution is slightly right skewed, it isn’t abnormal enough to cause any major concerns.

2. The variance of the residuals should be consistent for all observations.

This preferred condition is known as homoskedasticity. Violation of this assumption is known as heteroskedasticity. To check if this assumption is met we can create a fitted value vs. residual plot:

#create fitted value vs residual plot
plot(fitted(model), residuals(model))

#add horizontal line at 0
abline(h = 0, lty = 2)

Fitted values vs residuals plot in R

Ideally we would like the residuals to be equally scattered at every fitted value. We can see from the plot that the scatter tends to become a bit larger for larger fitted values, but this pattern isn’t extreme enough to cause too much concern.

Interpreting the Output of the Model

Once we’ve verified that the model assumptions are sufficiently met, we can look at the output of the model using the summary() function:

summary(model)

#Call:
#lm(formula = mpg ~ disp + hp + drat, data = data)
#
#Residuals:
#    Min      1Q  Median      3Q     Max 
#-5.1225 -1.8454 -0.4456  1.1342  6.4958 
#
#Coefficients:
#             Estimate Std. Error t value Pr(>|t|)   
#(Intercept) 19.344293   6.370882   3.036  0.00513 **
#disp        -0.019232   0.009371  -2.052  0.04960 * 
#hp          -0.031229   0.013345  -2.340  0.02663 * 
#drat         2.714975   1.487366   1.825  0.07863 . 
#---
#Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#
#Residual standard error: 3.008 on 28 degrees of freedom
#Multiple R-squared:  0.775,	Adjusted R-squared:  0.7509 
#F-statistic: 32.15 on 3 and 28 DF,  p-value: 3.28e-09

From the output we can see the following:

  • The overall F-statistic of the model is 32.15 and the corresponding p-value is 3.28e-09. This indicates that the overall model is statistically significant. In other words, the regression model as a whole is useful.
  • disp is statistically significant at the 0.10 significance level. In particular, the coefficient from the model output tells is that a one unit increase in disp is associated with a -0.019 unit decrease, on average, in mpg, assuming hp and drat are held constant.
  • hp is statistically significant at the 0.10 significance level. In particular, the coefficient from the model output tells is that a one unit increase in hp is associated with a -0.031 unit decrease, on average, in mpg, assuming disp and drat are held constant.
  • drat is statistically significant at the 0.10 significance level. In particular, the coefficient from the model output tells is that a one unit increase in drat is associated with a 2.715 unit increase, on average, in mpg, assuming disp and hp are held constant.

Assessing the Goodness of Fit of the Model

To assess how “good” the regression model fits the data, we can look at a couple different metrics:

1. Multiple R-Squared

This  measures the strength of the linear relationship between the predictor variables and the response variable. A multiple R-squared of 1 indicates a perfect linear relationship while a multiple R-squared of 0 indicates no linear relationship whatsoever.

Multiple R is also the square root of R-squared, which is the proportion of the variance in the response variable that can be explained by the predictor variables. In this example, the multiple R-squared is 0.775. Thus, the R-squared is 0.7752 = 0.601. This indicates that 60.1% of the variance in mpg can be explained by the predictors in the model.

Related: What is a Good R-squared Value?

2. Residual Standard Error

This measures the average distance that the observed values fall from the regression line. In this example, the observed values fall an average of 3.008 units from the regression line.

Related: Understanding the Standard Error of the Regression

Using the Model to Make Predictions

From the output of the model we know that the fitted multiple linear regression equation is as follows:

mpghat = -19.343 – 0.019*disp – 0.031*hp + 2.715*drat

We can use this equation to make predictions about what mpg will be for new observations. For example, we can find the predicted value of mpg for a car that has the following attributes:

  • disp = 220
  • hp = 150
  • drat  = 3
#define the coefficients from the model output
intercept <- coef(summary(model))["(Intercept)", "Estimate"]
disp <- coef(summary(model))["disp", "Estimate"]
hp <- coef(summary(model))["hp", "Estimate"]
drat <- coef(summary(model))["drat", "Estimate"]

#use the model coefficients to predict the value for mpg
intercept + disp*220 + hp*150 + drat*3

#[1] 18.57373

For a car with disp = 220,  hp = 150, and drat = 3, the model predicts that the car would have a mpg of 18.57373.

Further Reading:
How to Read and Interpret a Regression Table
A Simple Guide to Understanding the F-Test of Overall Significance in Regression
A Guide to Multicollinearity & VIF in Regression

Leave a Reply

Your email address will not be published. Required fields are marked *