How to Calculate Residuals in Regression Analysis

This tutorial explains how to calculate residuals in regression analysis.

A Brief Overview of Simple Linear Regression

Simple linear regression is a statistical method you can use to understand the relationship between two variables, x and y. One variable, x, is known as the predictor variable. The other variable, y, is known as the response variable.

For example, suppose we have the following dataset with the weight and height of seven individuals:

Table: weight (lbs.) and height (inches) of seven individuals

Let weight be the predictor variable and let height be the response variable.

If we graph these two variables using a scatterplot, with weight on the x-axis and height on the y-axis, here’s what it would look like:

Scatterplot example

From the scatterplot we can clearly see that as weight increases, height tends to increase as well. To actually quantify this relationship between weight and height, however, we need to use linear regression.

Using linear regression, we can find the line that best “fits” our data:

Trend line on scatterplot in Excel

The formula for this line of best fit is written as:

ŷ = b0 + b1x

where ŷ is the predicted value of the response variable, b0 is the y-intercept, b1 is the regression coefficient (the slope of the line), and x is the value of the predictor variable.
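For reference, the standard least-squares estimates of the two coefficients can be written as follows. The symbols x̄ and ȳ (the sample means of the predictor and the response) and the index i are notation introduced here for illustration and are not used elsewhere in this tutorial:

```latex
b_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2},
\qquad
b_0 = \bar{y} - b_1 \bar{x}
```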

In this example, the line of best fit is:

height = 32.783 + 0.2001*(weight)
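As a minimal sketch, the fitted line can be written as a small Python function. The coefficients are copied from the equation above; the function name predict_height and the 150 lb. input are hypothetical, chosen only for illustration.

```python
# Coefficients copied from the line of best fit in this tutorial.
INTERCEPT = 32.783  # b0, in inches
SLOPE = 0.2001      # b1, in inches per pound


def predict_height(weight_lbs: float) -> float:
    """Return the predicted height (inches) for a given weight (lbs.)."""
    return INTERCEPT + SLOPE * weight_lbs


# Predicted height for a hypothetical individual weighing 150 lbs.
print(round(predict_height(150), 3))
```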

How to Calculate Residuals

Notice that the data points in our scatterplot don’t always fall exactly on the line of best fit:

Trend line on scatterplot in Excel

The vertical difference between a data point and the line is called the residual. For each data point, we can calculate that point’s residual by taking the difference between its actual value and the predicted value from the line of best fit: residual = actual value – predicted value.
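In symbols, using the notation of the line of best fit, the definition can be written as below. The symbol e_i for the residual of the i-th data point is an assumption made here for illustration:

```latex
e_i = y_i - \hat{y}_i = y_i - (b_0 + b_1 x_i)
```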

Example 1: Calculating a Residual

For example, recall the weight and height of the seven individuals in our dataset:

Table: weight (lbs.) and height (inches) of seven individuals

The first individual has a weight of 140 lbs. and a height of 60 inches.

To find out the predicted height for this individual, we can plug their weight into the line of best fit equation:

height = 32.783 + 0.2001*(weight)

Thus, the predicted height of this individual is:

height = 32.783 + 0.2001*(140)

height = 60.797 inches

Thus, the residual for this data point is 60 – 60.797 = -0.797 inches. The negative sign means this individual is slightly shorter than the line predicts.

Example 2: Calculating a Residual

We can use the exact same process we used above to calculate the residual for each data point. For example, let’s calculate the residual for the second individual in our dataset:

Table: weight (lbs.) and height (inches) of seven individuals

The second individual has a weight of 155 lbs. and a height of 62 inches.

To find out the predicted height for this individual, we can plug their weight into the line of best fit equation:

height = 32.783 + 0.2001*(weight)

Thus, the predicted height of this individual is:

height = 32.783 + 0.2001*(155)

height = 63.7985 inches

Thus, the residual for this data point is 62 – 63.7985 = -1.7985 inches, so this individual also falls below the line of best fit.
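The two worked examples can be checked with a short Python sketch. It uses only the two individuals whose values are stated in the text and the coefficients of the fitted line; the function and variable names are hypothetical. The printed residuals should match the hand calculations up to floating-point rounding.

```python
# Coefficients copied from the line of best fit in this tutorial.
INTERCEPT = 32.783  # b0, in inches
SLOPE = 0.2001      # b1, in inches per pound


def residual(weight_lbs: float, height_in: float) -> float:
    """Actual height minus the height predicted by the fitted line."""
    predicted = INTERCEPT + SLOPE * weight_lbs
    return height_in - predicted


# (weight in lbs., height in inches) for the first two individuals.
individuals = [(140, 60), (155, 62)]

residuals = [residual(w, h) for w, h in individuals]

for (w, h), r in zip(individuals, residuals):
    print(f"weight={w}, height={h}, residual={r:.4f}")
```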

Calculating All Residuals

Using the same method as the previous two examples, we can calculate the residuals for every data point:

Notice that some of the residuals are positive and some are negative. If we add up all of the residuals, they sum to zero (apart from small differences caused by rounding the coefficients). This is because linear regression finds the line that minimizes the sum of squared residuals, and for a line with an intercept that minimum is reached only when the positive and negative residuals cancel out. As a result, the line passes through the middle of the data, with some of the data points lying above the line and some lying below it.
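The zero-sum property can be illustrated with a small Python sketch. The five data points below are made up for illustration and are not the dataset from this tutorial; fit_line is a hypothetical helper that applies the usual closed-form least-squares estimates.

```python
def fit_line(x, y):
    """Return (b0, b1) for the least-squares line y = b0 + b1 * x."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    b1 = sxy / sxx
    b0 = y_mean - b1 * x_mean
    return b0, b1


# Made-up illustrative data (NOT the tutorial's dataset).
weights = [120, 135, 150, 165, 180]  # lbs.
heights = [58, 61, 62, 66, 68]       # inches

b0, b1 = fit_line(weights, heights)
residuals = [h - (b0 + b1 * w) for w, h in zip(weights, heights)]

# Some residuals are positive and some negative, but they cancel out.
print([round(r, 4) for r in residuals])
print(abs(sum(residuals)) < 1e-9)
```

Because the line is refitted to the data with full precision, the sum is zero up to floating-point error; with coefficients rounded to a few decimals, as in the examples above, the sum would only be approximately zero.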

Visualizing Residuals

Recall that a residual is simply the signed vertical distance between the actual data value and the value predicted by the regression line of best fit. Here’s what those distances look like visually on a scatterplot:

Notice that some of the residuals are larger than others. Also, some of the residuals are positive and some are negative as we mentioned earlier.

Creating a Residual Plot

A main reason for calculating residuals is to see how well the regression line fits the data. Residuals that are large in absolute value indicate that the regression line is a poor fit for the data, i.e. the actual data points do not fall close to the regression line. Residuals that are small in absolute value indicate that the regression line fits the data better, i.e. the actual data points fall close to the regression line.

One useful way to visualize all of the residuals at once is a residual plot. A residual plot displays the predicted values on the x-axis against the residuals on the y-axis for a regression model. This type of plot is often used to assess whether or not a linear regression model is appropriate for a given dataset and to check for heteroscedasticity (non-constant variance) of the residuals.
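As a rough sketch of what goes into a residual plot, the Python code below computes the (predicted value, residual) pairs for a fitted line. It is an alternative to the Excel approach, not the method used in this tutorial. The data is made up for illustration, the helper names are hypothetical, and the optional plotting function assumes the third-party matplotlib library is installed.

```python
def fit_line(x, y):
    """Return (b0, b1) for the least-squares line y = b0 + b1 * x."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    b1 = sxy / sxx
    return y_mean - b1 * x_mean, b1


def residual_plot_points(x, y):
    """Return the predicted values and residuals for the fitted line."""
    b0, b1 = fit_line(x, y)
    fitted = [b0 + b1 * xi for xi in x]
    residuals = [yi - fi for yi, fi in zip(y, fitted)]
    return fitted, residuals


def show_residual_plot(fitted, residuals):
    """Draw the residual plot (assumes matplotlib is installed)."""
    import matplotlib.pyplot as plt  # imported lazily; third-party

    plt.scatter(fitted, residuals)
    plt.axhline(0, linestyle="--")  # reference line at residual = 0
    plt.xlabel("Predicted value")
    plt.ylabel("Residual")
    plt.title("Residual plot")
    plt.show()


# Made-up illustrative data (NOT the tutorial's dataset).
weights = [120, 135, 150, 165, 180]  # lbs.
heights = [58, 61, 62, 66, 68]       # inches

fitted, residuals = residual_plot_points(weights, heights)
# show_residual_plot(fitted, residuals)  # uncomment to draw the plot
```

If the points scatter randomly around the horizontal zero line with roughly constant spread, a linear model is usually considered reasonable; a curved pattern or a funnel shape suggests otherwise.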

Check out this tutorial to find out how to create a residual plot for a simple linear regression model in Excel.
