How to Create a Histogram of Residuals in R


One of the main assumptions of linear regression is that the residuals are normally distributed.

One way to visually check this assumption is to create a histogram of the residuals and observe whether the distribution follows the “bell shape” of the normal distribution.

This tutorial provides a step-by-step example of how to create a histogram of residuals for a regression model in R.

Step 1: Create the Data

First, let’s create some fake data to work with:

#make this example reproducible
set.seed(0)

#create data: two predictors and one response, 100 observations each
x1 <- rnorm(100, mean=2, sd=1)
x2 <- rnorm(100, mean=4, sd=3)
y  <- rnorm(100, mean=2, sd=3)
data <- data.frame(x1, x2, y)

#view first six rows of data
head(data)

        x1        x2          y
1 3.262954 6.3455776 -1.1371530
2 1.673767 1.6696701 -0.6886338
3 3.329799 2.1520303  5.8081615
4 3.272429 4.1397409  3.7815228
5 2.414641 0.6088427  4.3269030
6 0.460050 5.7301563  6.6721111

Step 2: Fit the Regression Model

Next, we’ll fit a multiple linear regression model to the data:

#fit multiple linear regression model
model <- lm(y ~ x1 + x2, data=data)
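Before plotting, it can help to look at the residuals directly. The following is a minimal sketch that rebuilds the data and model from the steps above and extracts the residuals with resid(); the variable name res is only an illustrative choice.

```r
#recreate the data and model from the steps above
set.seed(0)
x1 <- rnorm(100, 2, 1)
x2 <- rnorm(100, 4, 3)
y  <- rnorm(100, 2, 3)
data <- data.frame(x1, x2, y)
model <- lm(y ~ x1 + x2, data=data)

#extract the residuals (returns the same values as model$residuals)
res <- resid(model)

#there is one residual per observation
length(res)

#numeric summary; with an intercept in the model the mean is essentially zero
summary(res)
```

Because the model includes an intercept, the residuals always average to zero, so the histogram should be centered near zero.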

Step 3: Create a Histogram of Residuals

Lastly, we’ll use the ggplot2 visualization package to create a histogram of the residuals from the model:

#load ggplot2
library(ggplot2)

#create histogram of residuals
ggplot(data = data, aes(x = model$residuals)) +
    geom_histogram(fill = 'steelblue', color = 'black') +
    labs(title = 'Histogram of Residuals', x = 'Residuals', y = 'Frequency')

[Figure: Histogram of residuals in R]
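A common alternative is to store the residuals in their own data frame so that aes() refers to a column by name instead of reaching into the model object. The sketch below assumes the same data and model as above; resid_df and its residual column are hypothetical names chosen for illustration.

```r
#load ggplot2
library(ggplot2)

#recreate the data and model from the steps above
set.seed(0)
x1 <- rnorm(100, 2, 1)
x2 <- rnorm(100, 4, 3)
y  <- rnorm(100, 2, 3)
data <- data.frame(x1, x2, y)
model <- lm(y ~ x1 + x2, data=data)

#hypothetical helper data frame with one column of residuals
resid_df <- data.frame(residual = resid(model))

#same histogram as before, but mapped to a named column
p <- ggplot(data = resid_df, aes(x = residual)) +
    geom_histogram(bins = 30, fill = 'steelblue', color = 'black') +
    labs(title = 'Histogram of Residuals', x = 'Residuals', y = 'Frequency')

#display the plot
print(p)
```

This should produce the same picture, and it keeps the plot tied to a single data frame, which is convenient if more columns (such as fitted values) are added later.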

Note that we can also specify the number of bins to place the residuals in by using the bins argument of geom_histogram(). If we don’t specify it, ggplot2 uses 30 bins by default.

The fewer the bins, the wider the bars will be in the histogram. For example, we could specify 20 bins:

#create histogram of residuals
ggplot(data = data, aes(x = model$residuals)) +
    geom_histogram(bins = 20, fill = 'steelblue', color = 'black') +
    labs(title = 'Histogram of Residuals', x = 'Residuals', y = 'Frequency')

[Figure: Residual histogram in R]

Or we could specify 10 bins:

#create histogram of residuals
ggplot(data = data, aes(x = model$residuals)) +
    geom_histogram(bins = 10, fill = 'steelblue', color = 'black') +
    labs(title = 'Histogram of Residuals', x = 'Residuals', y = 'Frequency')
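To make the relationship between the number of bins and the bar width concrete, here is a rough sketch in base R. It assumes the range of the residuals is split into equal-width bins; bin_width() is a hypothetical helper, and ggplot2 positions its bins slightly differently, so the widths it draws will only be approximately these values.

```r
#recreate the data and model from the steps above
set.seed(0)
x1 <- rnorm(100, 2, 1)
x2 <- rnorm(100, 4, 3)
y  <- rnorm(100, 2, 3)
data <- data.frame(x1, x2, y)
model <- lm(y ~ x1 + x2, data=data)
res <- resid(model)

#hypothetical helper: approximate bar width for a given number of bins
bin_width <- function(x, bins) diff(range(x)) / bins

#fewer bins means wider bars
bin_width(res, 30)
bin_width(res, 20)
bin_width(res, 10)
```

Halving the number of bins doubles the approximate width of each bar, which is why the histogram with 10 bins looks coarser than the one with 20.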

No matter how many bins we specify, we can see that the residuals are roughly normally distributed.
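One way to make this visual comparison easier is to draw a normal curve on top of the histogram. The sketch below is one possible approach, not the only one: it rescales the histogram to densities with after_stat(density), which assumes ggplot2 3.3.0 or later, and overlays a normal curve with the same mean and standard deviation as the residuals. The name resid_df is again a hypothetical helper.

```r
#load ggplot2
library(ggplot2)

#recreate the data and model from the steps above
set.seed(0)
x1 <- rnorm(100, 2, 1)
x2 <- rnorm(100, 4, 3)
y  <- rnorm(100, 2, 3)
data <- data.frame(x1, x2, y)
model <- lm(y ~ x1 + x2, data=data)

#hypothetical helper data frame with one column of residuals
resid_df <- data.frame(residual = resid(model))

#histogram on the density scale with a matching normal curve on top
p <- ggplot(data = resid_df, aes(x = residual)) +
    geom_histogram(aes(y = after_stat(density)), bins = 20,
                   fill = 'steelblue', color = 'black') +
    stat_function(fun = dnorm,
                  args = list(mean = mean(resid_df$residual),
                              sd = sd(resid_df$residual)),
                  color = 'red') +
    labs(title = 'Histogram of Residuals', x = 'Residuals', y = 'Density')

#display the plot
print(p)
```

If the bars track the red curve reasonably well, that supports the impression that the residuals are roughly normal.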

We could also perform a formal statistical test, such as the Shapiro-Wilk, Kolmogorov-Smirnov, or Jarque-Bera test, to check for normality.
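As a minimal sketch, two of these tests can be run on the residuals with functions from base R. The Jarque-Bera test is not part of base R and requires an add-on package, so it is not shown here. The variable names res, sw, and ks are illustrative.

```r
#recreate the data and model from the steps above
set.seed(0)
x1 <- rnorm(100, 2, 1)
x2 <- rnorm(100, 4, 3)
y  <- rnorm(100, 2, 3)
data <- data.frame(x1, x2, y)
model <- lm(y ~ x1 + x2, data=data)
res <- resid(model)

#Shapiro-Wilk test (null hypothesis: the residuals are normally distributed)
sw <- shapiro.test(res)
sw

#Kolmogorov-Smirnov test against a normal distribution
#the mean and sd are estimated from the residuals, so the p-value is only approximate
ks <- ks.test(res, "pnorm", mean = mean(res), sd = sd(res))
ks
```

In both tests a small p-value would be evidence against normality, while a large p-value means the test found no evidence of a departure from it.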

However, keep in mind that these tests become very sensitive when the sample size is large: they often reject normality because of small departures that have little practical importance.
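The effect of sample size can be explored with a small simulation. The sketch below is an illustration on made-up data, not on the residuals from this tutorial: it draws a small and a large sample from a t distribution with 10 degrees of freedom, which is bell-shaped but has slightly heavier tails than the normal distribution, and runs the Shapiro-Wilk test on each. The seed, sample sizes, and degrees of freedom are arbitrary assumptions.

```r
#make this illustration reproducible
set.seed(1)

#two samples from the same slightly non-normal distribution
small_sample <- rt(50, df = 10)
large_sample <- rt(5000, df = 10)  #5000 is the largest n shapiro.test() accepts

#Shapiro-Wilk p-value for each sample
p_small <- shapiro.test(small_sample)$p.value
p_large <- shapiro.test(large_sample)$p.value

#compare the two p-values
p_small
p_large
```

Both samples come from the same distribution, so any difference between the two p-values reflects the sample size rather than a difference in shape.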

For this reason, it’s often easier to assess normality by creating a histogram of the residuals.
