A linear regression model can be useful for two things:

**(1)** Quantifying the relationship between one or more predictor variables and a response variable.

**(2)** Using the model to predict future values.

With regard to **(2)**, when we use a regression model to predict future values, we are often interested in both a single predicted value (a point estimate) and an *interval* that contains a range of likely values. This interval is known as a **prediction interval**.

For example, suppose we fit a simple linear regression model using *hours studied* as a predictor variable and *exam score* as the response variable. Using this model, we might predict that a student who studies for 6 hours will receive an exam score of **91**.

However, because there is uncertainty around this prediction, we might create a prediction interval that says there is a 95% chance that a student who studies for 6 hours will receive an exam score between **85** and **97**. This range of values is known as a 95% prediction interval and it’s often more useful to us than just knowing the exact predicted value.

**How to Create a Prediction Interval in R**

To illustrate how to create a prediction interval in R, we will use the built-in *mtcars* dataset, which contains information about characteristics of several different cars:

    #view first six rows of mtcars
    head(mtcars)

    #                   mpg cyl disp  hp drat    wt  qsec vs am gear carb
    #Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
    #Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
    #Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
    #Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
    #Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
    #Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

First, we’ll fit a simple linear regression model using *disp* as the predictor variable and *mpg* as the response variable.

    #fit simple linear regression model
    model <- lm(mpg ~ disp, data = mtcars)

    #view summary of fitted model
    summary(model)

    #Call:
    #lm(formula = mpg ~ disp, data = mtcars)
    #
    #Residuals:
    #    Min      1Q  Median      3Q     Max
    #-4.8922 -2.2022 -0.9631  1.6272  7.2305
    #
    #Coefficients:
    #             Estimate Std. Error t value Pr(>|t|)
    #(Intercept) 29.599855   1.229720  24.070  < 2e-16 ***
    #disp        -0.041215   0.004712  -8.747 9.38e-10 ***
    #---
    #Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    #
    #Residual standard error: 3.251 on 30 degrees of freedom
    #Multiple R-squared:  0.7183, Adjusted R-squared:  0.709
    #F-statistic: 76.51 on 1 and 30 DF,  p-value: 9.38e-10

Then, we’ll use the fitted regression model to predict the value of *mpg* based on three new values for *disp*.

    #create data frame with three new values for disp
    new_disp <- data.frame(disp = c(150, 200, 250))

    #use the fitted model to predict the value for mpg based on the three new values for disp
    predict(model, newdata = new_disp)

    #       1        2        3
    #23.41759 21.35683 19.29607

The way to interpret these values is as follows:

- For a new car with a *disp* of 150, we predict that it will have an *mpg* of **23.41759**.
- For a new car with a *disp* of 200, we predict that it will have an *mpg* of **21.35683**.
- For a new car with a *disp* of 250, we predict that it will have an *mpg* of **19.29607**.
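To see where these predicted values come from, the following sketch applies the fitted regression equation by hand and compares the result with `predict()`. The names `b0`, `b1`, and `manual_pred` are introduced here for illustration only, and the model is refit so the snippet runs on its own.

```r
#refit the model and recreate the new data so this snippet is self-contained
model <- lm(mpg ~ disp, data = mtcars)
new_disp <- data.frame(disp = c(150, 200, 250))

#extract the estimated intercept and slope
b0 <- coef(model)[["(Intercept)"]]
b1 <- coef(model)[["disp"]]

#apply the regression equation: mpg = b0 + b1 * disp
manual_pred <- b0 + b1 * new_disp$disp
manual_pred

#check that the manual values agree with predict() (unname() drops the 1, 2, 3 labels)
all.equal(manual_pred, unname(predict(model, newdata = new_disp)))
```

The comparison should return `TRUE`, since `predict()` evaluates the same equation for a simple linear model.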

Next, we’ll use the fitted regression model to make prediction intervals around these predicted values:

    #create prediction intervals around the predicted values
    predict(model, newdata = new_disp, interval = "predict")

    #       fit      lwr      upr
    #1 23.41759 16.62968 30.20549
    #2 21.35683 14.60704 28.10662
    #3 19.29607 12.55021 26.04194

The way to interpret these values is as follows:

- The 95% prediction interval for the *mpg* of a car with a *disp* of 150 runs from **16.62968** to **30.20549**.
- The 95% prediction interval for the *mpg* of a car with a *disp* of 200 runs from **14.60704** to **28.10662**.
- The 95% prediction interval for the *mpg* of a car with a *disp* of 250 runs from **12.55021** to **26.04194**.
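As a rough check on what `predict()` is doing, the sketch below rebuilds the 95% prediction intervals from the standard formula for simple linear regression: fitted value ± t critical value × s × sqrt(1 + 1/n + (x0 − mean(x))² / Σ(x − mean(x))²), where s is the residual standard error. The variable names (`x0`, `se_pred`, `manual_pi`) are illustrative assumptions, and `"prediction"` is the full spelling of the `interval` option.

```r
#refit the model so this snippet is self-contained
model <- lm(mpg ~ disp, data = mtcars)
x0 <- c(150, 200, 250)              #new disp values

x <- mtcars$disp
n <- length(x)
s <- summary(model)$sigma           #residual standard error
t_crit <- qt(0.975, df = df.residual(model))  #two-sided 95% critical value

#point predictions at the new disp values
fit <- predict(model, newdata = data.frame(disp = x0))

#standard error of a new observation at each x0
se_pred <- s * sqrt(1 + 1/n + (x0 - mean(x))^2 / sum((x - mean(x))^2))

#assemble the intervals in the same layout that predict() uses
manual_pi <- cbind(fit = fit, lwr = fit - t_crit * se_pred, upr = fit + t_crit * se_pred)
manual_pi

#compare with the built-in result
builtin_pi <- predict(model, newdata = data.frame(disp = x0), interval = "prediction")
all.equal(manual_pi, builtin_pi, check.attributes = FALSE)
```

The leading 1 inside the square root is what separates a prediction interval from a confidence interval for the mean: it accounts for the scatter of an individual observation around the regression line.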

By default, R uses a 95% prediction interval. However, we can change this to whatever we’d like using the **level** argument of `predict()`. For example, the following code illustrates how to create 99% prediction intervals:

    #create 99% prediction intervals around the predicted values
    predict(model, newdata = new_disp, interval = "predict", level = 0.99)

    #       fit      lwr      upr
    #1 23.41759 14.27742 32.55775
    #2 21.35683 12.26799 30.44567
    #3 19.29607 10.21252 28.37963

Note that the 99% prediction intervals are wider than the 95% prediction intervals. This makes sense because a wider interval is more likely to contain the actual value of a new observation, so a higher confidence level requires a wider interval.
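To make the comparison concrete, this sketch computes the width of each interval at both levels. It also checks one consequence of the interval formula: because only the t critical value changes with `level`, the ratio of the widths should equal the ratio of the critical values. The names `width_95`, `width_99`, and `t_ratio` are illustrative.

```r
#refit the model and recreate the new data so this snippet is self-contained
model <- lm(mpg ~ disp, data = mtcars)
new_disp <- data.frame(disp = c(150, 200, 250))

#prediction intervals at the two confidence levels
pi_95 <- predict(model, newdata = new_disp, interval = "prediction", level = 0.95)
pi_99 <- predict(model, newdata = new_disp, interval = "prediction", level = 0.99)

#width of each interval (upper bound minus lower bound)
width_95 <- unname(pi_95[, "upr"] - pi_95[, "lwr"])
width_99 <- unname(pi_99[, "upr"] - pi_99[, "lwr"])
data.frame(disp = new_disp$disp, width_95, width_99)

#ratio of the t critical values for 99% vs. 95% with the model's residual df
t_ratio <- qt(0.995, df = df.residual(model)) / qt(0.975, df = df.residual(model))

#the width ratio matches the critical-value ratio at every disp value
all.equal(width_99 / width_95, rep(t_ratio, 3))
```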

**How to Visualize a Prediction Interval in R**

The following code illustrates how to create a chart with the following features:

- A scatterplot of the data points for *disp* and *mpg*
- A blue line for the fitted regression line
- Gray confidence bands
- Red prediction bands

    #define dataset
    data <- mtcars[ , c("mpg", "disp")]

    #create simple linear regression model
    model <- lm(mpg ~ disp, data = mtcars)

    #use model to create prediction intervals
    predictions <- predict(model, interval = "predict")

    #create dataset that contains original data along with prediction intervals
    all_data <- cbind(data, predictions)

    #load ggplot2 library
    library(ggplot2)

    #create plot
    ggplot(all_data, aes(x = disp, y = mpg)) + #define x and y axis variables
      geom_point() + #add scatterplot points
      stat_smooth(method = lm) + #confidence bands
      geom_line(aes(y = lwr), col = "coral2", linetype = "dashed") + #lwr pred interval
      geom_line(aes(y = upr), col = "coral2", linetype = "dashed") #upr pred interval

**When to Use a Confidence Interval vs. a Prediction Interval**

A **prediction interval** captures the uncertainty around a single new observation. A **confidence interval** captures the uncertainty around the mean response at a given value of the predictor. Because an individual observation also varies around that mean, a prediction interval will always be wider than a confidence interval for the same predictor value and confidence level.

You should use a prediction interval when you are interested in specific individual predictions, because a confidence interval will produce too narrow a range of values, resulting in a greater chance that the interval will not contain the true value.
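The difference can be seen directly by requesting both interval types from `predict()` for the same new values. This is a minimal sketch using the same *mtcars* model as above; the names `conf_int`, `pred_int`, and `comparison` are illustrative.

```r
#refit the model and recreate the new data so this snippet is self-contained
model <- lm(mpg ~ disp, data = mtcars)
new_disp <- data.frame(disp = c(150, 200, 250))

#95% confidence intervals for the mean mpg at each disp value
conf_int <- predict(model, newdata = new_disp, interval = "confidence")

#95% prediction intervals for an individual car at each disp value
pred_int <- predict(model, newdata = new_disp, interval = "prediction")

#both are centered on the same fitted values, but the widths differ
comparison <- data.frame(
  disp       = new_disp$disp,
  fit        = unname(conf_int[, "fit"]),
  conf_width = unname(conf_int[, "upr"] - conf_int[, "lwr"]),
  pred_width = unname(pred_int[, "upr"] - pred_int[, "lwr"])
)
comparison
```

In the resulting table, `pred_width` should exceed `conf_width` in every row, which is why a confidence interval understates the uncertainty for an individual car.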
