The whole point of regression analysis is to understand the relationship between one or more predictor variables and a response variable.

For example, we may be interested in the relationship between *number of hours studied* (predictor variable) and *final exam score* (response variable) for a certain college course.

Or we may be interested in the relationship between *age* (predictor variable) and *height* (response variable) for a certain species of plant.

If we only have one predictor variable and one response variable, we can conduct a simple linear regression. Before we create a regression model, it often makes sense to create a scatterplot of the data to see if there is actually a linear relationship between the predictor and response variable. If the relationship is linear, then it makes sense to fit a linear regression model to the dataset.

For example, suppose we have the following dataset that shows the *number of hours studied* and *final exam score* for a class of 50 students:

```r
#set seed to make this example reproducible
set.seed(1)

#create data frame with two variables: study_hours and final_score
data <- data.frame(study_hours = runif(50, 5, 15), final_score = 50)
data$final_score = data$final_score + data$study_hours*runif(50, 2, 3)

#view first six rows of data
head(data)

#  study_hours final_score
#1    7.655087    68.96639
#2    8.721239    74.95329
#3   10.728534    76.15721
#4   14.082078    81.61141
#5    7.016819    64.52958
#6   13.983897    79.35872

#plot data with study_hours on x-axis and final_score on y-axis
plot(data$study_hours, data$final_score) #scatterplot
abline(lm(final_score ~ study_hours, data = data), col = 'red') #regression line
```

We can see from the scatterplot that there appears to be a fairly strong positive linear relationship between study hours and final exam score: the more hours a student studies, the higher score they tend to receive.

The simple linear regression line in the plot above appears to fit this dataset quite well, which makes sense considering the true underlying relationship between *study hours* and *final score* is linear.

To see just how well a linear regression model fits this data, we can view the output of the model:

```r
#fit linear regression model
model <- lm(final_score ~ study_hours, data = data)

#view model output
summary(model)

#Call:
#lm(formula = final_score ~ study_hours, data = data)
#
#Residuals:
#    Min      1Q  Median      3Q     Max 
#-5.8436 -2.1007 -0.3953  2.5287  4.6089 
#
#Coefficients:
#            Estimate Std. Error t value Pr(>|t|)    
#(Intercept)  49.5840     1.5566   31.85   <2e-16 ***
#study_hours   2.5471     0.1459   17.46   <2e-16 ***
#---
#Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#
#Residual standard error: 2.78 on 48 degrees of freedom
#Multiple R-squared:  0.864, Adjusted R-squared:  0.8612 
#F-statistic: 304.9 on 1 and 48 DF,  p-value: < 2.2e-16
```

We can see that the value for **Multiple R-squared** is 0.864. This number represents the proportion of the variance in the response variable that can be explained by the predictor variable. Thus, 86.4% of the variance in the final exam scores can be explained by the number of hours studied.

In addition, the value for the **residual standard error** is 2.78. This number represents the average distance that the observed values fall from the regression line. Thus, in this example the observed values fall an average of just 2.78 units from the regression line.

Since the multiple R-squared is quite high and the residual standard error is quite low, we can conclude that this simple linear regression model fits the data well.
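As a sanity check, both of these numbers can be recomputed by hand from the fitted model. The following is a minimal sketch that regenerates the same simulated data (using the same seed as above) so it runs on its own:

```r
#regenerate the same simulated data as above
set.seed(1)
data <- data.frame(study_hours = runif(50, 5, 15), final_score = 50)
data$final_score <- data$final_score + data$study_hours * runif(50, 2, 3)

#fit the simple linear regression model
model <- lm(final_score ~ study_hours, data = data)

#R-squared: 1 - SSE/SST
sse <- sum(residuals(model)^2)
sst <- sum((data$final_score - mean(data$final_score))^2)
r_squared <- 1 - sse / sst

#residual standard error: sqrt(SSE / residual df), where df = n - 2 here
rse <- sqrt(sse / df.residual(model))

r_squared  #matches summary(model)$r.squared
rse        #matches summary(model)$sigma
```

These hand-computed values agree exactly with the **Multiple R-squared** and **Residual standard error** reported by `summary()`.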

**Related: What is a Good R-squared Value?**

**Nonlinear Relationships**

However, not all predictor and response variables have a linear relationship with one another. For example, suppose the relationship between *age* and *height* for a certain species of plant is nonlinear.

When the plant is young, it grows very slowly. Then, as it gets older, it experiences a burst of growth. Then, as it continues to age, it starts to grow very slowly once again.

The following dataset shows this type of relationship between *age* and *height* for a certain species of plant:

```r
#set seed to make this example reproducible
set.seed(1)

#create data frame with two variables: age (days) and height (cm)
data <- data.frame(age = round(runif(100, 1, 30), 0), height = 5)
data$height = data$height + ((data$age)^3)*runif(100, 2, 3)

#view first six rows of data
head(data)

#  age     height
#1   9  1940.2937
#2  12  4071.3249
#3  18 13245.1572
#4  27 58910.0004
#5   7   908.2882
#6  27 43567.5757

#plot data with age on x-axis and height on y-axis
plot(data$age, data$height) #scatterplot
abline(lm(height ~ age, data = data), col = 'red') #regression line
```

Clearly, the relationship between age and height is not linear, and we can see that the linear regression line fits the data very poorly.

To see just how poorly this model fits the data, we can view the model output:

```r
#fit linear regression model
model <- lm(height ~ age, data = data)

#view model output
summary(model)

#Call:
#lm(formula = height ~ age, data = data)
#
#Residuals:
#   Min     1Q Median     3Q    Max 
# -9678  -5639  -1689   3489  25296 
#
#Coefficients:
#             Estimate Std. Error t value Pr(>|t|)    
#(Intercept) -17012.52    1688.95  -10.07   <2e-16 ***
#age           2156.03      94.96   22.70   <2e-16 ***
#---
#Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#
#Residual standard error: 7318 on 98 degrees of freedom
#Multiple R-squared:  0.8403, Adjusted R-squared:  0.8386 
#F-statistic: 515.5 on 1 and 98 DF,  p-value: < 2.2e-16
```

We can see that the value for **Multiple R-squared** is 0.8403. This means that 84.03% of the variance in the plant height can be explained by the plant age.

Although the multiple R-squared seems quite high, it can be deceiving. To get a better understanding of the fit of the model, we should look at the value for the **residual standard error**, which is 7318. This tells us that the observed values fall an average of 7,318 units from the regression line. That is quite high, so if we were to use this model to make predictions, they likely wouldn’t be very accurate.

**Related: Understanding the Standard Error of the Regression**

In addition, we can visually see that the regression line does not fit the data well in the plot and that the true underlying relationship between age and height is not exactly linear.

A simple plot of the residuals vs. fitted values of the model also reveals that there is a clear pattern in the residuals, which is another indication that the model does not fit the data well:

```r
plot(fitted(model), residuals(model))
```

Thus, since the true relationship between age and height is nonlinear, we can try to fit a polynomial regression model to the data, which will tend to fit the curves of the data much better than a linear model.

**Introducing Polynomial Regression**

Instead of using a straight line to describe the relationship between the predictor and response variable, polynomial regression allows us to fit a curve to the data.

In this example in particular, recall that we used the following line to generate the height values:

```r
data$height = data$height + ((data$age)^3)*runif(100, 2, 3)
```

Essentially we raised the *age* variable to **the power of three**, then added a bit of “noise” to the data by multiplying by a random number between 2 and 3.

This means a **third-order** polynomial regression model is likely to fit the data better than a linear regression model.

There are two ways to fit a polynomial regression in R:

```r
poly_model <- lm(height ~ poly(age, 3), data = data)
```

or

```r
poly_model <- lm(height ~ age + I(age^2) + I(age^3), data = data)
```

However, **age**, **I(age^2)**, and **I(age^3)** will be highly correlated, and correlated predictors cause multicollinearity problems in regression. By contrast, **poly()** avoids this problem by producing orthogonal polynomials, so we’ll use that approach instead.
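A quick way to see the difference is to compare the correlations of the raw polynomial terms with those of the orthogonal terms produced by `poly()`. This is a minimal sketch that regenerates the same simulated data as above:

```r
#regenerate the same simulated data as above
set.seed(1)
data <- data.frame(age = round(runif(100, 1, 30), 0), height = 5)
data$height <- data$height + ((data$age)^3) * runif(100, 2, 3)

#raw polynomial terms: age, age^2, age^3
raw_terms <- cbind(data$age, data$age^2, data$age^3)

#orthogonal polynomial terms of degree 3
orth_terms <- poly(data$age, 3)

cor(raw_terms)   #off-diagonal correlations are close to 1
cor(orth_terms)  #off-diagonal correlations are numerically zero
```

The raw terms are almost perfectly correlated with one another, while the orthogonal columns from `poly()` are uncorrelated by construction.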

```r
#fit polynomial regression model
poly_model <- lm(height ~ poly(age, 3), data = data)

#view model output
summary(poly_model)

#Call:
#lm(formula = height ~ poly(age, 3), data = data)
#
#Residuals:
#     Min       1Q   Median       3Q      Max 
#-11596.4   -666.4    134.9    625.9   9944.6 
#
#Coefficients:
#              Estimate Std. Error t value Pr(>|t|)    
#(Intercept)    17548.6      319.7  54.889  < 2e-16 ***
#poly(age, 3)1 166152.7     3197.1  51.970  < 2e-16 ***
#poly(age, 3)2  64684.8     3197.1  20.232  < 2e-16 ***
#poly(age, 3)3   9091.1     3197.1   2.844  0.00545 ** 
#---
#Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#
#Residual standard error: 3197 on 96 degrees of freedom
#Multiple R-squared:  0.9701, Adjusted R-squared:  0.9692 
#F-statistic:  1039 on 3 and 96 DF,  p-value: < 2.2e-16
```

We can see that the value for **Multiple R-squared** is 0.9701, which is much higher than our linear model. We can also see that the value for **Residual standard error** is 3,197, which is much less than the value of 7,318 from our linear model.

Our third-order polynomial regression model has provided a substantial improvement in fit over the linear model.
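To visualize the improvement, we can overlay the fitted polynomial curve on the scatterplot. This is a sketch that regenerates the same simulated data; `age_grid` is just a helper grid of ages used to draw a smooth curve:

```r
#regenerate the same simulated data as above
set.seed(1)
data <- data.frame(age = round(runif(100, 1, 30), 0), height = 5)
data$height <- data$height + ((data$age)^3) * runif(100, 2, 3)

#fit the third-order polynomial regression model
poly_model <- lm(height ~ poly(age, 3), data = data)

#scatterplot with the fitted curve overlaid
plot(data$age, data$height)
age_grid <- seq(min(data$age), max(data$age), length.out = 200)
preds <- predict(poly_model, newdata = data.frame(age = age_grid))
lines(age_grid, preds, col = 'red')
```

Unlike the straight regression line from earlier, the fitted curve follows the slow-fast-slow growth pattern in the data.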

**Cautions on Overfitting**

When using polynomial regression, it’s important to avoid overfitting the data. In general, higher-order polynomials will always produce tighter fits and higher R-squared values than lower-order polynomials.

Unfortunately, after a certain point this can cause the model to fit the noise (or “randomness”) of the data, rather than the true underlying relationship between the predictor and the response variable. This can cause problems when using the model to make predictions and the results may be misleading.

The key with polynomial regression is to make an improvement over a simple linear model while also avoiding the pitfalls that come with overfitting.