A Simple Guide to Polynomial Regression in R


The whole point of regression analysis is to understand the relationship between one or more predictor variables and a response variable. 

For example, we may be interested in the relationship between number of hours studied (predictor variable) and final exam score (response variable) for a certain college course. 

Or we may be interested in the relationship between age (predictor variable) and height (response variable) for a certain species of plant.

If we only have one predictor variable and one response variable, we can conduct a simple linear regression. Before we create a regression model, it often makes sense to create a scatterplot of the data to see whether there is actually a linear relationship between the predictor and the response variable. If the relationship is linear, then fitting a linear regression to the dataset is a reasonable choice. 

For example, suppose we have the following dataset that shows the number of hours studied and final exam score for a class of 50 students:

#set seed to make this example reproducible
set.seed(1)

#create data frame with two variables: study_hours and final_score
data <- data.frame(study_hours = runif(50, 5, 15), final_score = 50)
#make final_score increase with study_hours, plus some random noise
data$final_score = data$final_score + data$study_hours*runif(50, 2, 3)

#view first six rows of data
head(data)

#  study_hours final_score
#1    7.655087    68.96639
#2    8.721239    74.95329
#3   10.728534    76.15721
#4   14.082078    81.61141
#5    7.016819    64.52958
#6   13.983897    79.35872

#plot data with study_hours on x-axis and final_score on y-axis
plot(data$study_hours, data$final_score) #scatterplot
abline(lm(final_score ~ study_hours, data = data), col = 'red') #regression line

Scatterplot with trendline in R

We can see from the scatterplot that there appears to be a fairly strong positive linear relationship between study hours and final exam score: the more hours a student studies, the higher score they tend to receive. 

The simple linear regression line in the plot above appears to fit this dataset quite well, which makes sense considering the true underlying relationship between study hours and final score is linear.

To see just how well a linear regression model fits this data, we can view the output of the model:

#fit linear regression model
model <- lm(final_score ~ study_hours, data = data)

#view model output
summary(model)

#Call:
#lm(formula = final_score ~ study_hours, data = data)
#
#Residuals:
#    Min      1Q  Median      3Q     Max 
#-5.8436 -2.1007 -0.3953  2.5287  4.6089 
#
#Coefficients:
#            Estimate Std. Error t value Pr(>|t|)    
#(Intercept)  49.5840     1.5566   31.85   <2e-16 ***
#study_hours   2.5471     0.1459   17.46   <2e-16 ***
#---
#Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#
#Residual standard error: 2.78 on 48 degrees of freedom
#Multiple R-squared:  0.864,	Adjusted R-squared:  0.8612 
#F-statistic: 304.9 on 1 and 48 DF,  p-value: < 2.2e-16

We can see that the value for Multiple R-squared is 0.864. This number represents the proportion of the variance in the response variable that can be explained by the predictor variable. Thus, 86.4% of the variance in the final exam scores can be explained by the number of hours studied.

In addition, the value for the residual standard error is 2.78. This number represents the average distance that the observed values fall from the regression line. Thus, in this example the observed values fall an average of just 2.78 units from the regression line. 

Since the multiple R-squared is quite high and the residual standard error is quite low, this tells us that the simple linear regression model fits the data well.
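
If we want to work with these numbers directly instead of reading them off the printed summary, we can pull them out of the fitted model (a quick illustration using base R accessors):

#extract the R-squared and residual standard error from the fitted model
summary(model)$r.squared #about 0.864
sigma(model) #about 2.78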

Related: What is a Good R-squared Value?

Nonlinear Relationships

However, not all predictor and response variables have a linear relationship with one another. For example, suppose the relationship between age and height for a certain species of plant is nonlinear.

When the plant is young, it grows very slowly. Then, as it gets older, it experiences a burst of growth. Then, as it continues to age, it starts to grow very slowly once again.

The following dataset shows this type of relationship between age and height for a certain species of plant:

#set seed to make this example reproducible 
set.seed(1)

#create data frame with two variables: age (days) and height (cm.)
data <- data.frame(age = round(runif(100, 1, 30), 0), height = 5)
#make height increase with the cube of age, plus some random noise
data$height = data$height + ((data$age)^3)*runif(100, 2, 3)

#view first six rows of data
head(data)

#  age     height
#1   9  1940.2937
#2  12  4071.3249
#3  18 13245.1572
#4  27 58910.0004
#5   7   908.2882
#6  27 43567.5757

#plot data with age on x-axis and height on y-axis 
plot(data$age, data$height) #scatterplot
abline(lm(height ~ age, data = data), col = 'red') #regression line

Scatterplot with nonlinear relationship in R

Clearly the relationship between age and height is not linear, and we can see that the linear regression line fits the data very poorly.

To see just how poorly this model fits the data, we can view the model output:

#fit linear regression model
model <- lm(height ~ age, data = data)

#view model output
summary(model)

#Call:
#lm(formula = height ~ age, data = data)
#
#Residuals:
#   Min     1Q Median     3Q    Max 
# -9678  -5639  -1689   3489  25296 
#
#Coefficients:
#             Estimate Std. Error t value Pr(>|t|)    
#(Intercept) -17012.52    1688.95  -10.07   <2e-16 ***
#age           2156.03      94.96   22.70   <2e-16 ***
#---
#Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#
#Residual standard error: 7318 on 98 degrees of freedom
#Multiple R-squared:  0.8403,	Adjusted R-squared:  0.8386 
#F-statistic: 515.5 on 1 and 98 DF,  p-value: < 2.2e-16

We can see that the value for Multiple R-squared is 0.8403. This means that 84.03% of the variance in the plant height can be explained by the plant age.

Although the multiple R-squared seems quite high, it can be deceiving. To get a better understanding of the fit of the model, we should look at the value for the residual standard error, which is 7,318. This tells us that the observed values fall an average of 7,318 units from the regression line. This number is quite high, and if we were to use this model to make predictions, they likely wouldn't be very accurate. 
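
In fact, because the fitted line has such a large negative intercept, the linear model can even produce impossible predictions. For example (using an age of 5 days, an arbitrary value chosen just for illustration), the predicted height is negative:

#predict the height of a 5-day-old plant with the linear model
predict(model, newdata = data.frame(age = 5))
#about -17012.52 + 5*2156.03 = -6232, i.e. a negative height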

Related: Understanding the Standard Error of the Regression

In addition, we can see visually in the plot that the regression line does not fit the data well and that the true underlying relationship between age and height is clearly not linear.

A simple plot of the residuals vs. the fitted values of the model also reveals a clear pattern in the residuals, which is another indication that the model does not fit the data well:

#plot residuals vs. fitted values
plot(fitted(model), residuals(model))

Residual vs fitted plot for linear regression in R
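
To make the pattern easier to spot, we can redraw the plot with a horizontal reference line at zero (a small optional addition):

#residual plot with a reference line at zero
plot(fitted(model), residuals(model))
abline(h = 0, lty = 2) #residuals should scatter randomly around this line for a well-fitting model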

Thus, since the true relationship between age and height is nonlinear, we can try fitting a polynomial regression model to the data, which tends to follow the curvature of the data much better than a linear model.

Introducing Polynomial Regression

Instead of using a straight line to describe the relationship between the predictor and the response variable, polynomial regression allows us to fit a curved line to the data. 

In this example in particular, recall that we used the following line to generate the height values:

data$height = data$height + ((data$age)^3)*runif(100, 2, 3) 

Essentially we raised the age variable to the power of three, then added a bit of "noise" to the data by multiplying by a random number between 2 and 3.

This means a third-order polynomial regression model is likely to fit the data better than a linear regression model.

There are two ways to fit a polynomial regression in R:

poly_model <- lm(height ~ poly(age,3), data = data)

or

poly_model <- lm(height ~ age + I(age^2) + I(age^3), data = data) 

However, age, I(age^2), and I(age^3) will be highly correlated, and correlated predictors can cause problems (multicollinearity) in regression. By contrast, poly() avoids this problem by producing orthogonal polynomial terms, so we'll use that approach instead.
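
As an aside, poly() can also generate raw (non-orthogonal) terms with raw = TRUE, which is equivalent to the I() approach. The fitted values are identical either way; only the coefficients and their correlations differ. A quick check (raw_model is just an illustrative name):

#fit the same cubic model using raw polynomial terms
raw_model <- lm(height ~ poly(age, 3, raw = TRUE), data = data)

#the fitted values match the orthogonal-polynomial fit
all.equal(fitted(raw_model), fitted(lm(height ~ poly(age, 3), data = data))) #TRUE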

#fit polynomial regression model
poly_model <- lm(height ~ poly(age,3), data = data)

#view model output
summary(poly_model)

#Call:
#lm(formula = height ~ poly(age, 3), data = data)
#
#Residuals:
#     Min       1Q   Median       3Q      Max 
#-11596.4   -666.4    134.9    625.9   9944.6 
#
#Coefficients:
#              Estimate Std. Error t value Pr(>|t|)    
#(Intercept)    17548.6      319.7  54.889  < 2e-16 ***
#poly(age, 3)1 166152.7     3197.1  51.970  < 2e-16 ***
#poly(age, 3)2  64684.8     3197.1  20.232  < 2e-16 ***
#poly(age, 3)3   9091.1     3197.1   2.844  0.00545 ** 
#---
#Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#
#Residual standard error: 3197 on 96 degrees of freedom
#Multiple R-squared:  0.9701,	Adjusted R-squared:  0.9692 
#F-statistic:  1039 on 3 and 96 DF,  p-value: < 2.2e-16

We can see that the value for Multiple R-squared is 0.9701, which is much higher than the 0.8403 from our linear model. We can also see that the value for the residual standard error is 3,197, which is much lower than the 7,318 from our linear model.

Our third-order polynomial regression model has provided a substantial improvement in fit over the linear model.
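
To see the improvement visually, we can overlay the fitted cubic curve on the scatterplot (a minimal sketch; age_grid is just an illustrative name for a grid of ages to predict over):

#plot the data and overlay the fitted polynomial curve
plot(data$age, data$height)
age_grid <- seq(min(data$age), max(data$age), length.out = 100)
lines(age_grid, predict(poly_model, newdata = data.frame(age = age_grid)), col = 'red')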

Cautions on Overfitting

When using polynomial regression, it's important to avoid overfitting the data. In general, a higher-order polynomial will always fit the training data at least as tightly, and produce an R-squared at least as high, as a lower-order one.

Unfortunately, after a certain point this can cause the model to fit the noise (or “randomness”) of the data, rather than the true underlying relationship between the predictor and the response variable. This can cause problems when using the model to make predictions and the results may be misleading. 

The key with polynomial regression is to make an improvement over a simple linear model while also avoiding the pitfalls that come with overfitting.
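
One simple (though not foolproof) way to check for diminishing returns is to compare the adjusted R-squared across several polynomial degrees; it penalizes extra terms and tends to level off once additional degrees stop adding real explanatory power:

#adjusted R-squared for polynomial degrees 1 through 5
sapply(1:5, function(d) {
  summary(lm(height ~ poly(age, d), data = data))$adj.r.squared
})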
