A lack of fit test is used to determine whether or not a full regression model offers a significantly better fit to a dataset than some reduced version of the model.
For example, suppose we would like to use number of hours studied to predict exam score for students at a certain college. We may decide to fit the following two regression models:
Full Model: Score = β0 + B1(hours) + B2(hours)2
Reduced Model: Score = β0 + B1(hours)
The following step-by-step example shows how to perform a lack of fit test in R to determine if the full model offers a significantly better fit than the reduced model.
Step 1: Create & Visualize a Dataset
First, we’ll use the following code to create a dataset that contains the number of hours studied and exam score received for 50 students:
#make this example reproducible set.seed(1) #create dataset df <- data.frame(hours = runif(50, 5, 15), score=50) df$score = df$score + df$hours^3/150 + df$hours*runif(50, 1, 2) #view first six rows of data head(df) hours score 1 7.655087 64.30191 2 8.721239 70.65430 3 10.728534 73.66114 4 14.082078 86.14630 5 7.016819 59.81595 6 13.983897 83.60510
Next, we’ll create a scatterplot to visualize the relationship between hours and score:
#load ggplot2 visualization package library(ggplot2) #create scatterplot ggplot(df, aes(x=hours, y=score)) + geom_point()
Step 2: Fit Two Different Models to the Dataset
Next, we’ll fit two different regression models to the dataset:
#fit full model full <- lm(score ~ poly(hours,2), data=df) #fit reduced model reduced <- lm(score ~ hours, data=df)
Step 3: Perform a Lack of Fit Test
Next, we’ll use the anova() command to perform a lack of fit test between the two models:
#lack of fit test anova(full, reduced) Analysis of Variance Table Model 1: score ~ poly(hours, 2) Model 2: score ~ hours Res.Df RSS Df Sum of Sq F Pr(>F) 1 47 368.48 2 48 451.22 -1 -82.744 10.554 0.002144 ** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
The F test-statistic turns out to be 10.554 and the corresponding p-value is 0.002144. Since this p-value is less than .05, we can reject the null hypothesis of the test and conclude that the full model offers a statistically significantly better fit than the reduced model.
Step 4: Visualize the Final Model
Lastly, we can visualize the final model (the full model) relative to the original dataset:
ggplot(df, aes(x=hours, y=score)) + geom_point() + stat_smooth(method='lm', formula = y ~ poly(x,2), size = 1) + xlab('Hours Studied') + ylab('Score')
We can see that the curve of the model fits the data quite well.