You can use PROC GLMSELECT in SAS to select the best regression model from a list of potential predictor variables.
The following example shows how to use this procedure in practice.
Example: How to Use PROC GLMSELECT in SAS for Model Selection
Suppose we want to fit a multiple linear regression model that uses (1) number of hours spent studying, (2) number of prep exams taken and (3) gender to predict the final exam score of students.
First, we’ll use the following code to create a dataset that contains this information for 20 students:
/*create dataset*/
data exam_data;
    input hours prep_exams gender $ score;
    datalines;
1 1 0 76
2 3 1 78
2 3 0 85
4 5 0 88
2 2 0 72
1 2 1 69
5 1 1 94
4 1 0 94
2 0 1 88
4 3 0 92
4 4 1 90
3 3 1 75
6 2 1 96
5 4 0 90
3 4 0 82
4 4 1 85
6 5 1 99
2 1 0 83
1 0 1 62
2 1 0 76
;
run;

/*view dataset*/
proc print data=exam_data;
run;
Next, we’ll use the PROC GLMSELECT statement to identify the subset of predictor variables that produces the best regression model:
/*perform model selection*/
proc glmselect data=exam_data;
    class gender;
    model score = hours prep_exams gender;
run;
Note: We included gender in the class statement because it is a categorical variable.
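The call above relies on the procedure's defaults. As a minimal sketch, and assuming the documented defaults (stepwise selection with SBC as both the select and the stop criterion), the same request can be written out explicitly. The details=steps option is our addition, not part of the original example; it asks SAS to print extra information for each selection step.

```sas
/*sketch: same model selection with the assumed defaults written out*/
proc glmselect data=exam_data;
    class gender;
    model score = hours prep_exams gender
          / selection=stepwise(select=sbc stop=sbc) /*assumed to match the defaults*/
            details=steps;                          /*our addition: per-step details*/
run;
```

Spelling out the options makes it easier to try a different method or criterion later, for example selection=forward(select=aic).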
The first group of tables in the output shows an overview of the GLMSELECT procedure:
We can see that the criterion used to stop adding or removing variables from the model was SBC, the Schwarz Bayesian Criterion, which is also known as the Bayesian Information Criterion (BIC).
Essentially, PROC GLMSELECT keeps adding or removing variables until no further step lowers the SBC value. The model with the lowest SBC value is considered the “best” model.
The next group of tables shows how the stepwise selection stopped:
We can see that a model with only the intercept term had an SBC value of 93.4337.
By adding hours as a predictor variable in the model, the SBC value dropped to 70.4452.
The next best possible way to improve the model was to add gender as a predictor variable, but this actually increased the SBC value to 71.7383.
Thus, the final model only includes the intercept term and hours studied.
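To see where these SBC values come from, the sketch below recomputes the SBC of the intercept-only model by hand. It assumes the formula SBC = n*ln(SSE/n) + p*ln(n), which is our reading of the SAS documentation, where n is the number of observations, p is the number of parameters and SSE is the error sum of squares. For an intercept-only model, SSE equals the corrected sum of squares of score. The dataset and variable names score_stats, sbc_check, sse0 and p are hypothetical.

```sas
/*sketch: recompute SBC for the intercept-only model*/
proc means data=exam_data noprint;
    var score;
    output out=score_stats n=n css=sse0; /*css = corrected sum of squares*/
run;

data sbc_check;
    set score_stats;
    p = 1;                           /*intercept only*/
    sbc = n*log(sse0/n) + p*log(n);  /*log() is the natural log in SAS*/
run;

proc print data=sbc_check;
    var n sse0 p sbc;
run;
```

If the assumed formula is right, the printed sbc value should be close to the 93.4337 reported for the intercept-only model.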
The last portion of the output shows the summary of this fitted regression model:
We can use the values from the Parameter Estimates table to write the fitted regression model:
Exam Score = 67.161689 + 5.250257(hours studied)
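As a quick illustration, the fitted equation can be applied in a data step to estimate exam scores. The coefficients are the ones reported above. The hours values and the names predicted and predicted_score are hypothetical and chosen only for this sketch.

```sas
/*sketch: apply the fitted equation to a few hypothetical values of hours*/
data predicted;
    input hours;
    predicted_score = 67.161689 + 5.250257*hours;
    datalines;
1
3
6
;
run;

proc print data=predicted;
run;
```

Each additional hour studied raises the predicted score by about 5.25 points, so it is safest to use the equation only within the range of hours observed in the data (1 to 6).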
We can also see various metrics that tell us how well this model fits the data:
The R-Square value tells us the percentage of variation in the exam scores that can be explained by the predictor variables in the final model, which here is only the number of hours studied.
In this case, 72.73% of the variation in exam scores can be explained by the number of hours studied.
The Root MSE value is also useful to know. It represents roughly the average distance that the observed values fall from the regression line.
In this regression model, the observed values fall an average of 5.28052 units from the regression line.
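One way to cross-check these fit statistics is to refit the selected model directly. The sketch below assumes the exam_data dataset created earlier and uses PROC REG, which is not part of the original example. Because both procedures fit the same ordinary least squares model, the parameter estimates, R-Square and Root MSE should agree with the GLMSELECT output.

```sas
/*sketch: refit the selected model (intercept + hours) with PROC REG*/
proc reg data=exam_data;
    model score = hours;
run;
quit;
```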
Note: Refer to the SAS documentation for a complete list of potential arguments you can use with PROC GLMSELECT.