Stepwise regression is a procedure for building a regression model from a set of candidate predictor variables: predictors are entered into and removed from the model one step at a time until there is no statistically valid reason to enter or remove any more.
The goal of stepwise regression is to build a regression model that includes all of the predictor variables that are statistically significantly related to the response variable.
To perform stepwise regression in SAS, you can use PROC REG with the SELECTION= option in the MODEL statement.
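As a rough sketch of the general form, the following shows where the SELECTION= option goes. The names my_data, y, and x1 through x4 are placeholders that match the example used later in this tutorial, and the SLENTRY= and SLSTAY= significance levels are illustrative values rather than recommendations:

```sas
/*sketch of stepwise selection; data set and variable names are placeholders*/
proc reg data=my_data;
    /*slentry= is the significance level to enter, slstay= the level to stay*/
    /*both values here are illustrative assumptions*/
    model y = x1 x2 x3 x4 / selection=stepwise slentry=0.15 slstay=0.15;
run;
quit;
```

With SELECTION=STEPWISE, predictors are added one at a time and can be dropped again if they no longer meet the SLSTAY= threshold.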
The following example shows how to perform stepwise regression in SAS in practice.
Example: Perform Stepwise Regression in SAS
Suppose we have the following dataset in SAS that contains four predictor variables (x1, x2, x3, x4) and one response variable (y):
/*create dataset*/
data my_data;
    input x1 x2 x3 x4 y;
    datalines;
1 4 10 13 78
2 4 12 14 81
5 3 7 10 75
8 2 13 9 97
10 5 12 5 95
14 7 8 6 90
17 8 10 6 86
19 5 15 5 90
20 5 12 4 93
21 4 10 3 95
;
run;

/*view dataset*/
proc print data=my_data;
run;
Now suppose that we would like to find which combination of predictor variables will produce the best multiple linear regression model.
When we say “best” regression model, we mean the model that maximizes or minimizes some metric.
There are two metrics we commonly use to assess which regression model is best among a group of potential models:
1. Adjusted R-squared: The adjusted R-squared value tells us how useful a model is, adjusted for the number of predictors in a model. The model with the highest adjusted R-squared value is considered the best.
2. AIC: The Akaike information criterion (AIC) is a metric that is used to compare the fit of different regression models. The model with the lowest AIC value is considered the best.
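For reference, the two metrics can be written as follows. This is a sketch using common textbook forms for a model with an intercept, where n is the number of observations, p is the number of estimated parameters (including the intercept), and SSE is the error sum of squares. The AIC form shown is the one we understand PROC REG to report; other software may add constants, so AIC values should only be compared within the same tool.

```latex
\[
R^2_{\text{adj}} = 1 - \frac{(1 - R^2)(n - 1)}{n - p},
\qquad
\text{AIC} = n \ln\!\left(\frac{\text{SSE}}{n}\right) + 2p
\]
```

Both metrics penalize extra predictors: adding a predictor raises p, which lowers the adjusted R-squared and raises the AIC unless the fit improves enough to compensate.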
Fortunately, we can calculate both the adjusted R-squared and AIC values for regression models in SAS by using PROC REG with the SELECTION= option in the MODEL statement.
The following code shows how to do so:
/*fit every subset of predictors and report adjusted R-squared and AIC for each*/
proc reg data=my_data outest=est;
    model y=x1 x2 x3 x4 / selection=adjrsq aic;
    output out=out p=p r=r;
run;
quit;
The output displays the adjusted R-squared and AIC values for every possible multiple linear regression model.
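With more predictors the list of candidate models grows quickly, so it can help to rank them programmatically. The sketch below assumes that est is the OUTEST= data set written by the PROC REG step above and that it contains an _AIC_ column, which PROC REG should include when the AIC option is specified. The name est_ranked is a hypothetical output data set used only for illustration:

```sas
/*rank the candidate models by AIC, lowest first*/
/*assumes est is the OUTEST= data set from the PROC REG step above*/
proc sort data=est out=est_ranked;
    by _AIC_;
run;

/*print the ranked models; the first row has the lowest AIC*/
proc print data=est_ranked;
run;
```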
From the output we can see that the model with the highest adjusted R-squared value and the lowest AIC value is the regression model that uses only x3 and x4 as the predictor variables.
Thus, we would declare the following model to be “best” out of all possible models:
y = b0 + b1(x3) + b2(x4)
This particular regression model has the following metrics:
- Adjusted R-squared value: 0.5923
- AIC: 34.2921
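To inspect the coefficient estimates b0, b1, and b2 for the chosen model, one option is to refit it on its own. A minimal sketch, assuming the my_data data set created earlier:

```sas
/*refit the selected two-predictor model to view its coefficient estimates*/
proc reg data=my_data;
    model y = x3 x4;
run;
quit;
```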
Notes on Selecting the “Best” Regression Model
Note that the model with the highest adjusted R-squared value does not always have the lowest AIC value.
When it comes to deciding which regression model is best, adjusted R-squared and AIC serve as guides, but in practice you may also need domain expertise to make the final choice.
It can also be a good idea to choose a parsimonious model, which is a model that achieves a desired level of goodness of fit using as few predictor variables as possible.
The reasoning for this type of model stems from the idea of Occam’s Razor (sometimes called the “Principle of Parsimony”) which says that the simplest explanation is most likely the right one.
Applied to statistics, a model that has few parameters but achieves a satisfactory level of goodness of fit should be preferred over a model that has many parameters and achieves only a slightly higher level of goodness of fit.