This tutorial explains how to perform multiple linear regression in SAS.
Step 1: Create the Data
Suppose we want to fit a multiple linear regression model that uses number of hours spent studying and number of prep exams taken to predict the final exam score of students:
Exam Score = β0 + β1(hours) + β2(prep exams)
First, we’ll use the following code to create a dataset that contains this information for 20 students:
/*create dataset*/
data exam_data;
    input hours prep_exams score;
    datalines;
1 1 76
2 3 78
2 3 85
4 5 88
2 2 72
1 2 69
5 1 94
4 1 94
2 0 88
4 3 92
4 4 90
3 3 75
6 2 96
5 4 90
3 4 82
4 4 85
6 5 99
2 1 83
1 0 62
2 1 76
;
run;
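Before fitting the model, it can help to confirm that all 20 rows were read in correctly. The following is a minimal sketch, assuming the exam_data dataset was created exactly as above; the choice of summary statistics is only illustrative:

```sas
/*view the dataset to confirm it was read correctly*/
proc print data=exam_data;
run;

/*summarize each variable (n should be 20 for all three)*/
proc means data=exam_data n mean min max;
    var hours prep_exams score;
run;
```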
Step 2: Perform Multiple Linear Regression
Next, we’ll use proc reg to fit a multiple linear regression model to the data:
/*fit multiple linear regression model*/
proc reg data=exam_data;
    model score = hours prep_exams;
run;
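If you would rather work with the output tables as datasets than read them from the results window, one option is an ODS OUTPUT statement. This is a sketch only: the dataset names anova_table, fit_table and param_table are made up for illustration, while ANOVA, FitStatistics and ParameterEstimates are the ODS table names that proc reg produces.

```sas
/*save the main proc reg tables as datasets (dataset names are illustrative)*/
ods output ANOVA=anova_table
           FitStatistics=fit_table
           ParameterEstimates=param_table;

proc reg data=exam_data;
    model score = hours prep_exams;
run;
quit;

/*view the saved parameter estimates*/
proc print data=param_table;
run;
```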
Here is how to interpret the most relevant numbers in each table of the proc reg output:
Analysis of Variance Table:
The overall F-value of the regression model is 23.46 and the corresponding p-value is <.0001.
Since this p-value is less than .05, we conclude that the regression model as a whole is statistically significant.
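To see where this p-value comes from, it can be recomputed from the F-value and its degrees of freedom. The sketch below assumes 2 numerator degrees of freedom (one per predictor) and 20 - 2 - 1 = 17 denominator degrees of freedom; the dataset and variable names are illustrative.

```sas
/*recompute the p-value of the overall F-test*/
data f_test;
    f_value = 23.46;        /*overall F-value reported by proc reg*/
    num_df  = 2;            /*number of predictor variables*/
    den_df  = 20 - 2 - 1;   /*observations - predictors - 1*/
    p_value = sdf('F', f_value, num_df, den_df);  /*upper-tail probability*/
run;

proc print data=f_test;
    format p_value pvalue6.4;
run;
```

The resulting p-value should be far below .05, in line with the <.0001 shown in the Analysis of Variance table.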
Model Fit Table:
The R-Square value tells us the percentage of variation in the exam scores that can be explained by the number of hours studied and the number of prep exams taken.
In general, the larger the R-squared value of a regression model, the better the predictor variables are able to predict the value of the response variable.
In this case, 73.4% of the variation in exam scores can be explained by the number of hours studied and number of prep exams taken.
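As a consistency check, the R-Square value can be recovered from the overall F-value. In the worked equation below, k (the number of predictors, here 2) and n (the number of students, here 20) are notation introduced for this illustration:

```latex
R^2 = \frac{kF}{kF + (n - k - 1)}
    = \frac{2(23.46)}{2(23.46) + 17}
    = \frac{46.92}{63.92}
    \approx 0.734
```

This agrees with the 73.4% reported in the output.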
The Root MSE value is also useful to know. It is the standard deviation of the residuals, which can be read as the typical distance between the observed exam scores and the scores predicted by the model.
In this regression model, the observed values fall roughly 5.3657 units, on average, from the predicted values.
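The Root MSE can also be rebuilt by hand from the residuals: it is the square root of the sum of squared residuals divided by the error degrees of freedom (here 20 - 3 = 17, since three parameters are estimated). The sketch below assumes the exam_data dataset from Step 1; the names resid_data, resid and root_mse are illustrative.

```sas
/*refit the model quietly and save the residuals*/
proc reg data=exam_data noprint;
    model score = hours prep_exams;
    output out=resid_data r=resid;
run;
quit;

/*root MSE = sqrt(sum of squared residuals / (n - 3))*/
proc sql;
    select sqrt(sum(resid**2) / (count(*) - 3)) as root_mse
    from resid_data;
quit;
```

The value returned should match the Root MSE shown in the Model Fit table.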
Parameter Estimates Table:
We can use the parameter estimate values in this table to write the fitted regression equation:
Exam score = 67.674 + 5.556*(hours) – .602*(prep_exams)
We can use this equation to find the estimated exam score for a student, based on the number of hours they studied and the number of prep exams they took.
For example, a student who studies for 3 hours and takes 2 prep exams is expected to receive an exam score of 83.1:
Estimated exam score = 67.674 + 5.556*(3) – .602*(2) = 83.1
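The same arithmetic can be done in a data step. This is a minimal sketch that plugs in the rounded coefficients from the fitted equation, so the result may differ slightly from a prediction produced by SAS at full precision; the dataset and variable names are illustrative.

```sas
/*estimate the exam score for 3 hours studied and 2 prep exams*/
data predicted;
    hours      = 3;
    prep_exams = 2;
    est_score  = 67.674 + 5.556*hours - 0.602*prep_exams;  /*= 83.138*/
run;

proc print data=predicted;
run;
```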
The p-value for hours (<.0001) is less than .05, which means that it has a statistically significant association with exam score.
However, the p-value for prep exams (.5193) is not less than .05, which means it does not have a statistically significant association with exam score.
We may decide to remove prep exams from the model since it isn’t statistically significant and instead perform simple linear regression using hours studied as the only predictor variable.
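If we do drop prep exams, the reduced model can be fit by listing only hours in the model statement. This is a sketch that assumes the same exam_data dataset; the coefficients and fit statistics of the reduced model will differ from those reported above.

```sas
/*fit simple linear regression model using hours as the only predictor*/
proc reg data=exam_data;
    model score = hours;
run;
quit;
```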
The following tutorials explain how to perform other common tasks in SAS: