A Q-Q plot, short for “quantile-quantile” plot, is often used to assess whether or not the residuals in a regression analysis are normally distributed.
This tutorial explains how to create and interpret a Q-Q plot in Stata.
Example: Q-Q Plot in Stata
For this example we will use the built-in auto dataset in Stata. We will fit a multiple linear regression model, using mpg and displacement as the explanatory variables and price as the response variable. We will then obtain the residuals for the model and create a Q-Q plot to see whether the residuals follow a normal distribution.
Step 1: Load and view the data.
First, we’ll load the data using the following command:
sysuse auto
Next, we’ll get a quick summary of the data using the following command:
summarize
Step 2: Fit the regression model.
Next, we’ll use the following command to fit the regression model:
regress price mpg displacement
Step 3: Calculate the residuals.
Recall that a residual is simply the difference between the actual response value and the predicted response value (as calculated by the estimated regression equation).
We can obtain the residual for each observation by using the predict command with the residuals option and storing these values in a variable named whatever we’d like. In this case, we’ll use the name resid_price:
predict resid_price, residuals
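To make the definition of a residual concrete, the following is a minimal sketch that rebuilds the residuals by hand and compares them with what predict returns. The variable names yhat, resid_manual and resid_check are illustrative and not part of the original example, and the block reloads the dataset so that it can be run on its own:

```stata
* Reload the data and refit the model so this block is self-contained
sysuse auto, clear
regress price mpg displacement

* Fitted (predicted) values from the estimated regression equation
predict double yhat, xb

* Residual = actual response minus predicted response
generate double resid_manual = price - yhat

* Residuals as computed by predict, for comparison
predict double resid_check, residuals

* The two variables should show matching summary statistics
summarize resid_manual resid_check
```

If the two rows of the summary agree, the hand-computed residuals match those produced by the residuals option.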
Step 4: Create the Q-Q Plot.
Now that we have the residuals stored in resid_price, we can create a Q-Q plot using the qnorm command:
qnorm resid_price
Step 5: Interpret the Q-Q Plot.
The idea behind a Q-Q plot is simple: if the residuals fall along a roughly straight line at a 45-degree angle, then the residuals are roughly normally distributed. We can see in our Q-Q plot above that the residuals tend to deviate from the 45-degree line quite a bit, especially on the tail ends, which could be an indication that they’re not normally distributed.
Although a Q-Q plot isn’t a formal statistical test, it offers an easy way to visually check whether or not the residuals are normally distributed.
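If you would like a formal test to go alongside the visual check, Stata provides the swilk (Shapiro-Wilk) and sktest (skewness and kurtosis) commands. The sketch below is one possible way to apply them to the residuals; it is not part of the original example, the variable name resid_check is illustrative, and the block reloads the data so that it runs on its own:

```stata
* Reload the data and refit the model so this block is self-contained
sysuse auto, clear
regress price mpg displacement
predict resid_check, residuals

* Skewness and kurtosis test for normality
sktest resid_check

* Shapiro-Wilk test for normality
swilk resid_check
```

In both tests the null hypothesis is that the variable is normally distributed, so a small p-value points toward non-normal residuals.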
If it turns out that your residuals deviate severely from the 45-degree line in the Q-Q plot, you may consider performing a transformation on the response variable in your regression, such as using the square root or the log of the response variable.
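As a rough sketch of what such a transformation could look like with the auto dataset, the commands below refit the model with the natural log of price as the response and draw a new Q-Q plot. The variable names log_price and resid_log are illustrative, and whether the log is the right choice for your own data is an assumption you would need to check:

```stata
* Reload the data so this block is self-contained
sysuse auto, clear

* Natural log of the response variable
generate log_price = ln(price)

* Refit the model using the transformed response
regress log_price mpg displacement

* Residuals from the new model
predict resid_log, residuals

* Q-Q plot of the new residuals against the normal distribution
qnorm resid_log
```

You can then compare this plot with the original one to judge whether the points sit closer to the reference line.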
If the residuals only deviate slightly, you don’t need to worry about transforming the response variable as regression is fairly robust to departures from normality.