Correlation and regression are two terms in statistics that are related, but not quite the same.
In this tutorial, we’ll provide a brief explanation of both terms and explain how they’re similar and different.
What is Correlation?
Correlation measures the linear association between two variables, x and y. It is summarized by the correlation coefficient, r, which takes a value between -1 and 1, where:
- -1 indicates a perfectly negative linear correlation between two variables
- 0 indicates no linear correlation between two variables
- 1 indicates a perfectly positive linear correlation between two variables
For example, suppose we have the following dataset that contains two variables: (1) Hours studied and (2) Exam Score received for 20 different students:
If we created a scatterplot of hours studied vs. exam score, here’s what it would look like:
Just from looking at the plot, we can tell that students who study more tend to earn higher exam scores. In other words, we can visually see that there is a positive correlation between the two variables.
Using a calculator, we can find that the correlation between these two variables is r = 0.915. Since this value is close to 1, it confirms that there is a strong positive correlation between the two variables.
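For readers who would rather compute r than rely on a calculator, here is a minimal sketch of the Pearson correlation coefficient in plain Python. The function name pearson_r and the six data points are hypothetical and chosen only for illustration. They are not the 20-student dataset described above, so the result will not equal 0.915.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-products and sums of squares of the deviations from the means
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical illustration data (NOT the dataset from this tutorial)
hours = [1, 2, 3, 4, 5, 6]
scores = [68, 70, 75, 74, 80, 83]

r = pearson_r(hours, scores)
print(round(r, 3))
```

A value near 1 indicates a strong positive linear association, a value near -1 a strong negative one, and a value near 0 little or no linear association.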
What is Regression?
Regression is a method we can use to understand how changes in the values of the x variable are associated with changes in the values of the y variable.
A regression model uses one variable, x, as the predictor variable, and the other variable, y, as the response variable. It then finds an equation with the following form that best describes the relationship between the two variables:
ŷ = b0 + b1*x
where:
- ŷ: The predicted value of the response variable
- b0: The y-intercept (the predicted value of y when x is equal to zero)
- b1: The regression coefficient (the average change in y associated with a one unit increase in x)
- x: The value of the predictor variable
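As a rough illustration of how b0 and b1 can be found, the sketch below uses the ordinary least squares formulas, which are the usual choice for this kind of model. The tutorial does not say which method its calculator uses, so treat that as an assumption. The function name fit_line and the data points are hypothetical.

```python
def fit_line(x, y):
    """Return (b0, b1) for the least squares line y_hat = b0 + b1*x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    b1 = sxy / sxx              # slope: average change in y per unit of x
    b0 = mean_y - b1 * mean_x   # intercept: predicted y when x is zero
    return b0, b1

# Hypothetical illustration data (NOT the dataset from this tutorial)
hours = [1, 2, 3, 4, 5, 6]
scores = [68, 70, 75, 74, 80, 83]

b0, b1 = fit_line(hours, scores)
print(round(b0, 2), round(b1, 2))
```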
For example, consider our dataset from earlier:
Using a linear regression calculator, we find that the following equation best describes the relationship between these two variables:
Predicted exam score = 65.47 + 2.58*(hours studied)
The way to interpret this equation is as follows:
- The predicted exam score for a student who studies zero hours is 65.47.
- The average increase in exam score associated with one additional hour studied is 2.58.
We can also use this equation to predict the score that a student will receive based on the number of hours studied.
For example, a student who studies 6 hours is expected to receive a score of 80.95:
Predicted exam score = 65.47 + 2.58*(6) = 80.95.
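The same arithmetic can be written as a small function. This sketch takes the coefficients 65.47 and 2.58 directly from the fitted equation above. The function name predict_score is only a label chosen for this example.

```python
B0 = 65.47  # intercept from the fitted equation above
B1 = 2.58   # slope from the fitted equation above

def predict_score(hours_studied):
    """Predicted exam score for a given number of hours studied."""
    return B0 + B1 * hours_studied

print(round(predict_score(6), 2))  # 80.95
```

Predictions like this are most trustworthy for values of hours studied that fall within the range seen in the original data.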
We can also plot this equation as a line on a scatterplot:
We can see that the regression line “fits” the data quite well.
Recall that the correlation between these two variables was r = 0.915. It turns out that we can square this value to get a number called “r-squared,” which describes the proportion of the variance in the response variable that can be explained by the predictor variable.
In this example, r² = 0.915² = 0.837. This means that 83.7% of the variation in exam scores can be explained by the number of hours studied.
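To see why squaring r gives the proportion of explained variance, the sketch below computes r-squared two ways for a simple linear regression: once by squaring the correlation, and once as 1 minus the ratio of residual variation to total variation. The data and all variable names are hypothetical and are not the tutorial's dataset. The two numbers agree only for a least squares line with one predictor and an intercept.

```python
import math

# Hypothetical illustration data (NOT the dataset from this tutorial)
hours = [1, 2, 3, 4, 5, 6]
scores = [68, 70, 75, 74, 80, 83]

n = len(hours)
mean_x = sum(hours) / n
mean_y = sum(scores) / n
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(hours, scores))
sxx = sum((x - mean_x) ** 2 for x in hours)
syy = sum((y - mean_y) ** 2 for y in scores)  # total variation in y

# Way 1: square the correlation coefficient
r = sxy / math.sqrt(sxx * syy)
r_squared_from_r = r ** 2

# Way 2: fit the least squares line and compare residual to total variation
b1 = sxy / sxx
b0 = mean_y - b1 * mean_x
ss_res = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(hours, scores))
r_squared_from_fit = 1 - ss_res / syy

print(round(r_squared_from_r, 4), round(r_squared_from_fit, 4))
```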
Correlation vs. Regression: Similarities & Differences
Here is a summary of the similarities and differences between correlation and regression:
- Both quantify the direction of a relationship between two variables.
- Both quantify the strength of a relationship between two variables.
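- Regression treats one variable as the predictor and the other as the response, so it describes how y changes, on average, as x changes. Correlation treats the two variables symmetrically. Neither method, by itself, proves a cause-and-effect relationship.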
- Regression is able to use an equation to predict the value of one variable, based on the value of another variable. Correlation does not do this.
- Regression uses an equation to quantify the relationship between two variables. Correlation uses a single number.
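One way to see the difference between a single number and an equation is to swap the roles of the two variables. In the sketch below, which uses hypothetical data and helper names, the correlation is the same in both directions, while the regression slope depends on which variable is treated as the response.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def slope(x, y):
    """Least squares slope for predicting y from x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

# Hypothetical illustration data (NOT the dataset from this tutorial)
hours = [1, 2, 3, 4, 5, 6]
scores = [68, 70, 75, 74, 80, 83]

r_xy = pearson_r(hours, scores)
r_yx = pearson_r(scores, hours)            # same value as r_xy
slope_scores_on_hours = slope(hours, scores)
slope_hours_on_scores = slope(scores, hours)  # a different number

print(round(r_xy, 3), round(r_yx, 3))
print(round(slope_scores_on_hours, 3), round(slope_hours_on_scores, 3))
```

The two slopes are still linked to the correlation: their product equals r².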