Correlation is used to measure the linear association between two variables.
A correlation coefficient always takes on a value between -1 and 1 where:
- -1 indicates a perfect negative linear correlation between two variables
- 0 indicates no linear correlation between two variables
- 1 indicates a perfect positive linear correlation between two variables
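To make the two endpoints concrete, here is a minimal sketch that computes the Pearson correlation coefficient from its definition. The function name `pearson_r` and the small data sets are made up for illustration; they are not part of any particular library.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-products of the deviations from each mean
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    # Sums of squared deviations for each variable
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

x = [1, 2, 3, 4, 5]

# y rises exactly in step with x, so the points fall on an upward-sloping line
r_positive = pearson_r(x, [2 * v + 1 for v in x])

# y falls exactly in step with x, so the points fall on a downward-sloping line
r_negative = pearson_r(x, [10 - 3 * v for v in x])

print(r_positive, r_negative)
```

When every point falls exactly on a straight line, the coefficient sits at one of the two extremes; real data almost always land somewhere in between.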
One question students often have is: When should I use correlation?
The short answer: Use correlation when you want to quantify the linear relationship between two variables and neither of the variables represents a response or “outcome” variable.
The following examples illustrate when you should and should not use correlation in practice.
Example 1: When to Use Correlation
Suppose a professor wants to understand the linear relationship between math exam scores and science exam scores for students in his class.
For example, do students who score high on the math exam also score high on the science exam? Or do students who score high in math tend to score low on science?
In this scenario, he could calculate the correlation between the math exam scores and science exam scores because he simply wants to understand the linear relationship between the two variables and neither variable can be considered a response variable.
Suppose he does calculate the Pearson correlation coefficient and finds it to be r = 0.78. This is a strong positive correlation, which means that students who score high on math also tend to score high on science.
Example 2: When Not to Use Correlation
Suppose a marketing department at some company wants to quantify how advertisement spending affects total revenue.
For example, for each additional dollar spent on advertising how much additional revenue can the company expect to earn?
In this scenario, the department should use a linear regression model to quantify the relationship between ad spend and total revenue because the variable “revenue” is the response variable.
Suppose the department does fit a simple linear regression model and finds the following equation best describes the relationship between ad spend and total revenue:
Total revenue = 145.4 + 0.34*(ad spend)
We would interpret this to mean that each additional dollar spent on advertising is associated with an average increase of $0.34 in total revenue.
Cautions on Using Correlation
It’s important to note that correlation can only be used to quantify the linear relationship between two variables.
As a result, a correlation coefficient can fail to capture the relationship between two variables when that relationship is non-linear.
For example, suppose we create the following scatterplot to visualize the relationship between two variables:
If we calculate the correlation coefficient between these two variables, it turns out to be r = 0. This means there is no linear relationship between the two variables.
However, from the plot we can see that the two variables do have a relationship – it just happens to be a quadratic relationship instead of a linear one.
Thus, when you calculate the correlation between two variables keep in mind that it can be helpful to create a scatterplot to visualize the relationship between the variables as well.
Even if two variables don’t have a linear relationship, it’s possible that they could have a non-linear relationship which would be revealed in a scatterplot.