In statistics, we’re often interested in understanding how two variables are related to each other. For example, we might want to know:

- What is the relationship between the number of hours a student studies and the exam score they receive?
- What is the relationship between the temperature outside and the number of ice cream cones that a food truck sells?
- What is the relationship between marketing dollars spent and total income earned for a certain business?

In each of these scenarios, we’re trying to understand the relationship between two different variables.

In statistics, one of the most common ways that we quantify a relationship between two variables is by using the Pearson correlation coefficient, which is a measure of the *linear* association between two variables. It has a value between -1 and 1 where:

- -1 indicates a perfectly negative linear correlation between two variables
- 0 indicates no linear correlation between two variables
- 1 indicates a perfectly positive linear correlation between two variables

Often denoted as *r*, this number helps us understand how strong the relationship is between two variables. **The further away *r* is from zero, the stronger the relationship between the two variables.**

It’s important to note that two variables could have a strong *positive* correlation or a strong *negative* correlation.

**Strong positive correlation:** When the value of one variable increases, the value of the other variable increases in a similar fashion. For example, the more hours that a student studies, the higher their exam score tends to be. Hours studied and exam scores have a strong positive correlation.

**Strong negative correlation:** When the value of one variable increases, the value of the other variable tends to decrease. For example, the older a chicken becomes, the fewer eggs it tends to produce. Chicken age and egg production have a strong negative correlation.
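To make this concrete, here is a minimal sketch of computing Pearson’s *r* for the hours-studied example. The data values are made-up illustrative numbers, not data from this article:

```python
import numpy as np

# Hypothetical hours-studied and exam-score data (illustrative only).
hours = np.array([1, 2, 3, 4, 5, 6, 7, 8], dtype=float)
scores = np.array([55, 61, 68, 72, 77, 82, 88, 95], dtype=float)

# np.corrcoef returns the 2x2 correlation matrix; the off-diagonal
# entry is the Pearson correlation coefficient r.
r = np.corrcoef(hours, scores)[0, 1]
print(f"r = {r:.3f}")
```

Because the scores rise steadily with hours studied, *r* comes out close to 1, i.e. a strong positive correlation.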

The following table shows the rule of thumb for interpreting the strength of the relationship between two variables based on the value of *r*:

| Absolute value of r | Strength of relationship |
|---|---|
| r < 0.25 | No relationship |
| 0.25 < r < 0.5 | Weak relationship |
| 0.5 < r < 0.75 | Moderate relationship |
| r > 0.75 | Strong relationship |
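The rule-of-thumb thresholds can be encoded as a small helper function. This is just one way to write it (the function name is ours, not a standard API), and since the table uses strict inequalities, boundary values like exactly 0.25 are assigned here to the higher category:

```python
def strength_of_relationship(r):
    """Map the absolute value of r to the rule-of-thumb labels."""
    abs_r = abs(r)
    if abs_r < 0.25:
        return "No relationship"
    elif abs_r < 0.5:
        return "Weak relationship"
    elif abs_r < 0.75:
        return "Moderate relationship"
    else:
        return "Strong relationship"

print(strength_of_relationship(0.3))   # Weak relationship
print(strength_of_relationship(-0.8))  # Strong relationship
```

Note that the function uses the absolute value of *r*, so a correlation of -0.8 is just as “strong” as one of 0.8.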

The correlation between two variables is considered to be strong if the absolute value of *r* is greater than **0.75**. However, the definition of a “strong” correlation can vary from one field to the next.

**Medical**

For example, in medical fields the definition of a “strong” relationship is often much lower. If the relationship between taking a certain drug and the reduction in heart attacks is *r* = **0.3**, this might be considered a “weak positive” relationship in other fields, but in medicine it’s significant enough that it would be worth taking the drug to reduce the chances of having a heart attack.

**Human Resources**

In a field such as human resources, lower correlations are also often considered meaningful. For example, the correlation between college grades and job performance has been shown to be about *r* = **0.16**. This is fairly low, but it’s large enough that it’s something a company would at least look at during an interview process.

**Technology**

And in a field like technology, the correlation between variables might need to be much higher in some cases to be considered “strong.” For example, if a company creates a self-driving car and the correlation between the car’s turning decisions and the probability of getting in a wreck is *r* = **0.95**, this is likely too low for the car to be considered safe since the result of making the wrong decision can be fatal.

**Visualizing Correlations**

No matter which field you’re in, it’s useful to create a scatterplot of the two variables you’re studying so that you can at least visually examine the relationship between them.

For example, suppose we have the following dataset that shows the height and weight of 12 individuals:

It’s a bit hard to understand the relationship between these two variables by just looking at the raw data. However, it’s much easier to understand the relationship if we create a scatterplot with height on the x-axis and weight on the y-axis:

Clearly there is a positive relationship between the two variables.
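A scatterplot like the one described can be produced with a few lines of matplotlib. The height and weight values below are illustrative stand-ins, since the article’s original dataset isn’t reproduced here:

```python
import matplotlib
matplotlib.use("Agg")  # render to a file, no display needed
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical height (inches) and weight (lbs) for 12 individuals.
height = np.array([60, 62, 63, 65, 66, 67, 68, 69, 70, 72, 74, 75])
weight = np.array([110, 120, 125, 135, 140, 150, 155, 160, 165, 180, 190, 200])

# Scatterplot with height on the x-axis and weight on the y-axis.
plt.scatter(height, weight)
plt.xlabel("Height (inches)")
plt.ylabel("Weight (lbs)")
plt.title("Height vs. Weight")
plt.savefig("height_weight_scatter.png")

r = np.corrcoef(height, weight)[0, 1]
print(f"r = {r:.3f}")
```

The printed *r* for this stand-in data is strongly positive, matching what the upward-sloping cloud of points shows visually.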

Creating a scatterplot is a good idea for two more reasons:

**(1) A scatterplot allows you to identify outliers that are impacting the correlation.**

One extreme outlier can dramatically change a Pearson correlation coefficient. Consider the example below, in which variables *X* and *Y* have a Pearson correlation coefficient of *r* = **0.00**.

But now imagine that we have one outlier in the dataset:

This outlier causes the correlation to be *r* = **0.878**. This single data point completely changes the correlation and makes it seem as if there is a strong relationship between variables *X* and *Y*, when there really isn’t.
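The effect is easy to reproduce. The small dataset below is constructed (our own illustrative values, not the article’s data) so that *r* is exactly 0, and then a single extreme point is appended:

```python
import numpy as np

# Five points deliberately arranged so the Pearson correlation is 0.
x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([2, 4, 2, 4, 2], dtype=float)
r_before = np.corrcoef(x, y)[0, 1]

# Append one extreme outlier far from the rest of the data.
x_out = np.append(x, 20.0)
y_out = np.append(y, 20.0)
r_after = np.corrcoef(x_out, y_out)[0, 1]

print(f"without outlier: r = {r_before:.3f}")
print(f"with outlier:    r = {r_after:.3f}")
```

One point is enough to push *r* from 0 to above 0.9 here, even though the other five points show no linear trend at all.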

**(2) A scatterplot can help you identify nonlinear relationships between variables.**

A Pearson correlation coefficient merely tells us if two variables are *linearly* related. But even if a Pearson correlation coefficient tells us that two variables are uncorrelated, they could still have some type of nonlinear relationship. This is another reason that it’s helpful to create a scatterplot.

For example, consider the scatterplot below between variables *X* and *Y*, in which their correlation is *r* = **0.00**.

The variables clearly have no linear relationship, but they *do* have a nonlinear relationship: The y values are simply the x values squared. A correlation coefficient by itself couldn’t pick up on this relationship, but a scatterplot could.
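As a quick check of this, the following sketch builds a dataset where y is exactly x squared and confirms that Pearson’s *r* comes out as 0 despite the perfect (nonlinear) dependence:

```python
import numpy as np

# x is symmetric around 0, and y is exactly x squared.
x = np.array([-3, -2, -1, 0, 1, 2, 3], dtype=float)
y = x ** 2

# The downward and upward halves of the parabola cancel out, so the
# *linear* correlation is 0 even though y is fully determined by x.
r = np.corrcoef(x, y)[0, 1]
print(f"r = {r:.3f}")
```

A scatterplot of these seven points would reveal the parabola immediately, which is exactly why plotting the data matters.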

**Conclusion**

In summary:

- As a rule of thumb, a correlation greater than 0.75 is considered to be a “strong” correlation between two variables.
- However, this rule of thumb can vary from field to field. For example, a much lower correlation could be considered strong in a medical field than in a technology field. It’s best to use domain-specific expertise when deciding what is considered to be strong.
- When using a correlation to describe the relationship between two variables, it’s useful to also create a scatterplot so that you can identify any outliers in the dataset along with a potential nonlinear relationship.

**Additional Resources**

What is Considered to Be a “Weak” Correlation?

Correlation Matrix Calculator

How to Read a Correlation Matrix

As a rule of thumb, a correlation greater than 0.75 is considered to be a “strong” correlation between two variables.

I need the reference for this statement. Is it possible to provide one?

Hello,

For my Master’s thesis I am making a correlation heat map to throw out some variables that might be correlated and thus should not be included in my linear mixed model. I will use the thresholds indicated here, but I would need a reference, as I know that my professor doesn’t like internet references. Is there any paper you could provide that uses those thresholds?

Thanks

The size of the sample correlation coefficient does not tell its statistical significance; one needs to consult a table that shows the values of the correlation coefficient for different levels of significance. You need to know the sample size before you can say 0.75 is a strong correlation. These tables do not usually appear in modern statistics textbooks because regression has replaced correlation as an investigative tool. Some older books, like R.A. Fisher’s “Statistical Methods for Research Workers,” have such a table, as does Snedecor and Cochran’s “Statistical Methods” from 1967.

Hi Robert… The statement you provided is largely accurate and highlights several important points about the correlation coefficient, its statistical significance, and historical resources for understanding these concepts. Let me break it down:

### Key Points

1. **Correlation Coefficient and Statistical Significance**:

- The size of the sample correlation coefficient (denoted as \( r \)) does not inherently tell you whether the correlation is statistically significant. To determine the significance, you need to compare the observed correlation to a critical value from a statistical table, which depends on the sample size and the desired level of significance (typically 0.05 or 0.01).

- The significance of a correlation can be tested using the t-distribution, with the test statistic calculated as:

\[
t = \frac{r \sqrt{n-2}}{\sqrt{1-r^2}}
\]

where \( n \) is the sample size. This t-value can then be compared to a critical value from the t-distribution table with \( n-2 \) degrees of freedom.

2. **Sample Size and Interpretation of Correlation**:

- The sample size plays a critical role in determining whether a given correlation coefficient is statistically significant. A correlation of 0.75 might be considered strong, but its statistical significance depends on the sample size. For smaller samples, larger correlation coefficients are needed to achieve significance.

3. **Historical Resources**:

- In the past, statistical significance tables for correlation coefficients were commonly included in statistics textbooks. These tables provided critical values of the correlation coefficient for different sample sizes and significance levels.

- Notably, older statistical texts such as R.A. Fisher’s “Statistical Methods for Research Workers” and Snedecor and Cochran’s “Statistical Methods” (1967) include these tables.

4. **Modern Approaches**:

- Modern statistical practice often relies more on regression analysis than on simple correlation, as regression provides more detailed information about relationships between variables, including directionality and the influence of other variables.

- With advances in computing, software packages can easily calculate the exact p-values for correlation coefficients, making tables less necessary.

### Modern Calculation Example

To determine the significance of a correlation coefficient without using tables, you can use statistical software or programming languages. Here is an example using Python with the `scipy` library:

```python
import scipy.stats as stats

# Example correlation coefficient and sample size
r = 0.75
n = 30

# Calculate the t-statistic
t_stat = r * (n - 2) ** 0.5 / (1 - r ** 2) ** 0.5

# Calculate the two-tailed p-value
p_value = 2 * (1 - stats.t.cdf(abs(t_stat), df=n - 2))

print(f"t-statistic: {t_stat}, p-value: {p_value}")
```

### References to Older Resources

If you are interested in historical resources for statistical methods and tables, here are the references:

1. **R.A. Fisher’s “Statistical Methods for Research Workers”**:

- Fisher, R.A. (1925). *Statistical Methods for Research Workers*. Oliver & Boyd.

2. **Snedecor and Cochran’s “Statistical Methods” (1967)**:

- Snedecor, G.W., & Cochran, W.G. (1967). *Statistical Methods*. Iowa State University Press.

### Conclusion

Your statement is a good summary of the concepts surrounding the correlation coefficient and its significance. Understanding the historical context and modern computational tools provides a well-rounded perspective on this important topic in statistics.