PySpark: How to Calculate Correlation Between Two Columns

The Pearson correlation coefficient helps us quantify the strength and direction of the linear relationship between two variables.
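For reference, the coefficient is conventionally defined as follows. The symbols are not from this tutorial and are assumed here for illustration: n paired observations (x_i, y_i) with sample means x̄ and ȳ.

```latex
r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
         {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}
```

The numerator measures how the two variables vary together, and the denominator rescales it so that r always falls between -1 and 1.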

To calculate the correlation coefficient between two columns in a PySpark DataFrame, you can use the following syntax:

df.stat.corr('column1', 'column2')

This returns a value between -1 and 1 that represents the Pearson correlation coefficient between column1 and column2. Pearson is currently the only method that df.stat.corr supports.

The following example shows how to use this syntax in practice.

Example: Calculate Correlation Between Two Columns in PySpark

Suppose we have the following PySpark DataFrame that contains information about assists, rebounds and points for various basketball players:

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

#define data
data = [[4, 12, 22], 
        [5, 14, 24], 
        [5, 13, 26], 
        [6, 7, 26], 
        [7, 8, 29],
        [8, 8, 32],
        [8, 9, 20],
        [10, 13, 14]]

#define column names
columns = ['assists', 'rebounds', 'points']

#create dataframe using data and column names
df = spark.createDataFrame(data, columns)

#view dataframe
df.show()

+-------+--------+------+
|assists|rebounds|points|
+-------+--------+------+
|      4|      12|    22|
|      5|      14|    24|
|      5|      13|    26|
|      6|       7|    26|
|      7|       8|    29|
|      8|       8|    32|
|      8|       9|    20|
|     10|      13|    14|
+-------+--------+------+

We can use the following syntax to calculate the correlation between the assists and points columns in the DataFrame:

#calculate correlation between assists and points columns
df.stat.corr('assists', 'points')


The correlation coefficient turns out to be -0.32957.

Since this value is negative, there is a negative linear association between the two variables. Because its magnitude is fairly small, that association is relatively weak.

In other words, players with more assists tend to score fewer points, and players with fewer assists tend to score more.
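As a sanity check, the reported coefficient can be reproduced without Spark by applying the definition of the Pearson correlation to the same eight rows. This is only a minimal sketch: the helper name pearson_r is hypothetical and is not part of PySpark.

```python
import math

# Same values as the assists and points columns in the example DataFrame
assists = [4, 5, 5, 6, 7, 8, 8, 10]
points = [22, 24, 26, 26, 29, 32, 20, 14]

def pearson_r(x, y):
    """Pearson correlation computed directly from its definition (illustrative helper)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of products of deviations from the means (numerator)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    # Sums of squared deviations (used in the denominator)
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

r = pearson_r(assists, points)
print(round(r, 5))  # -0.32957
```

The result agrees with the value returned by df.stat.corr to five decimal places.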

To calculate the correlation coefficient between a different pair of columns, simply replace assists and points with the column names of your choice.

Related: What is Considered to Be a “Strong” Correlation?

Additional Resources

The following tutorials explain how to perform other common tasks in PySpark:

How to Calculate the Mean of a Column in PySpark
How to Sum Multiple Columns in PySpark DataFrame
How to Add Multiple Columns to PySpark DataFrame
