PySpark: How to Calculate Correlation Between Two Columns

The Pearson correlation coefficient helps us quantify the strength and direction of the linear relationship between two variables.

To calculate the correlation coefficient between two columns in a PySpark DataFrame, you can use the following syntax:

df.stat.corr('column1', 'column2')

This returns a value between -1 and 1: the Pearson correlation coefficient between column1 and column2. Both columns must be numeric.

The following example shows how to use this syntax in practice.

Example: Calculate Correlation Between Two Columns in PySpark

Suppose we have the following PySpark DataFrame that contains information about assists, rebounds and points for various basketball players:

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

#define data
data = [[4, 12, 22], 
        [5, 14, 24], 
        [5, 13, 26], 
        [6, 7, 26], 
        [7, 8, 29],
        [8, 8, 32],
        [8, 9, 20],
        [10, 13, 14]]

#define column names
columns = ['assists', 'rebounds', 'points']

#create dataframe using data and column names
df = spark.createDataFrame(data, columns)

#view dataframe
df.show()

+-------+--------+------+
|assists|rebounds|points|
+-------+--------+------+
|      4|      12|    22|
|      5|      14|    24|
|      5|      13|    26|
|      6|       7|    26|
|      7|       8|    29|
|      8|       8|    32|
|      8|       9|    20|
|     10|      13|    14|
+-------+--------+------+

We can use the following syntax to calculate the correlation between the assists and points columns in the DataFrame:

#calculate correlation between assists and points columns
df.stat.corr('assists', 'points')


The correlation coefficient turns out to be -0.32957.

Since this value is negative, it tells us that there is a negative linear association between the two variables. Because it is closer to 0 than to -1, that association is fairly weak.

In other words, players with more assists tend to have fewer points, and players with fewer assists tend to have more points.
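To see where -0.32957 comes from, the coefficient can be recomputed by hand. The sketch below is plain Python, not PySpark: it copies the assists and points values from the example into two lists and applies the standard Pearson formula. The pearson helper is a name made up for this illustration and is not part of any library.

```python
from math import sqrt

# Same values as the assists and points columns in the example DataFrame
assists = [4, 5, 5, 6, 7, 8, 8, 10]
points = [22, 24, 26, 26, 29, 32, 20, 14]

def pearson(x, y):
    """Hypothetical helper: Pearson correlation of two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-products of the deviations from each mean
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    # Sums of squared deviations for each variable
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

r = pearson(assists, points)
print(round(r, 5))  # -0.32957
```

The result matches the value returned by df.stat.corr('assists', 'points') to five decimal places.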

To calculate the correlation coefficient between two different columns, simply replace assists and points with the column names of your choice.

