The Pearson correlation coefficient helps us quantify the strength and direction of the linear relationship between two variables.

To calculate the correlation coefficient between two columns in a PySpark DataFrame, you can use the following syntax:

df.stat.corr('column1', 'column2')

This particular code will return a value between -1 and 1 that represents the Pearson correlation coefficient between **column1** and **column2**.

The following example shows how to use this syntax in practice.

**Example: Calculate Correlation Between Two Columns in PySpark**

Suppose we have the following PySpark DataFrame that contains information about assists, rebounds and points for various basketball players:

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

#define data
data = [[4, 12, 22],
        [5, 14, 24],
        [5, 13, 26],
        [6, 7, 26],
        [7, 8, 29],
        [8, 8, 32],
        [8, 9, 20],
        [10, 13, 14]]

#define column names
columns = ['assists', 'rebounds', 'points']

#create DataFrame using data and column names
df = spark.createDataFrame(data, columns)

#view DataFrame
df.show()

+-------+--------+------+
|assists|rebounds|points|
+-------+--------+------+
|      4|      12|    22|
|      5|      14|    24|
|      5|      13|    26|
|      6|       7|    26|
|      7|       8|    29|
|      8|       8|    32|
|      8|       9|    20|
|     10|      13|    14|
+-------+--------+------+

We can use the following syntax to calculate the correlation between the **assists** and **points** columns in the DataFrame:

#calculate correlation between assists and points columns
df.stat.corr('assists', 'points')

-0.32957304910500873

The correlation coefficient turns out to be **-0.32957**.
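As a quick sanity check that doesn't require a Spark session, you can reproduce this value from the Pearson formula in plain Python. The following is a minimal sketch using the same **assists** and **points** values as the DataFrame above (the `pearson` helper is written here for illustration; `df.stat.corr` performs the equivalent computation across the cluster):

```python
from math import sqrt

#same data as the assists and points columns above
assists = [4, 5, 5, 6, 7, 8, 8, 10]
points = [22, 24, 26, 26, 29, 32, 20, 14]

def pearson(x, y):
    #Pearson r = covariance(x, y) / (std dev of x * std dev of y)
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

print(pearson(assists, points))  #≈ -0.3296
```

This matches the value returned by `df.stat.corr('assists', 'points')`.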

Since this value is negative, it tells us that there is a negative association between the two variables.

In other words, when the value for **assists** increases, the value for **points** tends to decrease.

And when the value for **assists** decreases, the value for **points** tends to increase.

Feel free to replace **assists** and **points** with whatever two column names you'd like, to calculate the correlation coefficient between a different pair of columns.

**Related:** What is Considered to Be a “Strong” Correlation?

**Additional Resources**

The following tutorials explain how to perform other common tasks in PySpark:

How to Calculate the Mean of a Column in PySpark

How to Sum Multiple Columns in PySpark DataFrame

How to Add Multiple Columns to PySpark DataFrame