A correlation matrix is a square table that shows the correlation coefficients between variables in a dataset.

It offers a quick way to understand the strength of the linear relationships that exist between variables in a dataset.
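To make the idea concrete, here is a minimal sketch (using NumPy, which is not required by the PySpark code in this article) that computes a single Pearson correlation coefficient for two of the columns used later in this article, assists and rebounds:

```python
import numpy as np

# assists and rebounds values from the example DataFrame in this article
assists = np.array([4, 5, 5, 6, 7, 8, 8, 10])
rebounds = np.array([12, 14, 13, 7, 8, 8, 9, 13])

# np.corrcoef returns the 2x2 correlation matrix for the two arrays;
# the off-diagonal entry is the Pearson correlation coefficient
r = np.corrcoef(assists, rebounds)[0, 1]
print(round(r, 3))  # -0.245
```

A full correlation matrix simply computes this coefficient for every pair of columns at once.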

You can use the following syntax to create a correlation matrix from a PySpark DataFrame:

from pyspark.ml.stat import Correlation
from pyspark.ml.feature import VectorAssembler

#convert the DataFrame columns to a single vector column
vector_col = 'corr_features'
assembler = VectorAssembler(inputCols=df.columns, outputCol=vector_col)
df_vector = assembler.transform(df).select(vector_col)

#calculate correlation matrix
corr_matrix = Correlation.corr(df_vector, vector_col)

#display correlation matrix
corr_matrix.collect()[0]['pearson({})'.format(vector_col)].values

This particular code uses **VectorAssembler** to first combine the DataFrame columns into a single vector column, then uses the **Correlation.corr** function from **pyspark.ml.stat** to calculate the correlation matrix.

The following example shows how to use this syntax in practice.

**Example: How to Create a Correlation Matrix in PySpark**

Suppose we have the following PySpark DataFrame that contains information about assists, rebounds and points for various basketball players:

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
#define data
data = [[4, 12, 22],
[5, 14, 24],
[5, 13, 26],
[6, 7, 26],
[7, 8, 29],
[8, 8, 32],
[8, 9, 20],
[10, 13, 14]]
#define column names
columns = ['assists', 'rebounds', 'points']
#create dataframe using data and column names
df = spark.createDataFrame(data, columns)
#view dataframe
df.show()
+-------+--------+------+
|assists|rebounds|points|
+-------+--------+------+
| 4| 12| 22|
| 5| 14| 24|
| 5| 13| 26|
| 6| 7| 26|
| 7| 8| 29|
| 8| 8| 32|
| 8| 9| 20|
| 10| 13| 14|
+-------+--------+------+

We can use the following syntax to create a correlation matrix for this DataFrame:

from pyspark.ml.stat import Correlation
from pyspark.ml.feature import VectorAssembler

#convert the DataFrame columns to a single vector column
vector_col = 'corr_features'
assembler = VectorAssembler(inputCols=df.columns, outputCol=vector_col)
df_vector = assembler.transform(df).select(vector_col)

#calculate correlation matrix
corr_matrix = Correlation.corr(df_vector, vector_col)

#display correlation matrix
corr_matrix.collect()[0]['pearson({})'.format(vector_col)].values

array([ 1.        , -0.24486081, -0.32957305, -0.24486081,  1.        ,
       -0.52209174, -0.32957305, -0.52209174,  1.        ])

The correlation coefficients along the diagonal of the matrix are all equal to 1 because each variable is perfectly correlated with itself.

All of the other correlation coefficients indicate the correlation between different pairwise combinations of variables.

For example:

- The correlation coefficient between assists and rebounds is **-0.245**.
- The correlation coefficient between assists and points is **-0.330**.
- The correlation coefficient between rebounds and points is **-0.522**.
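Note that **.values** returns the matrix as a flat array stored row by row. One way to view it with labeled rows and columns is to reshape it with NumPy and wrap it in a pandas DataFrame (a sketch assuming NumPy and pandas are available; the array values are copied from the output above):

```python
import numpy as np
import pandas as pd

# flat array returned by .values in the PySpark output above
corr_values = np.array([ 1.        , -0.24486081, -0.32957305,
                        -0.24486081,  1.        , -0.52209174,
                        -0.32957305, -0.52209174,  1.        ])

cols = ['assists', 'rebounds', 'points']

# reshape the 9 values into a 3x3 matrix and label the rows and columns
corr_df = pd.DataFrame(corr_values.reshape(3, 3), index=cols, columns=cols)
print(corr_df)
```

The labeled matrix makes it easy to look up any pairwise coefficient, e.g. `corr_df.loc['assists', 'rebounds']`.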

**Related:** What is Considered to Be a “Strong” Correlation?

**Additional Resources**

The following tutorials explain how to perform other common tasks in PySpark:

How to Calculate the Mean of a Column in PySpark

How to Sum Multiple Columns in PySpark DataFrame

How to Add Multiple Columns to PySpark DataFrame