PySpark: How to Do a Left Join on Multiple Columns


You can use the following syntax in PySpark to perform a left join using multiple columns:

df_joined = df1.join(df2, on=[df1.col1==df2.col1, df1.col2==df2.col2], how='left')

This particular example will perform a left join using the DataFrames named df1 and df2 by joining on the columns named col1 and col2.

All rows from df1 will be returned in the final DataFrame, while rows from df2 will only be matched up if they have matching values in both col1 and col2; left rows with no match receive null in the columns that come from df2.
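For intuition, the semantics of a left join on two key columns can be sketched in plain Python (this is just an illustration of the logic, not how Spark executes the join; the column names mirror the example below):

```python
# Plain-Python sketch of left-join semantics on two key columns
left = [
    {'team': 'A', 'pos': 'G', 'points': 18},
    {'team': 'B', 'pos': 'G', 'points': 14},
]
right = [
    {'team': 'A', 'pos': 'G', 'assists': 4},
    {'team': 'C', 'pos': 'F', 'assists': 5},
]

# index the right-hand rows by their (team, pos) key
right_by_key = {(r['team'], r['pos']): r for r in right}

# every left row survives; rows with no match get None for right-side columns
joined = []
for row in left:
    match = right_by_key.get((row['team'], row['pos']))
    joined.append({**row, 'assists': match['assists'] if match else None})

print(joined)
```

The first left row finds a match on ('A', 'G') and picks up assists=4, while ('B', 'G') has no counterpart in the right-hand data and keeps None, which is exactly the null-filling behavior PySpark's left join produces.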

The following example shows how to use this syntax in practice.

Example: Left Join on Multiple Columns in PySpark

Suppose we have the following DataFrame named df1:

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

#define data
data1 = [['A', 'G', 18],
         ['A', 'F', 22],
         ['B', 'F', 19],
         ['B', 'G', 14]]

#define column names
columns1 = ['team', 'pos', 'points'] 
  
#create dataframe using data and column names
df1 = spark.createDataFrame(data1, columns1) 
  
#view dataframe
df1.show()

+----+---+------+
|team|pos|points|
+----+---+------+
|   A|  G|    18|
|   A|  F|    22|
|   B|  F|    19|
|   B|  G|    14|
+----+---+------+

And suppose we have another DataFrame named df2:

#define data
data2 = [['A', 'G', 4],
         ['A', 'F', 9],
         ['B', 'F', 8],
         ['C', 'G', 6],
         ['C', 'F', 5]]

#define column names
columns2 = ['team_name', 'position', 'assists'] 
  
#create dataframe using data and column names
df2 = spark.createDataFrame(data2, columns2) 
  
#view dataframe
df2.show()

+---------+--------+-------+
|team_name|position|assists|
+---------+--------+-------+
|        A|       G|      4|
|        A|       F|      9|
|        B|       F|      8|
|        C|       G|      6|
|        C|       F|      5|
+---------+--------+-------+

Suppose we would like to perform a left join between these two DataFrames by joining on the following columns:

  • Where team from df1 matches team_name from df2.
  • Where pos from df1 matches position from df2.

We can use the following syntax to do so:

#perform left join
df_joined = df1.join(df2, on=[df1.team==df2.team_name, df1.pos==df2.position], how='left')

#view resulting DataFrame
df_joined.show()

+----+---+------+---------+--------+-------+
|team|pos|points|team_name|position|assists|
+----+---+------+---------+--------+-------+
|   A|  G|    18|        A|       G|      4|
|   A|  F|    22|        A|       F|      9|
|   B|  F|    19|        B|       F|      8|
|   B|  G|    14|     null|    null|   null|
+----+---+------+---------+--------+-------+

Lastly, we can drop the team_name and position columns from the resulting DataFrame since they’re redundant:

#drop 'team_name' and 'position' columns from joined DataFrame
df_joined.drop('team_name', 'position').show()

+----+---+------+-------+
|team|pos|points|assists|
+----+---+------+-------+
|   A|  G|    18|      4|
|   A|  F|    22|      9|
|   B|  F|    19|      8|
|   B|  G|    14|   null|
+----+---+------+-------+

We have now successfully performed a left join using multiple columns.

Note that all rows from the left DataFrame (df1) exist in the joined DataFrame, but only the rows from the right DataFrame (df2) that had matching values in both the team and position columns made it to the final joined DataFrame.

Additional Resources

The following tutorials explain how to perform other common tasks in PySpark:

PySpark: How to Create New DataFrame from Existing DataFrame
PySpark: How to Select Rows Based on Column Values
PySpark: How to Select Rows by Index in DataFrame
PySpark: How to Select Columns by Index in DataFrame
