How to Perform an Anti-Join in PySpark


An anti-join allows you to return all rows in one DataFrame that do not have matching values in another DataFrame.

You can use the following syntax to perform an anti-join between two PySpark DataFrames:

df_anti_join = df1.join(df2, on=['team'], how='left_anti')

This particular example will perform an anti-join using the DataFrames named df1 and df2 and will only return the rows from df1 where the value in the team column does not belong in the team column of df2.

The following example shows how to use this syntax in practice.

Example: How to Perform an Anti-Join in PySpark

Suppose we have the following DataFrame named df1:

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

#define data
data1 = [['A', 18], 
       ['B', 22], 
       ['C', 19], 
       ['D', 14],
       ['E', 30]]

#define column names
columns1 = ['team', 'points'] 
  
#create dataframe using data and column names
df1 = spark.createDataFrame(data1, columns1) 
  
#view dataframe
df1.show()

+----+------+
|team|points|
+----+------+
|   A|    18|
|   B|    22|
|   C|    19|
|   D|    14|
|   E|    30|
+----+------+

And suppose we have another DataFrame named df2:

#define data
data2 = [['A', 18], 
       ['B', 22], 
       ['C', 19], 
       ['F', 22],
       ['G', 29]]

#define column names
columns2 = ['team', 'points'] 
  
#create dataframe using data and column names
df2 = spark.createDataFrame(data2, columns2) 
  
#view dataframe
df2.show()

+----+------+
|team|points|
+----+------+
|   A|    18|
|   B|    22|
|   C|    19|
|   F|    22|
|   G|    29|
+----+------+

We can use the following syntax to perform an anti-join and return all rows in the first DataFrame that do not have a matching team in the second DataFrame:

#perform anti-join
df_anti_join = df1.join(df2, on=['team'], how='left_anti')

#view resulting DataFrame
df_anti_join.show()

+----+------+
|team|points|
+----+------+
|   D|    14|
|   E|    30|
+----+------+

We can see that there are exactly two teams from the first DataFrame that do not have a matching team name in the second DataFrame.

The anti-join worked as expected.

The end result is one DataFrame that only contains the rows where the team name belongs to the first DataFrame but not the second DataFrame.

Additional Resources

The following tutorials explain how to perform other common tasks in PySpark:

How to Do a Right Join in PySpark
How to Do a Left Join in PySpark
How to Do a Left Join on Multiple Columns in PySpark

Featured Posts

Leave a Reply

Your email address will not be published. Required fields are marked *