PySpark: How to Perform Union and Return Distinct Rows


You can use the following syntax to perform a union on two PySpark DataFrames and return only distinct rows:

df_union_distinct = df1.union(df2).distinct()

This particular example performs a union between the PySpark DataFrames named df1 and df2 and returns only the distinct rows between the two DataFrames.

The following example shows how to use this syntax in practice.

Example: How to Perform Union and Return Distinct Rows in PySpark

Suppose we have the following PySpark DataFrame named df1:

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

#define data
data1 = [['A', 'East', 11], 
         ['B', 'East', 8], 
         ['C', 'East', 31], 
         ['D', 'West', 16], 
         ['E', 'West', 6]]
  
#define column names
columns1 = ['team', 'conference', 'points'] 
  
#create DataFrame
df1 = spark.createDataFrame(data1, columns1) 
  
#view DataFrame
df1.show()

+----+----------+------+
|team|conference|points|
+----+----------+------+
|   A|      East|    11|
|   B|      East|     8|
|   C|      East|    31|
|   D|      West|    16|
|   E|      West|     6|
+----+----------+------+

And suppose we have another DataFrame named df2:

#define data
data2 = [['A', 'East', 11], 
         ['B', 'East', 8], 
         ['G', 'East', 31], 
         ['H', 'West', 16]]

#define column names
columns2 = ['team', 'conference', 'points'] 
  
#create DataFrame
df2 = spark.createDataFrame(data2, columns2) 
  
#view DataFrame
df2.show()

+----+----------+------+
|team|conference|points|
+----+----------+------+
|   A|      East|    11|
|   B|      East|     8|
|   G|      East|    31|
|   H|      West|    16|
+----+----------+------+

Suppose we use the following syntax to perform a union between these two DataFrames:

#perform union with df1 and df2
df_union = df1.union(df2)

#view final DataFrame
df_union.show()

+----+----------+------+
|team|conference|points|
+----+----------+------+
|   A|      East|    11|
|   B|      East|     8|
|   C|      East|    31|
|   D|      West|    16|
|   E|      West|     6|
|   A|      East|    11|
|   B|      East|     8|
|   G|      East|    31|
|   H|      West|    16|
+----+----------+------+

Notice that the rows with team A and B from each DataFrame are shown in the final DataFrame, even though these rows are duplicates.

To perform a union and only return the distinct rows, we can use the following syntax:

#perform union with df1 and df2 and return only distinct rows
df_union_distinct = df1.union(df2).distinct()

#view final DataFrame
df_union_distinct.show()

+----+----------+------+
|team|conference|points|
+----+----------+------+
|   A|      East|    11|
|   B|      East|     8|
|   C|      East|    31|
|   D|      West|    16|
|   E|      West|     6|
|   G|      East|    31|
|   H|      West|    16|
+----+----------+------+

Notice that we have successfully performed a union between the two DataFrames and only the distinct rows are returned.

Note: You can find the complete documentation for the PySpark union function here.

Additional Resources

The following tutorials explain how to perform other common tasks in PySpark:

How to Concatenate Columns in PySpark
How to Vertically Concatenate DataFrames in PySpark
How to Get Rows Which Are Not in Another PySpark DataFrame

Featured Posts

Leave a Reply

Your email address will not be published. Required fields are marked *