PySpark: How to Union DataFrames with Different Columns

You can use the following syntax to perform a union on two PySpark DataFrames that contain different columns:

df_union = df1.unionByName(df2, allowMissingColumns=True)

This particular example performs a union between the PySpark DataFrames named df1 and df2.

By using the argument allowMissingColumns=True, we specify that the set of column names between the two DataFrames are allowed to differ.

The following example shows how to use this syntax in practice.

Example: How to Union DataFrames with Different Columns in PySpark

Suppose we have the following PySpark DataFrame named df1 that contains the columns team, conference and points:

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

#define data
data1 = [['A', 'East', 11], 
        ['B', 'East', 8], 
        ['C', 'East', 31], 
        ['D', 'West', 16], 
        ['E', 'West', 6], 
        ['F', 'East', 5]]
#define column names
columns1 = ['team', 'conference', 'points'] 
#create DataFrame
df1 = spark.createDataFrame(data1, columns1) 
#view DataFrame

|   A|      East|    11|
|   B|      East|     8|
|   C|      East|    31|
|   D|      West|    16|
|   E|      West|     6|
|   F|      East|     5|

And suppose we have another DataFrame named df2 that contains the columns team and assists:

#define data
data2 = [['G', 4], 
        ['H', 8], 
        ['I', 11], 
        ['J', 5], 
        ['K', 2], 
        ['L', 4]]
#define column names
columns2 = ['team', 'assists'] 
#create DataFrame
df2 = spark.createDataFrame(data2, columns2) 
#view DataFrame

|   G|      4|
|   H|      8|
|   I|     11|
|   J|      5|
|   K|      2|
|   L|      4|

We can use the following syntax to perform a union on these two DataFrames:

#perform union with df1 and df2
df_union = df1.unionByName(df2, allowMissingColumns=True)

#view final DataFrame

|   A|      East|    11|   null|
|   A|      East|     8|   null|
|   A|      East|    31|   null|
|   B|      West|    16|   null|
|   B|      West|     6|   null|
|   C|      East|     5|   null|
|   A|      null|  null|      4|
|   A|      null|  null|      8|
|   A|      null|  null|     11|
|   B|      null|  null|      5|
|   B|      null|  null|      2|
|   C|      null|  null|      4|

The final DataFrame contains all of the rows from both DataFrames and any columns that don’t match between the two DataFrames simply produce null values.

Note: You can find the complete documentation for the PySpark unionByName function here.

Additional Resources

The following tutorials explain how to perform other common tasks in PySpark:

How to Concatenate Columns in PySpark
How to Vertically Concatenate DataFrames in PySpark
How to Get Rows Which Are Not in Another PySpark DataFrame

Featured Posts

Leave a Reply

Your email address will not be published. Required fields are marked *