You can use the following syntax to perform a union on two PySpark DataFrames that contain different columns:
df_union = df1.unionByName(df2, allowMissingColumns=True)
This particular example performs a union between the PySpark DataFrames named df1 and df2.
By using the argument allowMissingColumns=True, we specify that the set of column names between the two DataFrames are allowed to differ.
The following example shows how to use this syntax in practice.
Example: How to Union DataFrames with Different Columns in PySpark
Suppose we have the following PySpark DataFrame named df1 that contains the columns team, conference and points:
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
#define data
data1 = [['A', 'East', 11],
['B', 'East', 8],
['C', 'East', 31],
['D', 'West', 16],
['E', 'West', 6],
['F', 'East', 5]]
#define column names
columns1 = ['team', 'conference', 'points']
#create DataFrame
df1 = spark.createDataFrame(data1, columns1)
#view DataFrame
df1.show()
+----+----------+------+
|team|conference|points|
+----+----------+------+
| A| East| 11|
| B| East| 8|
| C| East| 31|
| D| West| 16|
| E| West| 6|
| F| East| 5|
+----+----------+------+
And suppose we have another DataFrame named df2 that contains the columns team and assists:
#define data
data2 = [['G', 4],
['H', 8],
['I', 11],
['J', 5],
['K', 2],
['L', 4]]
#define column names
columns2 = ['team', 'assists']
#create DataFrame
df2 = spark.createDataFrame(data2, columns2)
#view DataFrame
df2.show()
+----+-------+
|team|assists|
+----+-------+
| G| 4|
| H| 8|
| I| 11|
| J| 5|
| K| 2|
| L| 4|
+----+-------+
We can use the following syntax to perform a union on these two DataFrames:
#perform union with df1 and df2 df_union = df1.unionByName(df2, allowMissingColumns=True) #view final DataFrame df_union.show() +----+----------+------+-------+ |team|conference|points|assists| +----+----------+------+-------+ | A| East| 11| null| | A| East| 8| null| | A| East| 31| null| | B| West| 16| null| | B| West| 6| null| | C| East| 5| null| | A| null| null| 4| | A| null| null| 8| | A| null| null| 11| | B| null| null| 5| | B| null| null| 2| | C| null| null| 4| +----+----------+------+-------+
The final DataFrame contains all of the rows from both DataFrames and any columns that don’t match between the two DataFrames simply produce null values.
Note: You can find the complete documentation for the PySpark unionByName function here.
Additional Resources
The following tutorials explain how to perform other common tasks in PySpark:
How to Concatenate Columns in PySpark
How to Vertically Concatenate DataFrames in PySpark
How to Get Rows Which Are Not in Another PySpark DataFrame