How to Vertically Concatenate DataFrames in PySpark


You can use the following syntax to vertically concatenate multiple PySpark DataFrames:

from functools import reduce
from pyspark.sql import DataFrame

#specify DataFrames to concatenate
df_list = [df1,df2,df3]

#vertically concatenate all DataFrames in list
df_all = reduce(DataFrame.unionAll, df_list)

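This particular example uses the reduce function along with the unionAll function to vertically concatenate the DataFrames named df1, df2 and df3 into one DataFrame called df_all. Note that unionAll matches columns by position rather than by name and keeps duplicate rows, so each DataFrame in the list should have the same columns in the same order.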
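
To make the mechanics of reduce concrete, the following minimal sketch uses plain Python lists as stand-ins for DataFrames (an assumption made purely for illustration, so that it runs without a Spark session), with list concatenation playing the role of unionAll. It shows that reduce folds the list from left to right, which is equivalent to chaining the calls as df1.unionAll(df2).unionAll(df3):

```python
from functools import reduce
import operator

#plain lists stand in for DataFrames here (illustrative assumption, not PySpark objects)
rows1 = [('Mavs', 18)]
rows2 = [('Kings', 15)]
rows3 = [('Celtics', 25)]

rows_list = [rows1, rows2, rows3]

#reduce folds left to right: (rows1 + rows2) + rows3
#operator.add on lists plays the role of DataFrame.unionAll
rows_all = reduce(operator.add, rows_list)

#the chained form produces the same result
chained = rows1 + rows2 + rows3

print(rows_all)             # [('Mavs', 18), ('Kings', 15), ('Celtics', 25)]
print(rows_all == chained)  # True
```

With real DataFrames the same folding applies, so reduce(DataFrame.unionAll, df_list) should behave like the chained form no matter how many DataFrames are in the list.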

The following example shows how to use this syntax in practice.

Example: How to Vertically Concatenate DataFrames in PySpark

Suppose we have three PySpark DataFrames that each contain information about points scored by basketball players on various teams:

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

#define data
data1 = [['Mavs', 18], 
        ['Nets', 33], 
        ['Lakers', 12]]

data2 = [['Kings', 15], 
        ['Hawks', 19],
        ['Wizards', 24],
        ['Magic', 28]]

data3 = [['Celtics', 25], 
        ['Spurs', 29],
        ['Rockets', 14],
        ['Heat', 30]] 
  
#define column names
columns = ['team', 'points'] 
  
#create dataframes using data and column names
df1 = spark.createDataFrame(data1, columns) 
df2 = spark.createDataFrame(data2, columns)
df3 = spark.createDataFrame(data3, columns)
  
#view dataframes
df1.show()
df2.show()
df3.show()

+------+------+
|  team|points|
+------+------+
|  Mavs|    18|
|  Nets|    33|
|Lakers|    12|
+------+------+

+-------+------+
|   team|points|
+-------+------+
|  Kings|    15|
|  Hawks|    19|
|Wizards|    24|
|  Magic|    28|
+-------+------+

+-------+------+
|   team|points|
+-------+------+
|Celtics|    25|
|  Spurs|    29|
|Rockets|    14|
|   Heat|    30|
+-------+------+

Suppose we would like to vertically concatenate all three DataFrames into one DataFrame.

We can use the following syntax to do so:

from functools import reduce
from pyspark.sql import DataFrame

#specify DataFrames to concatenate
df_list = [df1,df2,df3]

#vertically concatenate all DataFrames in list
df_all = reduce(DataFrame.unionAll, df_list)

#view resulting DataFrame
df_all.show()

+-------+------+
|   team|points|
+-------+------+
|   Mavs|    18|
|   Nets|    33|
| Lakers|    12|
|  Kings|    15|
|  Hawks|    19|
|Wizards|    24|
|  Magic|    28|
|Celtics|    25|
|  Spurs|    29|
|Rockets|    14|
|   Heat|    30|
+-------+------+

The new DataFrame named df_all contains all 11 rows from the three DataFrames concatenated vertically.

Note: You can find the complete documentation for the PySpark unionAll function in the official PySpark API reference.

Additional Resources

The following tutorials explain how to perform other common tasks in PySpark:

How to Concatenate Columns in PySpark
How to Do a Left Join in PySpark
How to Do an Inner Join in PySpark
