PySpark: How to Add Column from Another DataFrame


You can use the following syntax to add a column from one PySpark DataFrame to another DataFrame:

from pyspark.sql.functions import row_number,lit
from pyspark.sql.window import Window

#add column to each DataFrame called 'id' that contains row numbers from 1 to n
w = Window().orderBy(lit('A'))
df1 = df1.withColumn('id', row_number().over(w))
df2 = df2.withColumn('id', row_number().over(w))

#join together both DataFrames using 'id' column
final_df = df1.join(df2, on=['id']).drop('id')

The following example shows how to use this syntax in practice.

Example: How to Add Column from Another DataFrame in PySpark

Suppose we have one PySpark DataFrame named df1 that contains one column of basketball team names:

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

#define data
data1 = [['Mavs'], 
        ['Nets'],
        ['Nets'],
        ['Blazers'],
        ['Heat'],
        ['Heat'],
        ['Thunder']]
  
#define column name
columns1 = ['team'] 
  
#create dataframe using data and column name
df1 = spark.createDataFrame(data1, columns) 
  
#view dataframe
df1.show()

+-------+
|   team|
+-------+
|   Mavs|
|   Nets|
|   Nets|
|Blazers|
|   Heat|
|   Heat|
|Thunder|
+-------+

And suppose we have another PySpark DataFrame named df2 that contains one column of points values:

#define data
data2 = [[22], 
        [25],
        [41],
        [17],
        [32],
        [50],
        [18]]
  
#define column name
columns2 = ['points'] 
  
#create dataframe using data and column name
df2 = spark.createDataFrame(data2, columns2) 
  
#view dataframe
df2.show()

+------+
|points|
+------+
|    22|
|    25|
|    41|
|    17|
|    32|
|    50|
|    18|
+------+

We can use the following syntax to add the points column from df2 to the DataFrame named df1:

from pyspark.sql.functions import row_number,lit
from pyspark.sql.window import Window

#add column to each DataFrame called 'id' that contains row numbers from 1 to n
w = Window().orderBy(lit('A'))
df1 = df1.withColumn('id', row_number().over(w))
df2 = df2.withColumn('id', row_number().over(w))

#join together both DataFrames using 'id' column
final_df = df1.join(df2, on=['id']).drop('id')

#view final DataFrame
final_df.show()

+-------+------+
|   team|points|
+-------+------+
|   Mavs|    22|
|   Nets|    25|
|   Nets|    41|
|Blazers|    17|
|   Heat|    32|
|   Heat|    50|
|Thunder|    18|
+-------+------+

Notice that the points column from df2 has been successfully added to the DataFrame named df1.

Additional Resources

The following tutorials explain how to perform other common tasks in PySpark:

PySpark: How to Add New Column with Constant Value
PySpark: How to Print One Column of a DataFrame
PySpark: How to Check Data Type of Columns in DataFrame

Featured Posts

Leave a Reply

Your email address will not be published. Required fields are marked *