You can use the following syntax to add a column from one PySpark DataFrame to another DataFrame:
from pyspark.sql.functions import row_number,lit
from pyspark.sql.window import Window
#add column to each DataFrame called 'id' that contains row numbers from 1 to n
w = Window().orderBy(lit('A'))
df1 = df1.withColumn('id', row_number().over(w))
df2 = df2.withColumn('id', row_number().over(w))
#join together both DataFrames using 'id' column
final_df = df1.join(df2, on=['id']).drop('id')
The following example shows how to use this syntax in practice.
Example: How to Add Column from Another DataFrame in PySpark
Suppose we have one PySpark DataFrame named df1 that contains one column of basketball team names:
from pyspark.sql import SparkSession spark = SparkSession.builder.getOrCreate() #define data data1 = [['Mavs'], ['Nets'], ['Nets'], ['Blazers'], ['Heat'], ['Heat'], ['Thunder']] #define column name columns1 = ['team'] #create dataframe using data and column name df1 = spark.createDataFrame(data1, columns) #view dataframe df1.show() +-------+ | team| +-------+ | Mavs| | Nets| | Nets| |Blazers| | Heat| | Heat| |Thunder| +-------+
And suppose we have another PySpark DataFrame named df2 that contains one column of points values:
#define data data2 = [[22], [25], [41], [17], [32], [50], [18]] #define column name columns2 = ['points'] #create dataframe using data and column name df2 = spark.createDataFrame(data2, columns2) #view dataframe df2.show() +------+ |points| +------+ | 22| | 25| | 41| | 17| | 32| | 50| | 18| +------+
We can use the following syntax to add the points column from df2 to the DataFrame named df1:
from pyspark.sql.functions import row_number,lit
from pyspark.sql.window import Window
#add column to each DataFrame called 'id' that contains row numbers from 1 to n
w = Window().orderBy(lit('A'))
df1 = df1.withColumn('id', row_number().over(w))
df2 = df2.withColumn('id', row_number().over(w))
#join together both DataFrames using 'id' column
final_df = df1.join(df2, on=['id']).drop('id')
#view final DataFrame
final_df.show()
+-------+------+
| team|points|
+-------+------+
| Mavs| 22|
| Nets| 25|
| Nets| 41|
|Blazers| 17|
| Heat| 32|
| Heat| 50|
|Thunder| 18|
+-------+------+
Notice that the points column from df2 has been successfully added to the DataFrame named df1.
Additional Resources
The following tutorials explain how to perform other common tasks in PySpark:
PySpark: How to Add New Column with Constant Value
PySpark: How to Print One Column of a DataFrame
PySpark: How to Check Data Type of Columns in DataFrame