You can use the following basic syntax to create a duplicate column in a PySpark DataFrame:
df_new = df.withColumn('my_duplicate_column', df['original_column'])
The following example shows how to use this syntax in practice.
Example: How to Create Duplicate Column in PySpark DataFrame
Suppose we have the following PySpark DataFrame that contains information about various basketball players:
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

#define data
data = [['A', 'Guard', 11, 5],
        ['A', 'Guard', 8, 4],
        ['A', 'Forward', 22, 3],
        ['A', 'Forward', 22, 6],
        ['B', 'Guard', 14, 3],
        ['B', 'Guard', 14, 5],
        ['B', 'Forward', 13, 7],
        ['B', 'Forward', 14, 8],
        ['C', 'Forward', 23, 2],
        ['C', 'Guard', 30, 5]]

#define column names
columns = ['team', 'position', 'points', 'assists']

#create dataframe using data and column names
df = spark.createDataFrame(data, columns)

#view dataframe
df.show()

+----+--------+------+-------+
|team|position|points|assists|
+----+--------+------+-------+
|   A|   Guard|    11|      5|
|   A|   Guard|     8|      4|
|   A| Forward|    22|      3|
|   A| Forward|    22|      6|
|   B|   Guard|    14|      3|
|   B|   Guard|    14|      5|
|   B| Forward|    13|      7|
|   B| Forward|    14|      8|
|   C| Forward|    23|      2|
|   C|   Guard|    30|      5|
+----+--------+------+-------+
We can use the following code to create a duplicate of the points column and name it points_duplicate:
#create duplicate of 'points' column
df_new = df.withColumn('points_duplicate', df['points'])
#view new DataFrame
df_new.show()
+----+--------+------+-------+----------------+
|team|position|points|assists|points_duplicate|
+----+--------+------+-------+----------------+
|   A|   Guard|    11|      5|              11|
|   A|   Guard|     8|      4|               8|
|   A| Forward|    22|      3|              22|
|   A| Forward|    22|      6|              22|
|   B|   Guard|    14|      3|              14|
|   B|   Guard|    14|      5|              14|
|   B| Forward|    13|      7|              13|
|   B| Forward|    14|      8|              14|
|   C| Forward|    23|      2|              23|
|   C|   Guard|    30|      5|              30|
+----+--------+------+-------+----------------+
Notice that the points_duplicate column contains the exact same values as the points column.
Note that the duplicate column must have a different name than the original column. If you pass the name of an existing column, withColumn replaces that column instead of adding a new one.
For example, the following code does not create a duplicate column:
#attempt to create duplicate points column
df_new = df.withColumn('points', df['points'])
#view new DataFrame
df_new.show()
+----+--------+------+-------+
|team|position|points|assists|
+----+--------+------+-------+
|   A|   Guard|    11|      5|
|   A|   Guard|     8|      4|
|   A| Forward|    22|      3|
|   A| Forward|    22|      6|
|   B|   Guard|    14|      3|
|   B|   Guard|    14|      5|
|   B| Forward|    13|      7|
|   B| Forward|    14|      8|
|   C| Forward|    23|      2|
|   C|   Guard|    30|      5|
+----+--------+------+-------+
No duplicate column was created. Because the name points already exists in the DataFrame, withColumn replaced that column with itself rather than adding a new one.
Note: You can find the complete documentation for the PySpark withColumn function in the official PySpark documentation.
Additional Resources
The following tutorials explain how to perform other common tasks in PySpark:
PySpark: How to Add New Column with Constant Value
PySpark: How to Add Multiple Columns to DataFrame