How to Add Multiple Columns to PySpark DataFrame


You can use the following methods to add multiple new columns to a PySpark DataFrame:

Method 1: Add Multiple Empty Columns

from pyspark.sql.functions import lit

#add three empty columns
for col in ['new_col1', 'new_col2', 'new_col3']:
    df = df.withColumn(col, lit(None))

Method 2: Add Multiple Columns Based on Existing Columns

#add three new columns based on values in the 'points' column
df = df.withColumn('points2', df.points*2)\
       .withColumn('points3', df.points*3)\
       .withColumn('points_half', df.points/2)

The following examples show how to use each method in practice with this PySpark DataFrame:

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

#define data
data = [['A', 'Guard', 11], 
        ['A', 'Guard', 8], 
        ['A', 'Forward', 22], 
        ['A', 'Forward', 22], 
        ['B', 'Guard', 14], 
        ['B', 'Guard', 14],
        ['B', 'Forward', 13],
        ['B', 'Forward', 7]] 
  
#define column names
columns = ['team', 'position', 'points'] 
  
#create dataframe using data and column names
df = spark.createDataFrame(data, columns) 
  
#view dataframe
df.show()

+----+--------+------+
|team|position|points|
+----+--------+------+
|   A|   Guard|    11|
|   A|   Guard|     8|
|   A| Forward|    22|
|   A| Forward|    22|
|   B|   Guard|    14|
|   B|   Guard|    14|
|   B| Forward|    13|
|   B| Forward|     7|
+----+--------+------+

Example 1: Add Multiple Empty Columns

We can use the following syntax to add three new empty columns to the existing DataFrame:

from pyspark.sql.functions import lit

#add three empty columns
for col in ['new_col1', 'new_col2', 'new_col3']:
    df = df.withColumn(col, lit(None))

#view updated DataFrame
df.show()

+----+--------+------+--------+--------+--------+
|team|position|points|new_col1|new_col2|new_col3|
+----+--------+------+--------+--------+--------+
|   A|   Guard|    11|    null|    null|    null|
|   A|   Guard|     8|    null|    null|    null|
|   A| Forward|    22|    null|    null|    null|
|   A| Forward|    22|    null|    null|    null|
|   B|   Guard|    14|    null|    null|    null|
|   B|   Guard|    14|    null|    null|    null|
|   B| Forward|    13|    null|    null|    null|
|   B| Forward|     7|    null|    null|    null|
+----+--------+------+--------+--------+--------+

Notice that three new columns named new_col1, new_col2 and new_col3 have been added to the existing DataFrame.

Since we used the lit function to supply a literal value of None, every value in each new column is null.

Example 2: Add Multiple Columns Based on Existing Columns

We can use the following syntax to add three new columns whose values are all based on the values in the existing points column of the DataFrame:

#add three new columns based on values in the 'points' column
df = df.withColumn('points2', df.points*2)\
       .withColumn('points3', df.points*3)\
       .withColumn('points_half', df.points/2)

#view updated DataFrame
df.show()

+----+--------+------+-------+-------+-----------+
|team|position|points|points2|points3|points_half|
+----+--------+------+-------+-------+-----------+
|   A|   Guard|    11|     22|     33|        5.5|
|   A|   Guard|     8|     16|     24|        4.0|
|   A| Forward|    22|     44|     66|       11.0|
|   A| Forward|    22|     44|     66|       11.0|
|   B|   Guard|    14|     28|     42|        7.0|
|   B|   Guard|    14|     28|     42|        7.0|
|   B| Forward|    13|     26|     39|        6.5|
|   B| Forward|     7|     14|     21|        3.5|
+----+--------+------+-------+-------+-----------+

Notice that three new columns have been added to the DataFrame whose values are based on the existing values in the points column.

Additional Resources

The following tutorials explain how to perform other common tasks in PySpark:

PySpark: How to Add New Column with Constant Value
PySpark: How to Add Column from Another DataFrame
PySpark: How to Print One Column of a DataFrame
