You can use the following methods to add multiple new columns to a PySpark DataFrame:
Method 1: Add Multiple Empty Columns
from pyspark.sql.functions import lit

#add three empty columns
for col in ['new_col1', 'new_col2', 'new_col3']:
    df = df.withColumn(col, lit(None))
Method 2: Add Multiple Columns Based on Existing Columns
#add three new columns based on values in 'points' column
df = df.withColumn('points2', df.points*2)\
       .withColumn('points3', df.points*3)\
       .withColumn('points_half', df.points/2)
The following examples show how to use each method in practice with the following PySpark DataFrame:
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

#define data
data = [['A', 'Guard', 11],
        ['A', 'Guard', 8],
        ['A', 'Forward', 22],
        ['A', 'Forward', 22],
        ['B', 'Guard', 14],
        ['B', 'Guard', 14],
        ['B', 'Forward', 13],
        ['B', 'Forward', 7]]

#define column names
columns = ['team', 'position', 'points']

#create dataframe using data and column names
df = spark.createDataFrame(data, columns)

#view dataframe
df.show()

+----+--------+------+
|team|position|points|
+----+--------+------+
|   A|   Guard|    11|
|   A|   Guard|     8|
|   A| Forward|    22|
|   A| Forward|    22|
|   B|   Guard|    14|
|   B|   Guard|    14|
|   B| Forward|    13|
|   B| Forward|     7|
+----+--------+------+
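If you want to confirm the data types that Spark inferred for this DataFrame before adding new columns, you can print its schema. This step isn't required for the examples below; it's just a quick way to see the column types (team and position should come through as strings, and points as a long, though the exact types depend on your data):

#view the schema that Spark inferred for the DataFrame
df.printSchema()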
Example 1: Add Multiple Empty Columns
We can use the following syntax to add three new empty columns to the existing DataFrame:
from pyspark.sql.functions import lit

#add three empty columns
for col in ['new_col1', 'new_col2', 'new_col3']:
    df = df.withColumn(col, lit(None))

#view updated DataFrame
df.show()

+----+--------+------+--------+--------+--------+
|team|position|points|new_col1|new_col2|new_col3|
+----+--------+------+--------+--------+--------+
|   A|   Guard|    11|    null|    null|    null|
|   A|   Guard|     8|    null|    null|    null|
|   A| Forward|    22|    null|    null|    null|
|   A| Forward|    22|    null|    null|    null|
|   B|   Guard|    14|    null|    null|    null|
|   B|   Guard|    14|    null|    null|    null|
|   B| Forward|    13|    null|    null|    null|
|   B| Forward|     7|    null|    null|    null|
+----+--------+------+--------+--------+--------+
Notice that three new columns with the names new_col1, new_col2 and new_col3 have all been added to the existing DataFrame.
Since we use the lit function to specify a literal value of None, the values in each new column are simply all null values.
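Note that lit(None) on its own produces columns with Spark's null type, which some downstream operations or file formats may not accept. If you need the empty columns to have a concrete data type, one approach (a minimal sketch, assuming you want string columns) is to cast the literal when adding each column:

from pyspark.sql.functions import lit

#add three empty columns, casting each one to string so it has a concrete data type
for col in ['new_col1', 'new_col2', 'new_col3']:
    df = df.withColumn(col, lit(None).cast('string'))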
Example 2: Add Multiple Columns Based on Existing Columns
We can use the following syntax to add three new columns whose values are all based on the values in the existing points column of the DataFrame:
#add three new columns based on values in 'points' column
df = df.withColumn('points2', df.points*2)\
       .withColumn('points3', df.points*3)\
       .withColumn('points_half', df.points/2)

#view updated DataFrame
df.show()

+----+--------+------+-------+-------+-----------+
|team|position|points|points2|points3|points_half|
+----+--------+------+-------+-------+-----------+
|   A|   Guard|    11|     22|     33|        5.5|
|   A|   Guard|     8|     16|     24|        4.0|
|   A| Forward|    22|     44|     66|       11.0|
|   A| Forward|    22|     44|     66|       11.0|
|   B|   Guard|    14|     28|     42|        7.0|
|   B|   Guard|    14|     28|     42|        7.0|
|   B| Forward|    13|     26|     39|        6.5|
|   B| Forward|     7|     14|     21|        3.5|
+----+--------+------+-------+-------+-----------+
Notice that three new columns have been added to the DataFrame, each with values calculated from the existing values in the points column.
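If you're using a newer version of PySpark, you can also add several derived columns in a single call rather than chaining withColumn. Here is a minimal sketch of that alternative, assuming PySpark 3.3 or later (the version in which the withColumns method was added):

#add three new columns based on the 'points' column in one call (requires PySpark 3.3+)
df = df.withColumns({'points2': df.points*2,
                     'points3': df.points*3,
                     'points_half': df.points/2})

This produces the same result as the chained withColumn calls shown above.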
Additional Resources
The following tutorials explain how to perform other common tasks in PySpark:
PySpark: How to Add New Column with Constant Value
PySpark: How to Add Column from Another DataFrame
PySpark: How to Print One Column of a DataFrame