PySpark: How to Create Column If It Doesn’t Exist


You can use the following syntax to create a column in a PySpark DataFrame only if it doesn’t already exist:

import pyspark.sql.functions as F

#add 'points' column to DataFrame if it doesn't already exist
if 'points' not in df.columns:
    df = df.withColumn('points', F.lit(100))

This example creates a column named points with a value of 100 in every row, but only if the DataFrame doesn’t already contain a column named points. If the column already exists, the DataFrame is left unchanged.
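If the same check is needed for several columns or DataFrames, it can be wrapped in a small helper. The following is only a sketch: the name with_column_if_missing is hypothetical and not part of PySpark. It relies only on df.columns and df.withColumn, and it assumes the default is passed as a Column expression such as F.lit(100).

```python
def with_column_if_missing(df, name, default):
    """Return df unchanged if `name` exists, else df with `name` set to `default`.

    `default` is expected to be a PySpark Column expression, e.g. F.lit(100).
    """
    if name in df.columns:
        # column already present: leave its values untouched
        return df
    return df.withColumn(name, default)

# intended usage (requires a Spark session and an existing DataFrame df):
# import pyspark.sql.functions as F
# df = with_column_if_missing(df, 'points', F.lit(100))
```

Passing the default as a Column rather than a raw value keeps the helper usable with expressions other than constants, for example a value derived from another column.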

The following example shows how to use this syntax in practice.

Example: How to Create Column If It Doesn’t Exist in PySpark

Suppose we have the following PySpark DataFrame with two columns named team and points:

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

#define data
data = [['Mavs', 18], 
        ['Nets', 33], 
        ['Lakers', 12], 
        ['Kings', 15], 
        ['Hawks', 19],
        ['Wizards', 24],
        ['Magic', 28],
        ['Jazz', 40],
        ['Thunder', 24],
        ['Spurs', 13]]
  
#define column names
columns = ['team', 'points'] 
  
#create dataframe using data and column names
df = spark.createDataFrame(data, columns) 
  
#view dataframe
df.show()

+-------+------+
|   team|points|
+-------+------+
|   Mavs|    18|
|   Nets|    33|
| Lakers|    12|
|  Kings|    15|
|  Hawks|    19|
|Wizards|    24|
|  Magic|    28|
|   Jazz|    40|
|Thunder|    24|
|  Spurs|    13|
+-------+------+

Suppose we use the following syntax to attempt to add a new column named points:

import pyspark.sql.functions as F

#add 'points' column to DataFrame if it doesn't already exist
if 'points' not in df.columns:
    df = df.withColumn('points', F.lit(100))

#view updated DataFrame
df.show()

+-------+------+
|   team|points|
+-------+------+
|   Mavs|    18|
|   Nets|    33|
| Lakers|    12|
|  Kings|    15|
|  Hawks|    19|
|Wizards|    24|
|  Magic|    28|
|   Jazz|    40|
|Thunder|    24|
|  Spurs|    13|
+-------+------+

Since a column named points already exists in the DataFrame, a new column was not added.

The points column that already exists remained unchanged.
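One caveat, which assumes Spark's default configuration: the check 'points' not in df.columns is an ordinary case-sensitive Python string comparison, while Spark resolves column names case-insensitively unless spark.sql.caseSensitive is enabled. A check for 'Points' would therefore pass, and withColumn would then typically overwrite the existing points column instead of leaving it alone. A minimal sketch of a case-insensitive check, using a hypothetical helper named has_column:

```python
def has_column(columns, name):
    """Case-insensitive membership test for a list of column names."""
    return name.lower() in (c.lower() for c in columns)

# stands in for df.columns of the DataFrame in this example
columns = ['team', 'points']

print('Points' in columns)             # False: plain membership is case-sensitive
print(has_column(columns, 'Points'))   # True
print(has_column(columns, 'assists'))  # False
```

With a real DataFrame, the condition would become if not has_column(df.columns, 'Points').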

However, suppose we attempt to add a new column named assists if it doesn’t already exist:

import pyspark.sql.functions as F

#add 'assists' column to DataFrame if it doesn't already exist
if 'assists' not in df.columns:
    df = df.withColumn('assists', F.lit(100))

#view updated DataFrame
df.show()

+-------+------+-------+
|   team|points|assists|
+-------+------+-------+
|   Mavs|    18|    100|
|   Nets|    33|    100|
| Lakers|    12|    100|
|  Kings|    15|    100|
|  Hawks|    19|    100|
|Wizards|    24|    100|
|  Magic|    28|    100|
|   Jazz|    40|    100|
|Thunder|    24|    100|
|  Spurs|    13|    100|
+-------+------+-------+

Since a column named assists did not already exist in the DataFrame, this new column was added to the DataFrame.

Note that we used the lit function to assign a literal value of 100 to each row in this new assists column.
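The same idea extends to several columns at once. The sketch below first works out which columns are missing, using plain Python on the list of column names. The helper name missing_defaults and the rebounds column are made up for illustration and do not appear in the DataFrame above.

```python
def missing_defaults(existing, defaults):
    """Return the subset of `defaults` whose column names are not in `existing`."""
    return {name: value for name, value in defaults.items() if name not in existing}

# stands in for df.columns of the DataFrame in this example
existing = ['team', 'points']

# desired columns and the constant to use if a column is absent
defaults = {'points': 0, 'assists': 100, 'rebounds': 0}

print(missing_defaults(existing, defaults))  # {'assists': 100, 'rebounds': 0}

# with a real DataFrame, each missing column could then be added in a loop:
# for name, value in missing_defaults(df.columns, defaults).items():
#     df = df.withColumn(name, F.lit(value))
```

Because points already exists, it is skipped and its values stay as they are; only assists and rebounds would be added.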

Additional Resources

The following tutorials explain how to perform other common tasks in PySpark:

PySpark: How to Add New Column with Constant Value
PySpark: How to Add Column from Another DataFrame
PySpark: How to Print One Column of a DataFrame
