You can use the following syntax to create a column in a PySpark DataFrame only if it doesn’t already exist:
import pyspark.sql.functions as F
#add 'points' column to DataFrame if it doesn't already exist
if 'points' not in df.columns:
    df = df.withColumn('points', F.lit('100'))
This particular example attempts to create a column named points and assign a value of 100 to each row in the column, only if a column named points doesn’t already exist.
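If the same check is needed in several places, it can be wrapped in a small helper. The sketch below is one possible way to do that and is not part of PySpark itself: the function name with_column_if_missing and its parameters are hypothetical. It relies only on the DataFrame's columns attribute and withColumn method, and it expects the caller to supply the column expression (for example F.lit(100)).

```python
def with_column_if_missing(df, name, column):
    """Return df with `name` added from `column`, or df unchanged if `name` exists.

    Hypothetical helper (not a PySpark API). `df` is expected to be a PySpark
    DataFrame and `column` a Column expression such as F.lit(100).
    """
    # df.columns is a plain Python list of column-name strings
    if name in df.columns:
        return df
    return df.withColumn(name, column)


# Example usage, assuming `df` is an existing PySpark DataFrame:
# import pyspark.sql.functions as F
# df = with_column_if_missing(df, 'assists', F.lit(100))
```

Returning the original DataFrame when the column is present keeps the call safe to repeat, since PySpark DataFrames are immutable and withColumn always returns a new one.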
The following example shows how to use this syntax in practice.
Example: How to Create Column If It Doesn’t Exist in PySpark
Suppose we have the following PySpark DataFrame with two columns named team and points:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

#define data
data = [['Mavs', 18],
        ['Nets', 33],
        ['Lakers', 12],
        ['Kings', 15],
        ['Hawks', 19],
        ['Wizards', 24],
        ['Magic', 28],
        ['Jazz', 40],
        ['Thunder', 24],
        ['Spurs', 13]]

#define column names
columns = ['team', 'points']

#create dataframe using data and column names
df = spark.createDataFrame(data, columns)

#view dataframe
df.show()

+-------+------+
|   team|points|
+-------+------+
|   Mavs|    18|
|   Nets|    33|
| Lakers|    12|
|  Kings|    15|
|  Hawks|    19|
|Wizards|    24|
|  Magic|    28|
|   Jazz|    40|
|Thunder|    24|
|  Spurs|    13|
+-------+------+
Suppose we use the following syntax to attempt to add a new column named points:
import pyspark.sql.functions as F
#add 'points' column to DataFrame if it doesn't already exist
if 'points' not in df.columns:
    df = df.withColumn('points', F.lit('100'))
#view updated DataFrame
df.show()
+-------+------+
|   team|points|
+-------+------+
|   Mavs|    18|
|   Nets|    33|
| Lakers|    12|
|  Kings|    15|
|  Hawks|    19|
|Wizards|    24|
|  Magic|    28|
|   Jazz|    40|
|Thunder|    24|
|  Spurs|    13|
+-------+------+
Since a column named points already exists in the DataFrame, a new column was not added.
The points column that already exists remained unchanged.
However, suppose we attempt to add a new column named assists if it doesn’t already exist:
import pyspark.sql.functions as F
#add 'assists' column to DataFrame if it doesn't already exist
if 'assists' not in df.columns:
    df = df.withColumn('assists', F.lit('100'))
#view updated DataFrame
df.show()
+-------+------+-------+
|   team|points|assists|
+-------+------+-------+
|   Mavs|    18|    100|
|   Nets|    33|    100|
| Lakers|    12|    100|
|  Kings|    15|    100|
|  Hawks|    19|    100|
|Wizards|    24|    100|
|  Magic|    28|    100|
|   Jazz|    40|    100|
|Thunder|    24|    100|
|  Spurs|    13|    100|
+-------+------+-------+
Since a column named assists did not already exist in the DataFrame, this new column was added to the DataFrame.
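Note that we used the lit function to assign the same literal value to each row in this new assists column. Because the value was passed as the string '100', the assists column has a string type; use F.lit(100) instead if you want an integer column.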
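One caveat with this check: the expression 'assists' not in df.columns is an exact, case-sensitive comparison against a Python list of strings, while Spark resolves column names case-insensitively under its default setting (spark.sql.caseSensitive is false). With that default, checking for 'Points' would report the column as missing even though points exists, and the subsequent withColumn call may then overwrite the existing column rather than add a new one. The sketch below shows one way to make the check case-insensitive. The helper name has_column is hypothetical, and it works on a plain list of names so it can be tried without a Spark session.

```python
def has_column(columns, name, case_sensitive=False):
    """Return True if `name` is in `columns`, optionally ignoring case.

    Hypothetical helper (not a PySpark API); `columns` is a list of
    column-name strings such as df.columns.
    """
    if case_sensitive:
        return name in columns
    lowered = {c.lower() for c in columns}
    return name.lower() in lowered


columns = ['team', 'points']  # same names as df.columns in this example

print('Points' in columns)             # False: exact match only
print(has_column(columns, 'Points'))   # True: case-insensitive match
print(has_column(columns, 'assists'))  # False: column is genuinely missing
```

In the tutorial's pattern, the condition would then read if not has_column(df.columns, 'assists'): before calling withColumn.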