PySpark: How to Conditionally Replace Value in Column


You can use the following syntax to conditionally replace the value in one column of a PySpark DataFrame based on the value in another column:

from pyspark.sql.functions import when

df_new = df.withColumn('points', when(df['conference']=='West', 0).otherwise(df['points']))

This particular example replaces the existing value in the points column with a value of 0 for each row where the corresponding value in the conference column is equal to “West.”

The following examples show how to use this syntax in practice.

Example: How to Conditionally Replace Value in Column of PySpark DataFrame

Suppose we have the following PySpark DataFrame that contains information about various basketball players:

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

#define data
data = [['A', 'East', 11], 
        ['A', 'East', 8], 
        ['A', 'East', 10], 
        ['B', 'West', 6], 
        ['B', 'West', 6], 
        ['C', 'East', 5]] 
  
#define column names
columns = ['team', 'conference', 'points'] 
  
#create dataframe using data and column names
df = spark.createDataFrame(data, columns) 
  
#view dataframe
df.show()

+----+----------+------+
|team|conference|points|
+----+----------+------+
|   A|      East|    11|
|   A|      East|     8|
|   A|      East|    10|
|   B|      West|     6|
|   B|      West|     6|
|   C|      East|     5|
+----+----------+------+

We can use the following syntax to replace the existing value in the points column with a value of 0 for each row where the corresponding value in the conference column is equal to “West.”

from pyspark.sql.functions import when

#replace value in points column with 0 if value in conference column is 'West'
df_new = df.withColumn('points', when(df['conference']=='West', 0).otherwise(df['points']))

#view new DataFrame
df_new.show()

+----+----------+------+
|team|conference|points|
+----+----------+------+
|   A|      East|    11|
|   A|      East|     8|
|   A|      East|    10|
|   B|      West|     0|
|   B|      West|     0|
|   C|      East|     5|
+----+----------+------+

Notice that the existing values in the points column have been replaced in the two rows where the value in the conference column is equal to “West.”

All other values in the points column have been left unchanged.

Note: You can find the complete documentation for the PySpark when function here.

Additional Resources

The following tutorials explain how to perform other common tasks in PySpark:

PySpark: How to Replace Zero with Null
PySpark: How to Replace String in Column
PySpark: How to Check Data Type of Columns in DataFrame

Featured Posts

Leave a Reply

Your email address will not be published. Required fields are marked *