PySpark: How to Use withColumn() with IF ELSE


You can use the following syntax to use the withColumn() function in PySpark with IF ELSE logic:

from pyspark.sql.functions import when

#create new column that contains 'Good' or 'Bad' based on value in points column
df_new = df.withColumn('rating', when(df.points > 20, 'Good').otherwise('Bad'))

This particular example creates a new column named rating that returns ‘Good’ if the value in the points column is greater than 20 and ‘Bad’ otherwise.
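If you prefer not to reference the DataFrame object directly in the condition, the same logic can also be written with the col() function. The following is a minimal sketch that assumes the same points column:

from pyspark.sql.functions import col, when

#equivalent condition written with col() instead of df.points
df_new = df.withColumn('rating', when(col('points') > 20, 'Good').otherwise('Bad'))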

The following example shows how to use this syntax in practice.

Example: How to Use withColumn() with IF ELSE in PySpark

Suppose we have the following PySpark DataFrame that contains information about points scored by basketball players on various teams:

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

#define data
data = [['Mavs', 18], 
        ['Nets', 33], 
        ['Lakers', 12], 
        ['Kings', 15], 
        ['Hawks', 19],
        ['Wizards', 24],
        ['Magic', 28],
        ['Jazz', 40],
        ['Thunder', 24],
        ['Spurs', 13]]
  
#define column names
columns = ['team', 'points'] 
  
#create dataframe using data and column names
df = spark.createDataFrame(data, columns) 
  
#view dataframe
df.show()

+-------+------+
|   team|points|
+-------+------+
|   Mavs|    18|
|   Nets|    33|
| Lakers|    12|
|  Kings|    15|
|  Hawks|    19|
|Wizards|    24|
|  Magic|    28|
|   Jazz|    40|
|Thunder|    24|
|  Spurs|    13|
+-------+------+

We can use the following syntax to create a new column named rating that returns ‘Good’ if the value in the points column is greater than 20 and ‘Bad’ otherwise:

from pyspark.sql.functions import when

#create new column that contains 'Good' or 'Bad' based on value in points column
df_new = df.withColumn('rating', when(df.points > 20, 'Good').otherwise('Bad'))

#view new DataFrame
df_new.show()

+-------+------+------+
|   team|points|rating|
+-------+------+------+
|   Mavs|    18|   Bad|
|   Nets|    33|  Good|
| Lakers|    12|   Bad|
|  Kings|    15|   Bad|
|  Hawks|    19|   Bad|
|Wizards|    24|  Good|
|  Magic|    28|  Good|
|   Jazz|    40|  Good|
|Thunder|    24|  Good|
|  Spurs|    13|   Bad|
+-------+------+------+

The new rating column now displays either ‘Good’ or ‘Bad’ based on the corresponding value in the points column.

For example:

  • The value of points in the first row is not greater than 20, so the rating column returns Bad.
  • The value of points in the second row is greater than 20, so the rating column returns Good.

And so on.
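The same pattern extends to IF / ELIF / ELSE logic by chaining multiple when() calls before the final otherwise(). The sketch below is only an illustration; the ‘Great’ label and the 30-point cutoff are assumptions, not part of the example above:

from pyspark.sql.functions import when

#assumed three-tier example: 'Great' above 30, 'Good' above 20, otherwise 'Bad'
df_tiers = df.withColumn('rating', when(df.points > 30, 'Great')
                                   .when(df.points > 20, 'Good')
                                   .otherwise('Bad'))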

Note that you could also return numeric values if you’d like.

For example, you can use the following syntax to create a new column named rating that returns 1 if the value in the points column is greater than 20 and 0 otherwise:

from pyspark.sql.functions import when

#create new column that contains 1 or 0 based on value in points column
df_new = df.withColumn('rating', when(df.points > 20, 1).otherwise(0))

#view new DataFrame
df_new.show()

+-------+------+------+
|   team|points|rating|
+-------+------+------+
|   Mavs|    18|     0|
|   Nets|    33|     1|
| Lakers|    12|     0|
|  Kings|    15|     0|
|  Hawks|    19|     0|
|Wizards|    24|     1|
|  Magic|    28|     1|
|   Jazz|    40|     1|
|Thunder|    24|     1|
|  Spurs|    13|     0|
+-------+------+------+

We can see that the new rating column now contains either 0 or 1.
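Also note that otherwise() supplies the ELSE branch; if you omit it, rows that don’t match the condition are filled with null instead. Here is a minimal sketch of that behavior:

from pyspark.sql.functions import when

#without otherwise(), non-matching rows receive null in the new column
df_nulls = df.withColumn('rating', when(df.points > 20, 1))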

Note: You can find the complete documentation for the PySpark withColumn function here.
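One detail worth knowing from the documentation is that withColumn() replaces a column if you pass the name of one that already exists. The sketch below overwrites the points column in place; the cap of 30 is only an illustrative assumption:

from pyspark.sql.functions import when

#passing an existing column name replaces that column rather than adding a new one
df_capped = df.withColumn('points', when(df.points > 30, 30).otherwise(df.points))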

Additional Resources

The following tutorials explain how to perform other common tasks in PySpark:

PySpark: How to Filter by Boolean Column
PySpark: Create Boolean Column Based on Condition
PySpark: How to Convert String to Integer
