You can use the following syntax to use the **withColumn()** function in PySpark with **IF ELSE** logic:

```python
from pyspark.sql.functions import when

#create new column that contains 'Good' or 'Bad' based on value in points column
df_new = df.withColumn('rating', when(df.points>20, 'Good').otherwise('Bad'))
```

This particular example creates a new column named **rating** that returns ‘Good’ if the value in the **points** column is greater than 20 and ‘Bad’ otherwise.

The following example shows how to use this syntax in practice.

**Example: How to Use withColumn() with IF ELSE in PySpark**

Suppose we have the following PySpark DataFrame that contains information about points scored by basketball players on various teams:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

#define data
data = [['Mavs', 18],
        ['Nets', 33],
        ['Lakers', 12],
        ['Kings', 15],
        ['Hawks', 19],
        ['Wizards', 24],
        ['Magic', 28],
        ['Jazz', 40],
        ['Thunder', 24],
        ['Spurs', 13]]

#define column names
columns = ['team', 'points']

#create dataframe using data and column names
df = spark.createDataFrame(data, columns)

#view dataframe
df.show()
```

```
+-------+------+
|   team|points|
+-------+------+
|   Mavs|    18|
|   Nets|    33|
| Lakers|    12|
|  Kings|    15|
|  Hawks|    19|
|Wizards|    24|
|  Magic|    28|
|   Jazz|    40|
|Thunder|    24|
|  Spurs|    13|
+-------+------+
```

We can use the following syntax to create a new column named **rating** that returns ‘Good’ if the value in the **points** column is greater than 20 and ‘Bad’ otherwise:

```python
from pyspark.sql.functions import when

#create new column that contains 'Good' or 'Bad' based on value in points column
df_new = df.withColumn('rating', when(df.points>20, 'Good').otherwise('Bad'))

#view new DataFrame
df_new.show()
```

```
+-------+------+------+
|   team|points|rating|
+-------+------+------+
|   Mavs|    18|   Bad|
|   Nets|    33|  Good|
| Lakers|    12|   Bad|
|  Kings|    15|   Bad|
|  Hawks|    19|   Bad|
|Wizards|    24|  Good|
|  Magic|    28|  Good|
|   Jazz|    40|  Good|
|Thunder|    24|  Good|
|  Spurs|    13|   Bad|
+-------+------+------+
```

The new **rating** column now displays either ‘Good’ or ‘Bad’ based on the corresponding value in the **points** column.

For example:

- The value of **points** in the first row is not greater than 20, so the **rating** column returns **Bad**.
- The value of **points** in the second row is greater than 20, so the **rating** column returns **Good**.

And so on.

Note that you could also return numeric values if you’d like.

For example, you can use the following syntax to create a new column named **rating** that returns 1 if the value in the **points** column is greater than 20 and 0 otherwise:

```python
from pyspark.sql.functions import when

#create new column that contains 1 or 0 based on value in points column
df_new = df.withColumn('rating', when(df.points>20, 1).otherwise(0))

#view new DataFrame
df_new.show()
```

```
+-------+------+------+
|   team|points|rating|
+-------+------+------+
|   Mavs|    18|     0|
|   Nets|    33|     1|
| Lakers|    12|     0|
|  Kings|    15|     0|
|  Hawks|    19|     0|
|Wizards|    24|     1|
|  Magic|    28|     1|
|   Jazz|    40|     1|
|Thunder|    24|     1|
|  Spurs|    13|     0|
+-------+------+------+
```

We can see that the new **rating** column now contains either 0 or 1.

**Note**: You can find the complete documentation for the PySpark **withColumn** function here.

**Additional Resources**

The following tutorials explain how to perform other common tasks in PySpark:

PySpark: How to Filter by Boolean Column

PySpark: Create Boolean Column Based on Condition

PySpark: How to Convert String to Integer