You can use the following syntax to create a boolean column based on a condition in a PySpark DataFrame:
df_new = df.withColumn('good_player', df.points>20)
This particular example creates a boolean column named good_player that returns one of two values:
- true if the value in the points column is greater than 20.
- false if the value in the points column is not greater than 20.
The following example shows how to use this syntax in practice.
Example: Create Boolean Column Based on Condition in PySpark
Suppose we have the following PySpark DataFrame that contains information about points scored by various basketball players:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

#define data
data = [['Mavs', 18], ['Nets', 33], ['Lakers', 12], ['Kings', 15],
        ['Hawks', 19], ['Wizards', 24], ['Magic', 28], ['Jazz', 40],
        ['Thunder', 24], ['Spurs', 13]]

#define column names
columns = ['team', 'points']

#create dataframe using data and column names
df = spark.createDataFrame(data, columns)

#view dataframe
df.show()

+-------+------+
|   team|points|
+-------+------+
|   Mavs|    18|
|   Nets|    33|
| Lakers|    12|
|  Kings|    15|
|  Hawks|    19|
|Wizards|    24|
|  Magic|    28|
|   Jazz|    40|
|Thunder|    24|
|  Spurs|    13|
+-------+------+
Suppose we would like to create a new boolean column that contains true if the corresponding value in the points column is greater than 20 or false otherwise.
We can use the following syntax to do so:
#create boolean column based on value in points column
df_new = df.withColumn('good_player', df.points>20)

#view new DataFrame
df_new.show()

+-------+------+-----------+
|   team|points|good_player|
+-------+------+-----------+
|   Mavs|    18|      false|
|   Nets|    33|       true|
| Lakers|    12|      false|
|  Kings|    15|      false|
|  Hawks|    19|      false|
|Wizards|    24|       true|
|  Magic|    28|       true|
|   Jazz|    40|       true|
|Thunder|    24|       true|
|  Spurs|    13|      false|
+-------+------+-----------+
The new good_player column contains either true or false based on the value in the points column.
Note: The withColumn function returns a new DataFrame with the specified column added (or replaced, if a column with that name already exists) and all other columns left unchanged.
Additional Resources
The following tutorials explain how to perform other common tasks in PySpark:
PySpark: How to Select Columns by Index in DataFrame
PySpark: How to Check Data Type of Columns in DataFrame
PySpark: How to Print One Column of a DataFrame