You can use the following syntax to convert a column from a Boolean to an integer in PySpark:
from pyspark.sql.functions import when

#convert Boolean column to integer column
df_new = df.withColumn('int_column', when(df.bool_column==True, 1).otherwise(0))
This particular example converts the Boolean column named bool_column to an integer column named int_column.
Each of the values equal to True in the Boolean column will be shown as 1 in the integer column.
Similarly, each of the values equal to False in the Boolean column will be shown as 0 in the integer column.
The following example shows how to use this syntax in practice.
Example: Convert Boolean Column to Integer in PySpark
Suppose we have the following PySpark DataFrame that contains information about various basketball teams:
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
#define data
data = [['Mavs', 18, True],
        ['Nets', 33, True],
        ['Lakers', 12, False],
        ['Kings', 15, True],
        ['Hawks', 19, False],
        ['Wizards', 24, False],
        ['Magic', 28, True],
        ['Jazz', 40, False],
        ['Thunder', 24, False],
        ['Spurs', 13, True]]
#define column names
columns = ['team', 'points', 'playoffs']
#create dataframe using data and column names
df = spark.createDataFrame(data, columns)
#view dataframe
df.show()
+-------+------+--------+
|   team|points|playoffs|
+-------+------+--------+
|   Mavs|    18|    true|
|   Nets|    33|    true|
| Lakers|    12|   false|
|  Kings|    15|    true|
|  Hawks|    19|   false|
|Wizards|    24|   false|
|  Magic|    28|    true|
|   Jazz|    40|   false|
|Thunder|    24|   false|
|  Spurs|    13|    true|
+-------+------+--------+
The playoffs column is a Boolean column that contains the values true and false to indicate whether or not each team made the playoffs.
We can use the following syntax to create a new column called playoffs_int that converts each true value to the integer 1 and each false value to the integer 0:
from pyspark.sql.functions import when

#convert Boolean column to integer column
df_new = df.withColumn('playoffs_int', when(df.playoffs==True, 1).otherwise(0))

#view new DataFrame
df_new.show()
+-------+------+--------+------------+
|   team|points|playoffs|playoffs_int|
+-------+------+--------+------------+
|   Mavs|    18|    true|           1|
|   Nets|    33|    true|           1|
| Lakers|    12|   false|           0|
|  Kings|    15|    true|           1|
|  Hawks|    19|   false|           0|
|Wizards|    24|   false|           0|
|  Magic|    28|    true|           1|
|   Jazz|    40|   false|           0|
|Thunder|    24|   false|           0|
|  Spurs|    13|    true|           1|
+-------+------+--------+------------+
The new playoffs_int column now displays all true and false values from the playoffs column as either 1 or 0.
We can use the dtypes attribute to view the data type of each column in this new DataFrame and verify that the new column is indeed an integer column:
#display data type of each column
df_new.dtypes
[('team', 'string'),
('points', 'bigint'),
('playoffs', 'boolean'),
('playoffs_int', 'int')]
We can see that the new playoffs_int column is indeed an integer column.
Additional Resources
The following tutorials explain how to perform other common tasks in PySpark:
PySpark: How to Filter by Boolean Column
PySpark: Create Boolean Column Based on Condition
PySpark: How to Convert String to Integer