You can use the following syntax to perform data binning in a PySpark DataFrame:
from pyspark.ml.feature import Bucketizer

#specify bin ranges and column to bin
bucketizer = Bucketizer(splits=[0, 5, 10, 15, 20, float('Inf')],
                        inputCol='points',
                        outputCol='bins')

#perform binning based on values in 'points' column
df_bins = bucketizer.setHandleInvalid('keep').transform(df)
This particular example adds a new column named bins to the DataFrame that takes on the following values:
- 0 if the value in the points column is in the range [0,5)
- 1 if the value in the points column is in the range [5,10)
- 2 if the value in the points column is in the range [10,15)
- 3 if the value in the points column is in the range [15,20)
- 4 if the value in the points column is in the range [20,Infinity)
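Note that each bin is a half-open interval, so a value that falls exactly on one of the split points lands in the bin that begins at that split. If you want to verify this yourself, here is a minimal sketch (using a small throwaway DataFrame, not the one in the example below) that buckets values sitting exactly on the splits:

from pyspark.sql import SparkSession
from pyspark.ml.feature import Bucketizer

spark = SparkSession.builder.getOrCreate()

#throwaway DataFrame whose values sit exactly on the split points
edge_df = spark.createDataFrame([(5.0,), (10.0,), (20.0,)], ['points'])

#same splits as above: each bin is [lower, upper), so 5.0 -> 1.0, 10.0 -> 2.0, 20.0 -> 4.0
bucketizer = Bucketizer(splits=[0, 5, 10, 15, 20, float('Inf')],
                        inputCol='points',
                        outputCol='bins')

bucketizer.transform(edge_df).show()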
The following example shows how to use this syntax in practice.
Example: How to Perform Data Binning in PySpark
Suppose we have the following PySpark DataFrame that contains information about points scored by basketball players on various teams:
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

#define data
data = [['A', 3],
        ['B', 8],
        ['C', 9],
        ['D', 9],
        ['E', 12],
        ['F', None],
        ['G', 15],
        ['H', 17],
        ['I', 19],
        ['J', 22]]

#define column names
columns = ['player', 'points']

#create dataframe using data and column names
df = spark.createDataFrame(data, columns)

#view dataframe
df.show()

+------+------+
|player|points|
+------+------+
|     A|     3|
|     B|     8|
|     C|     9|
|     D|     9|
|     E|    12|
|     F|  null|
|     G|    15|
|     H|    17|
|     I|    19|
|     J|    22|
+------+------+
We can use the following syntax to bin each of the values in the points column based on specific bin ranges:
from pyspark.ml.feature import Bucketizer

#specify bin ranges and column to bin
bucketizer = Bucketizer(splits=[0, 5, 10, 15, 20, float('Inf')],
                        inputCol='points',
                        outputCol='bins')

#perform binning based on values in 'points' column
df_bins = bucketizer.setHandleInvalid('keep').transform(df)

#view new DataFrame
df_bins.show()

+------+------+----+
|player|points|bins|
+------+------+----+
|     A|     3| 0.0|
|     B|     8| 1.0|
|     C|     9| 1.0|
|     D|     9| 1.0|
|     E|    12| 2.0|
|     F|  null|null|
|     G|    15| 3.0|
|     H|    17| 3.0|
|     I|    19| 3.0|
|     J|    22| 4.0|
+------+------+----+
The bins column now displays a value of 0, 1, 2, 3, 4 or null based on the corresponding value in the points column.
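If you prefer readable labels instead of numeric bin indices, one option is to map each index to a label with when(). This is just a sketch; the label strings (for example '0-4' and '20+') are hypothetical names chosen here, not anything produced by Bucketizer itself:

from pyspark.sql import functions as F

#map each numeric bin index to a hypothetical text label; null bins stay null
df_labeled = df_bins.withColumn('bin_label',
    F.when(F.col('bins') == 0, '0-4')
     .when(F.col('bins') == 1, '5-9')
     .when(F.col('bins') == 2, '10-14')
     .when(F.col('bins') == 3, '15-19')
     .when(F.col('bins') == 4, '20+'))

#view DataFrame with labels
df_labeled.show()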
Note that calling setHandleInvalid('keep') specifies that any invalid values, such as nulls, should be kept and placed into their own bin.
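To see how many rows landed in each bin, including the row kept with a null bin value, a simple groupBy works. A quick sketch using the df_bins DataFrame created above:

#count the number of players in each bin; the null group holds the kept invalid row
df_bins.groupBy('bins').count().orderBy('bins').show()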
We could instead specify setHandleInvalid('skip') to simply remove rows with invalid values from the DataFrame:
from pyspark.ml.feature import Bucketizer

#specify bin ranges and column to bin
bucketizer = Bucketizer(splits=[0, 5, 10, 15, 20, float('Inf')],
                        inputCol='points',
                        outputCol='bins')

#perform binning based on values in 'points' column, remove invalid values
df_bins = bucketizer.setHandleInvalid('skip').transform(df)

#view new DataFrame
df_bins.show()

+------+------+----+
|player|points|bins|
+------+------+----+
|     A|     3| 0.0|
|     B|     8| 1.0|
|     C|     9| 1.0|
|     D|     9| 1.0|
|     E|    12| 2.0|
|     G|    15| 3.0|
|     H|    17| 3.0|
|     I|    19| 3.0|
|     J|    22| 4.0|
+------+------+----+
Notice that the row that contained null in the points column has simply been removed.
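The Bucketizer documentation also lists a third option, setHandleInvalid('error'), which is the default behavior and raises an error when an invalid value is encountered. A minimal sketch of what that would look like with this DataFrame:

from pyspark.ml.feature import Bucketizer

#specify bin ranges and column to bin
bucketizer = Bucketizer(splits=[0, 5, 10, 15, 20, float('Inf')],
                        inputCol='points',
                        outputCol='bins')

#with 'error', the null points value for player F causes an exception when the result is evaluated
df_bins = bucketizer.setHandleInvalid('error').transform(df)
df_bins.show()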
Note: You can find the complete documentation for the PySpark Bucketizer function here.
Additional Resources
The following tutorials explain how to perform other common tasks in PySpark:
PySpark: How to Check Data Type of Columns in DataFrame
PySpark: How to Drop Multiple Columns from DataFrame
PySpark: How to Drop Duplicate Rows from DataFrame