You can use the following syntax to perform data binning in a PySpark DataFrame:

```python
from pyspark.ml.feature import Bucketizer

#specify bin ranges and column to bin
bucketizer = Bucketizer(splits=[0, 5, 10, 15, 20, float('Inf')],
                        inputCol='points', outputCol='bins')

#perform binning based on values in 'points' column
df_bins = bucketizer.setHandleInvalid('keep').transform(df)
```

This particular example adds a new column to the DataFrame named **bins** that takes on the following values:

- **0** if the value in the points column is in the range [0, 5)
- **1** if the value in the points column is in the range [5, 10)
- **2** if the value in the points column is in the range [10, 15)
- **3** if the value in the points column is in the range [15, 20)
- **4** if the value in the points column is in the range [20, Infinity)
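The mapping above can be sketched in plain Python using `bisect` (this is only an illustration of the split semantics, not how Spark implements it): each value is assigned the index of the last split that is less than or equal to it, so boundary values fall into the bin that starts at that boundary.

```python
from bisect import bisect_right

def assign_bin(value, splits):
    """Illustrative sketch of Bucketizer's bin assignment:
    splits [s0, s1, ..., sn] define bins [s0, s1), [s1, s2), ...
    A None value passes through as None, matching the 'keep' output shown below.
    """
    if value is None:
        return None
    # bisect_right returns the insertion point, so a value sitting exactly
    # on a split boundary lands in the bin that begins at that boundary
    return bisect_right(splits, value) - 1

splits = [0, 5, 10, 15, 20, float('inf')]
assign_bin(3, splits)   # bin 0: [0, 5)
assign_bin(9, splits)   # bin 1: [5, 10)
assign_bin(15, splits)  # bin 3: [15, 20)
assign_bin(22, splits)  # bin 4: [20, Infinity)
```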

The following example shows how to use this function in practice.

**Example: How to Perform Data Binning in PySpark**

Suppose we have the following PySpark DataFrame that contains information about points scored by basketball players on various teams:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

#define data
data = [['A', 3], ['B', 8], ['C', 9], ['D', 9], ['E', 12],
        ['F', None], ['G', 15], ['H', 17], ['I', 19], ['J', 22]]

#define column names
columns = ['player', 'points']

#create dataframe using data and column names
df = spark.createDataFrame(data, columns)

#view dataframe
df.show()

+------+------+
|player|points|
+------+------+
|     A|     3|
|     B|     8|
|     C|     9|
|     D|     9|
|     E|    12|
|     F|  null|
|     G|    15|
|     H|    17|
|     I|    19|
|     J|    22|
+------+------+
```

We can use the following syntax to bin each of the values in the **points** column based on specific bin ranges:

```python
from pyspark.ml.feature import Bucketizer

#specify bin ranges and column to bin
bucketizer = Bucketizer(splits=[0, 5, 10, 15, 20, float('Inf')],
                        inputCol='points', outputCol='bins')

#perform binning based on values in 'points' column
df_bins = bucketizer.setHandleInvalid('keep').transform(df)

#view new DataFrame
df_bins.show()

+------+------+----+
|player|points|bins|
+------+------+----+
|     A|     3| 0.0|
|     B|     8| 1.0|
|     C|     9| 1.0|
|     D|     9| 1.0|
|     E|    12| 2.0|
|     F|  null|null|
|     G|    15| 3.0|
|     H|    17| 3.0|
|     I|    19| 3.0|
|     J|    22| 4.0|
+------+------+----+
```

The **bins** column now displays a value of 0, 1, 2, 3, 4 or null based on the corresponding value in the **points** column.

Note that the argument **setHandleInvalid('keep')** specifies that any invalid values, such as nulls, should be kept rather than dropped; as the output shows, the null in the **points** column simply carries through as a null in the **bins** column.

We could also specify **setHandleInvalid('skip')** to simply remove rows with invalid values from the DataFrame:

```python
from pyspark.ml.feature import Bucketizer

#specify bin ranges and column to bin
bucketizer = Bucketizer(splits=[0, 5, 10, 15, 20, float('Inf')],
                        inputCol='points', outputCol='bins')

#perform binning based on values in 'points' column, remove invalid values
df_bins = bucketizer.setHandleInvalid('skip').transform(df)

#view new DataFrame
df_bins.show()

+------+------+----+
|player|points|bins|
+------+------+----+
|     A|     3| 0.0|
|     B|     8| 1.0|
|     C|     9| 1.0|
|     D|     9| 1.0|
|     E|    12| 2.0|
|     G|    15| 3.0|
|     H|    17| 3.0|
|     I|    19| 3.0|
|     J|    22| 4.0|
+------+------+----+
```

Notice that the row that contained **null** in the **points** column has simply been removed.
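The difference between the two modes can be sketched in plain Python (a hypothetical simulation of the behavior shown above, not the Spark API): `'keep'` retains rows with null points and assigns them a null bin, while `'skip'` drops those rows entirely.

```python
from bisect import bisect_right

splits = [0, 5, 10, 15, 20, float('inf')]
data = [('A', 3), ('F', None), ('J', 22)]  # small subset of the example rows

def bin_rows(rows, handle_invalid='keep'):
    """Hypothetical sketch of 'keep' vs 'skip' handling for null values."""
    out = []
    for player, points in rows:
        if points is None:
            if handle_invalid == 'skip':
                continue  # drop the row, like setHandleInvalid('skip')
            out.append((player, points, None))  # keep the row with a null bin
        else:
            out.append((player, points, float(bisect_right(splits, points) - 1)))
    return out

bin_rows(data, 'keep')  # row F kept with a null bin
bin_rows(data, 'skip')  # row F removed
```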

**Note**: You can find the complete documentation for the PySpark **Bucketizer** class here.

**Additional Resources**

The following tutorials explain how to perform other common tasks in PySpark:

PySpark: How to Check Data Type of Columns in DataFrame

PySpark: How to Drop Multiple Columns from DataFrame

PySpark: How to Drop Duplicate Rows from DataFrame