PySpark: How to Calculate Minimum Value Across Columns


You can use the following syntax to calculate the minimum value across multiple columns in a PySpark DataFrame:

from pyspark.sql.functions import least

#find minimum value across columns 'game1', 'game2', and 'game3'
df_new = df.withColumn('min', least('game1', 'game2', 'game3'))

This particular example creates a new column called min that contains the minimum of values across the game1, game2 and game3 columns in the DataFrame.

The following example shows how to use this syntax in practice.

Example: How to Calculate Min Value Across Columns in PySpark

Suppose we have the following PySpark DataFrame that contains information about points scored by various basketball teams during three different games:

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

#define data
data = [['Mavs', 25, 11, 10], 
        ['Nets', 22, 8, 14], 
        ['Hawks', 14, 22, 10], 
        ['Kings', 30, 22, 35], 
        ['Bulls', 15, 14, 12], 
        ['Blazers', 10, 14, 18]] 
  
#define column names
columns = ['team', 'game1', 'game2', 'game3'] 
  
#create dataframe using data and column names
df = spark.createDataFrame(data, columns) 
  
#view dataframe
df.show()

+-------+-----+-----+-----+
|   team|game1|game2|game3|
+-------+-----+-----+-----+
|   Mavs|   25|   11|   10|
|   Nets|   22|    8|   14|
|  Hawks|   14|   22|   10|
|  Kings|   30|   22|   35|
|  Bulls|   15|   14|   12|
|Blazers|   10|   14|   18|
+-------+-----+-----+-----+

Suppose we would like to add a new column called min that contains the minimum number of points scored by each team across all three games.

We can use the following syntax to do so:

from pyspark.sql.functions import least

#find minimum value across columns 'game1', 'game2', and 'game3'
df_new = df.withColumn('min', least('game1', 'game2', 'game3'))

#view new DataFrame
df_new.show()

+-------+-----+-----+-----+---+
|   team|game1|game2|game3|min|
+-------+-----+-----+-----+---+
|   Mavs|   25|   11|   10| 10|
|   Nets|   22|    8|   14|  8|
|  Hawks|   14|   22|   10| 10|
|  Kings|   30|   22|   35| 22|
|  Bulls|   15|   14|   12| 12|
|Blazers|   10|   14|   18| 10|
+-------+-----+-----+-----+---+

Notice that the new min column contains the minimum of values across the game1, game2 and game3 columns.

For example:

  • The minimum number of points for the Mavs is 10
  • The minimum number of points for the Nets is 8
  • The minimum number of points for the Hawks is 10

And so on.

Note that we used the withColumn function to return a new DataFrame with the min column added and all other columns left the same.

The complete documentation for the withColumn function is available in the official PySpark API reference.

Additional Resources

The following tutorials explain how to perform other common tasks in PySpark:

How to Calculate the Mean of a Column in PySpark
How to Calculate Mean of Multiple Columns in PySpark
How to Sum Multiple Columns in PySpark
