PySpark: Calculate Percentage of Total with groupBy


You can use the following syntax to calculate the percentage of total rows that each group represents in a PySpark DataFrame:

#calculate total rows in DataFrame
n = df.count()

#calculate percent of total rows for each team
df.groupBy('team').count().withColumn('team_percent', (F.col('count')/n)*100).show()

This particular example counts the number of occurrences for each unique value in the team column and then calculates the percentage of total rows that each unique value represents.

The following example shows how to use this syntax in practice.

Example: Calculate Percentage of Total with groupBy in PySpark

Suppose we have the following PySpark DataFrame that contains information about the points scored by various basketball players:

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

#define data
data = [['A', 'Guard', 11], 
        ['A', 'Guard', 8], 
        ['A', 'Forward', 22], 
        ['A', 'Forward', 22], 
        ['B', 'Guard', 14], 
        ['B', 'Guard', 14],
        ['B', 'Forward', 13],
        ['C', 'Forward', 7]] 
  
#define column names
columns = ['team', 'position', 'points'] 
  
#create dataframe using data and column names
df = spark.createDataFrame(data, columns) 
  
#view dataframe
df.show()

+----+--------+------+
|team|position|points|
+----+--------+------+
|   A|   Guard|    11|
|   A|   Guard|     8|
|   A| Forward|    22|
|   A| Forward|    22|
|   B|   Guard|    14|
|   B|   Guard|    14|
|   B| Forward|    13|
|   C| Forward|     7|
+----+--------+------+

We can use the following syntax to count the number of occurrences if each unique value in the team column and then calculate the percentage of total rows that each unique team value represents:

#calculate total rows in DataFrame
n = df.count()

#calculate percent of total rows for each team
df.groupBy('team').count().withColumn('team_percent', (F.col('count')/n)*100).show()

+----+-----+------------+
|team|count|team_percent|
+----+-----+------------+
|   A|    4|        50.0|
|   B|    3|        37.5|
|   C|    1|        12.5|
+----+-----+------------+

The team_percent column shows the percentage of total rows represented by each unique team.

For example, there are 8 total rows in the DataFrame.

From the team_percent column, we can see:

  • There are 4 occurrences of team A, which represents 4/8 = 50% of the total rows.
  • There are 3 occurrences of team B, which represents 3/8 = 37.5% of the total rows.
  • There is 1 occurrence of team C, which represents 1/8 = 12.5% of the total rows.

Note: You can find the complete documentation for the PySpark groupBy function here.

Additional Resources

The following tutorials explain how to perform other common tasks in PySpark:

PySpark: How to Use Groupby Agg on Multiple Columns
PySpark: How to Use groupBy with Count Distinct
PySpark: How to Use Alias After Groupby Count

Leave a Reply

Your email address will not be published. Required fields are marked *