How to Count by Group in PySpark (With Examples)


You can use the following methods to count values by group in a PySpark DataFrame:

Method 1: Count Values Grouped by One Column

df.groupBy('col1').count().show()

Method 2: Count Values Grouped by Multiple Columns

df.groupBy('col1', 'col2').count().show() 

The following examples show how to use each method in practice with the following PySpark DataFrame that contains information about various basketball players:

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

#define data
data = [['A', 'Guard', 11], 
        ['A', 'Guard', 8], 
        ['A', 'Forward', 22], 
        ['A', 'Forward', 22], 
        ['B', 'Guard', 14], 
        ['B', 'Guard', 14],
        ['B', 'Guard', 13],
        ['B', 'Forward', 7],
        ['C', 'Guard', 8],
        ['C', 'Forward', 5]] 
  
#define column names
columns = ['team', 'position', 'points'] 
  
#create dataframe using data and column names
df = spark.createDataFrame(data, columns) 
  
#view dataframe
df.show()

+----+--------+------+
|team|position|points|
+----+--------+------+
|   A|   Guard|    11|
|   A|   Guard|     8|
|   A| Forward|    22|
|   A| Forward|    22|
|   B|   Guard|    14|
|   B|   Guard|    14|
|   B|   Guard|    13|
|   B| Forward|     7|
|   C|   Guard|     8|
|   C| Forward|     5|
+----+--------+------+

Example 1: Count Values Grouped by One Column

We can use the following syntax to count the number of rows in the DataFrame grouped by the values in the team column:

#count number of values by team
df.groupBy('team').count().show()

+----+-----+
|team|count|
+----+-----+
|   A|    4|
|   B|    4|
|   C|    2|
+----+-----+

From the output we can see:

  • There are 4 players on team A
  • There are 4 players on team B
  • There are 2 players on team C

Example 2: Count Values Grouped by Multiple Columns

We can use the following syntax to count the number of rows in the DataFrame grouped by the values in the team and position columns:

#count number of values by team and position
df.groupBy('team', 'position').count().show()

+----+--------+-----+
|team|position|count|
+----+--------+-----+
|   A|   Guard|    2|
|   A| Forward|    2|
|   B|   Guard|    3|
|   B| Forward|    1|
|   C| Forward|    1|
|   C|   Guard|    1|
+----+--------+-----+

From the output we can see:

  • There are 2 players on team A with a position of Guard.
  • There are 2 players on team A with a position of Forward.
  • There are 3 players on team B with a position of Guard.

And so on.

Additional Resources

The following tutorials explain how to perform other common tasks in PySpark:

PySpark: How to Use “OR” Operator
PySpark: How to Use “AND” Operator
PySpark: How to Use “NOT IN” Operator

Featured Posts

Leave a Reply

Your email address will not be published. Required fields are marked *