How to Add a Count Column to PySpark DataFrame

You can use the following syntax to add a ‘count’ column to a PySpark DataFrame:

import pyspark.sql.functions as F
from pyspark.sql import Window

#define window partitioned by the team column
w = Window.partitionBy('team')

#add count column to DataFrame
df_new = df.select('team', 'points', F.count('team').over(w).alias('n'))

This particular example adds a new column named n that shows how many times each value in the team column occurs.
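
If you would rather keep every existing column without listing each one in select, the same window count can be appended with withColumn instead. The following is a minimal equivalent sketch, assuming the same df as above:

import pyspark.sql.functions as F
from pyspark.sql import Window

#same window as above: one partition per team
w = Window.partitionBy('team')

#withColumn appends n while keeping all existing columns
df_new = df.withColumn('n', F.count('team').over(w))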

The following example shows how to use this syntax in practice.

Example: Add Count Column to PySpark DataFrame

Suppose we have the following PySpark DataFrame that contains information about points scored by various basketball players:

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

#define data
data = [['A', 11], 
        ['A', 8], 
        ['A', 10], 
        ['B', 6], 
        ['B', 6], 
        ['C', 5],
        ['C', 15],
        ['C', 31],
        ['D', 24]]
  
#define column names
columns = ['team', 'points'] 
  
#create dataframe using data and column names
df = spark.createDataFrame(data, columns) 
  
#view dataframe
df.show()

+----+------+
|team|points|
+----+------+
|   A|    11|
|   A|     8|
|   A|    10|
|   B|     6|
|   B|     6|
|   C|     5|
|   C|    15|
|   C|    31|
|   D|    24|
+----+------+

Suppose we would like to add a new column that displays the count of each unique value in the team column.

We can use the following syntax to do so:

import pyspark.sql.functions as F
from pyspark.sql import Window

#define window partitioned by the team column
w = Window.partitionBy('team')

#add count column to DataFrame
df_new = df.select('team', 'points', F.count('team').over(w).alias('n'))

#view new DataFrame
df_new.show()

+----+------+---+
|team|points|  n|
+----+------+---+
|   A|    11|  3|
|   A|     8|  3|
|   A|    10|  3|
|   B|     6|  2|
|   B|     6|  2|
|   C|     5|  3|
|   C|    15|  3|
|   C|    31|  3|
|   D|    24|  1|
+----+------+---+

Here is how to interpret the output:

  • The value ‘A’ occurs 3 times in the team column. Thus, a value of 3 is assigned to each row in the n column where the team is A.
  • The value ‘B’ occurs 2 times in the team column. Thus, a value of 2 is assigned to each row in the n column where the team is B.

And so on.
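
Note that the window function approach is not the only way to get this result. As a rough alternative sketch (using the same df as above, with the hypothetical name df_alt for the result), you can compute the counts with groupBy and join them back to the original DataFrame:

import pyspark.sql.functions as F

#alternative sketch: aggregate row counts per team
counts = df.groupBy('team').agg(F.count('*').alias('n'))

#left join attaches each team's count to every original row
df_alt = df.join(counts, on='team', how='left')

#contains the same rows as df_new (row order may differ)
df_alt.show()

Both approaches shuffle the data by team; the window version simply avoids writing an explicit join.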

Additional Resources

The following tutorials explain how to perform other common tasks in PySpark:

PySpark: How to Use “OR” Operator
PySpark: How to Use “AND” Operator
PySpark: How to Use “NOT IN” Operator
