PySpark: How to Calculate Sum of Each Row in DataFrame


You can use the following syntax to calculate the sum of values in each row of a PySpark DataFrame:

from pyspark.sql import functions as F

#add new column that contains sum of each row
df_new = df.withColumn('row_sum', sum([F.col(c) for c in df.columns]))

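This particular example creates a new column named row_sum that contains the sum of the values across all columns in each row. Note that sum here is Python's built-in function, not F.sum (which aggregates down a column), and that every column in df.columns must be numeric.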

The following example shows how to use this syntax in practice.

Example: How to Calculate Sum of Each Row in PySpark

Suppose we have the following PySpark DataFrame that shows the number of points scored in three different games by various basketball players:

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

#define data
data = [[14, 16, 10], 
        [12, 10, 13], 
        [8, 10, 20], 
        [15, 15, 15], 
        [19, 3, 15],
        [24, 40, 23],
        [15, 12, 19],
        [10, 10, 16]]
  
#define column names
columns = ['game1', 'game2', 'game3'] 
  
#create dataframe using data and column names
df = spark.createDataFrame(data, columns) 
  
#view dataframe
df.show()

+-----+-----+-----+
|game1|game2|game3|
+-----+-----+-----+
|   14|   16|   10|
|   12|   10|   13|
|    8|   10|   20|
|   15|   15|   15|
|   19|    3|   15|
|   24|   40|   23|
|   15|   12|   19|
|   10|   10|   16|
+-----+-----+-----+

We can use the following syntax to create a new column named row_sum that contains the sum of the values in each row:

from pyspark.sql import functions as F

#add new column that contains sum of each row
df_new = df.withColumn('row_sum', sum([F.col(c) for c in df.columns]))

#view new DataFrame
df_new.show()

+-----+-----+-----+-------+
|game1|game2|game3|row_sum|
+-----+-----+-----+-------+
|   14|   16|   10|     40|
|   12|   10|   13|     35|
|    8|   10|   20|     38|
|   15|   15|   15|     45|
|   19|    3|   15|     37|
|   24|   40|   23|     87|
|   15|   12|   19|     46|
|   10|   10|   16|     36|
+-----+-----+-----+-------+

The new column named row_sum contains the sum of the values in each row.

For example:

  • The sum of values in the first row is 14 + 16 + 10 = 40.
  • The sum of values in the second row is 12 + 10 + 13 = 35.
  • The sum of values in the third row is 8 + 10 + 20 = 38.

And so on.

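Note: Because this syntax adds the columns together with the + operator, a null value in any column of a row makes row_sum null for that row. Null values are not ignored. To treat null values as zero, wrap each column in F.coalesce(F.col(c), F.lit(0)) before summing.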

Additional Resources

The following tutorials explain how to perform other common tasks in PySpark:

How to Sum Multiple Columns in PySpark
How to Sum Column Based on a Condition in PySpark
How to Calculate Sum by Group in PySpark
