You can use the following syntax to calculate the percentage of total rows that each group represents in a PySpark DataFrame:

```python
import pyspark.sql.functions as F

#calculate total rows in DataFrame
n = df.count()

#calculate percent of total rows for each team
df.groupBy('team').count().withColumn('team_percent', (F.col('count')/n)*100).show()
```

This particular example counts the number of occurrences for each unique value in the **team** column and then calculates the percentage of total rows that each unique value represents.

The following example shows how to use this syntax in practice.

**Example: Calculate Percentage of Total with groupBy in PySpark**

Suppose we have the following PySpark DataFrame that contains information about the points scored by various basketball players:

```python
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

#define data
data = [['A', 'Guard', 11],
        ['A', 'Guard', 8],
        ['A', 'Forward', 22],
        ['A', 'Forward', 22],
        ['B', 'Guard', 14],
        ['B', 'Guard', 14],
        ['B', 'Forward', 13],
        ['C', 'Forward', 7]]

#define column names
columns = ['team', 'position', 'points']

#create dataframe using data and column names
df = spark.createDataFrame(data, columns)

#view dataframe
df.show()

+----+--------+------+
|team|position|points|
+----+--------+------+
|   A|   Guard|    11|
|   A|   Guard|     8|
|   A| Forward|    22|
|   A| Forward|    22|
|   B|   Guard|    14|
|   B|   Guard|    14|
|   B| Forward|    13|
|   C| Forward|     7|
+----+--------+------+
```

We can use the following syntax to count the number of occurrences of each unique value in the **team** column and then calculate the percentage of total rows that each unique team value represents:

```python
import pyspark.sql.functions as F

#calculate total rows in DataFrame
n = df.count()

#calculate percent of total rows for each team
df.groupBy('team').count().withColumn('team_percent', (F.col('count')/n)*100).show()

+----+-----+------------+
|team|count|team_percent|
+----+-----+------------+
|   A|    4|        50.0|
|   B|    3|        37.5|
|   C|    1|        12.5|
+----+-----+------------+
```

The **team_percent** column shows the percentage of total rows represented by each unique team.

For example, there are 8 total rows in the DataFrame.

From the **team_percent** column, we can see:

- There are 4 occurrences of team **A**, which represents 4/8 = **50%** of the total rows.
- There are 3 occurrences of team **B**, which represents 3/8 = **37.5%** of the total rows.
- There is 1 occurrence of team **C**, which represents 1/8 = **12.5%** of the total rows.

**Note**: You can find the complete documentation for the PySpark **groupBy** function here.

**Additional Resources**

The following tutorials explain how to perform other common tasks in PySpark:

PySpark: How to Use Groupby Agg on Multiple Columns

PySpark: How to Use groupBy with Count Distinct

PySpark: How to Use Alias After Groupby Count