You can use the following syntax to give an alias to the “count” column after performing a groupBy count in a PySpark DataFrame:
df.groupBy('team').count().withColumnRenamed('count', 'row_count').show()
This particular example counts the number of rows in the DataFrame, grouped by the team column.
Then it uses the withColumnRenamed function to rename the “count” column to “row_count” in the resulting DataFrame.
The following example shows how to use this syntax in practice.
Example: How to Use Alias After Groupby Count in PySpark
Suppose we have the following PySpark DataFrame that contains information about various basketball players:
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
#define data
data = [['A', 'Guard', 11],
['A', 'Guard', 8],
['A', 'Forward', 22],
['A', 'Forward', 22],
['B', 'Guard', 14],
['B', 'Guard', 14],
['B', 'Guard', 13],
['B', 'Forward', 7],
['C', 'Guard', 8],
['C', 'Forward', 5]]
#define column names
columns = ['team', 'position', 'points']
#create dataframe using data and column names
df = spark.createDataFrame(data, columns)
#view dataframe
df.show()
+----+--------+------+
|team|position|points|
+----+--------+------+
| A| Guard| 11|
| A| Guard| 8|
| A| Forward| 22|
| A| Forward| 22|
| B| Guard| 14|
| B| Guard| 14|
| B| Guard| 13|
| B| Forward| 7|
| C| Guard| 8|
| C| Forward| 5|
+----+--------+------+
We can use the following syntax to count the number of rows in the DataFrame grouped by the values in the team column:
#count number of rows by team
df.groupBy('team').count().show()

+----+-----+
|team|count|
+----+-----+
|   A|    4|
|   B|    4|
|   C|    2|
+----+-----+
By default, the count function simply uses “count” as the column name in the resulting DataFrame.
However, we could use the following syntax to instead use the name row_count as the column name in the resulting DataFrame:
#count number of rows by team and rename 'count' column to 'row_count'
df.groupBy('team').count().withColumnRenamed('count', 'row_count').show()

+----+---------+
|team|row_count|
+----+---------+
|   A|        4|
|   B|        4|
|   C|        2|
+----+---------+
The DataFrame now uses row_count as the column name, just as we specified.
Additional Resources
The following tutorials explain how to perform other common tasks in PySpark:
How to Count Distinct Values in PySpark
How to Count by Group in PySpark
How to Count Null Values in PySpark