You can use the following syntax to count the number of distinct values in one column of a PySpark DataFrame, grouped by another column:

```
from pyspark.sql.functions import countDistinct

df.groupBy('team').agg(countDistinct('points')).show()
```

This particular example calculates the number of distinct values in the **points** column, grouped by the values in the **team** column.

The following example shows how to use this syntax in practice.

**Example: How to Use groupBy with Count Distinct in PySpark**

Suppose we have the following PySpark DataFrame that contains information about the points scored by various basketball players:

```
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

#define data
data = [['A', 'Guard', 11],
        ['A', 'Guard', 8],
        ['A', 'Forward', 22],
        ['A', 'Forward', 22],
        ['B', 'Guard', 14],
        ['B', 'Guard', 14],
        ['B', 'Forward', 13],
        ['B', 'Forward', 14],
        ['C', 'Forward', 23],
        ['C', 'Guard', 30]]

#define column names
columns = ['team', 'position', 'points']

#create dataframe using data and column names
df = spark.createDataFrame(data, columns)

#view dataframe
df.show()

+----+--------+------+
|team|position|points|
+----+--------+------+
|   A|   Guard|    11|
|   A|   Guard|     8|
|   A| Forward|    22|
|   A| Forward|    22|
|   B|   Guard|    14|
|   B|   Guard|    14|
|   B| Forward|    13|
|   B| Forward|    14|
|   C| Forward|    23|
|   C|   Guard|    30|
+----+--------+------+
```

We can use the following syntax to calculate the number of distinct values in the **points** column, grouped by the values in the **team** column:

```
from pyspark.sql.functions import countDistinct

#calculate distinct values in points column, grouped by team column
df.groupBy('team').agg(countDistinct('points')).show()

+----+-------------+
|team|count(points)|
+----+-------------+
|   B|            2|
|   C|            2|
|   A|            3|
+----+-------------+
```

The resulting DataFrame shows the number of distinct values in the **points** column, grouped by the values in the **team** column.

For example, we can see:

- There are **2** distinct values in the points column for team B.
- There are **2** distinct values in the points column for team C.
- There are **3** distinct values in the points column for team A.

If you would like to give the **count(points)** column a different name, you can use the **alias** function as follows:

```
from pyspark.sql.functions import countDistinct

#calculate distinct values in points column, grouped by team column
df.groupBy('team').agg(countDistinct('points').alias('distinct_points')).show()

+----+---------------+
|team|distinct_points|
+----+---------------+
|   B|              2|
|   C|              2|
|   A|              3|
+----+---------------+
```

The resulting DataFrame shows the number of distinct points values for each team with the distinct column now named **distinct_points**, just as we specified in the **alias** function.

**Note**: You can find the complete documentation for the PySpark **groupBy** function in the official PySpark API documentation.

**Additional Resources**

The following tutorials explain how to perform other common tasks in PySpark:

How to Use groupBy on Multiple Columns in PySpark

How to Sum Multiple Columns in PySpark DataFrame

How to Add Multiple Columns to PySpark DataFrame