You can use the following methods to count distinct values in a PySpark DataFrame:

**Method 1: Count Distinct Values in One Column**

```python
from pyspark.sql.functions import col, countDistinct

df.agg(countDistinct(col('my_column')).alias('my_column')).show()
```

**Method 2: Count Distinct Values in Each Column**

```python
from pyspark.sql.functions import col, countDistinct

df.agg(*(countDistinct(col(c)).alias(c) for c in df.columns)).show()
```

**Method 3: Count Number of Distinct Rows in DataFrame**

```python
df.distinct().count()
```

The following examples show how to use each method in practice with the following PySpark DataFrame that contains information about various basketball players:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

#define data
data = [['A', 'Guard', 11],
        ['A', 'Guard', 8],
        ['A', 'Forward', 22],
        ['A', 'Forward', 22],
        ['B', 'Guard', 14],
        ['B', 'Guard', 14],
        ['B', 'Forward', 13],
        ['B', 'Forward', 7]]

#define column names
columns = ['team', 'position', 'points']

#create DataFrame using data and column names
df = spark.createDataFrame(data, columns)

#view DataFrame
df.show()
```

```
+----+--------+------+
|team|position|points|
+----+--------+------+
|   A|   Guard|    11|
|   A|   Guard|     8|
|   A| Forward|    22|
|   A| Forward|    22|
|   B|   Guard|    14|
|   B|   Guard|    14|
|   B| Forward|    13|
|   B| Forward|     7|
+----+--------+------+
```

**Example 1: Count Distinct Values in One Column**

We can use the following syntax to count the number of distinct values in just the **team** column of the DataFrame:

```python
from pyspark.sql.functions import col, countDistinct

#count number of distinct values in 'team' column
df.agg(countDistinct(col('team')).alias('team')).show()
```

```
+----+
|team|
+----+
|   2|
+----+
```

From the output we can see that there are **2** distinct values in the **team** column.
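Conceptually, `countDistinct` on one column is the same as collecting that column's values and counting the unique ones. As a quick sanity check, here is a plain-Python sketch of that logic using the same sample data as the DataFrame above (no Spark required):

```python
#same sample data as the DataFrame above
data = [['A', 'Guard', 11],
        ['A', 'Guard', 8],
        ['A', 'Forward', 22],
        ['A', 'Forward', 22],
        ['B', 'Guard', 14],
        ['B', 'Guard', 14],
        ['B', 'Forward', 13],
        ['B', 'Forward', 7]]

#gather the 'team' value (index 0) from each row and count unique entries
distinct_teams = len({row[0] for row in data})

print(distinct_teams)  #2
```

This matches the PySpark result: the set collapses the repeated `'A'` and `'B'` values, leaving 2 distinct teams.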

**Example 2: Count Distinct Values in Each Column**

We can use the following syntax to count the number of distinct values in each column of the DataFrame:

```python
from pyspark.sql.functions import col, countDistinct

#count number of distinct values in each column
df.agg(*(countDistinct(col(c)).alias(c) for c in df.columns)).show()
```

```
+----+--------+------+
|team|position|points|
+----+--------+------+
|   2|       2|     6|
+----+--------+------+
```

From the output we can see:

- There are **2** unique values in the **team** column.
- There are **2** unique values in the **position** column.
- There are **6** unique values in the **points** column.
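The per-column aggregation above can also be mirrored in plain Python, which makes it clear what the generator expression is doing: one distinct count per column name. This is a conceptual sketch over the same sample data, not a Spark call:

```python
#same sample data and column names as the DataFrame above
data = [['A', 'Guard', 11],
        ['A', 'Guard', 8],
        ['A', 'Forward', 22],
        ['A', 'Forward', 22],
        ['B', 'Guard', 14],
        ['B', 'Guard', 14],
        ['B', 'Forward', 13],
        ['B', 'Forward', 7]]
columns = ['team', 'position', 'points']

#for each column, collect its values across rows and count the unique ones
distinct_counts = {name: len({row[i] for row in data})
                   for i, name in enumerate(columns)}

print(distinct_counts)  #{'team': 2, 'position': 2, 'points': 6}
```

The dictionary holds the same three counts that PySpark returns as a one-row result.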

**Example 3: Count Number of Distinct Rows in DataFrame**

We can use the following syntax to count the number of distinct rows in the DataFrame:

```python
#count number of distinct rows in DataFrame
df.distinct().count()
```

```
6
```

From the output we can see that there are **6** distinct rows in the DataFrame.
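`df.distinct()` deduplicates entire rows, so two rows only collapse together when they match in every column. A plain-Python sketch of the same idea, treating each row as a tuple in a set:

```python
#same sample data as the DataFrame above
data = [['A', 'Guard', 11],
        ['A', 'Guard', 8],
        ['A', 'Forward', 22],
        ['A', 'Forward', 22],
        ['B', 'Guard', 14],
        ['B', 'Guard', 14],
        ['B', 'Forward', 13],
        ['B', 'Forward', 7]]

#a distinct row is a unique tuple of all column values;
#the duplicated ['A', 'Forward', 22] and ['B', 'Guard', 14] rows collapse away
distinct_rows = len({tuple(row) for row in data})

print(distinct_rows)  #6
```

Of the 8 original rows, two are exact duplicates of other rows, which is why both the sketch and `df.distinct().count()` return 6.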

**Additional Resources**

The following tutorials explain how to perform other common tasks in PySpark:

PySpark: How to Use “OR” Operator

PySpark: How to Use “AND” Operator

PySpark: How to Use “NOT IN” Operator