PySpark: How to Check if Value Exists in Column


You can use the following syntax to check if a specific value exists in a column of a PySpark DataFrame:

df.filter(df.position.contains('Guard')).count() > 0

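This particular example checks whether any value in the column named position contains the string ‘Guard’ and returns either True or False. Note that contains performs a substring match, so a value such as ‘Point Guard’ would also count as a match.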

The following example shows how to use this syntax in practice.

Example: Check if Value Exists in Column in PySpark DataFrame

Suppose we have the following PySpark DataFrame that contains information about various basketball players:

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

#define data
data = [['A', 'Guard', 11, 4], 
        ['A', 'Forward', 8, 5], 
        ['B', 'Guard', 22, 6], 
        ['A', 'Forward', 22, 7], 
        ['C', 'Guard', 14, 12], 
        ['A', 'Guard', 14, 8],
        ['B', 'Forward', 13, 9],
        ['B', 'Center', 7, 9]]
  
#define column names
columns = ['team', 'position', 'points', 'assists'] 
  
#create dataframe using data and column names
df = spark.createDataFrame(data, columns) 
  
#view dataframe
df.show()

+----+--------+------+-------+
|team|position|points|assists|
+----+--------+------+-------+
|   A|   Guard|    11|      4|
|   A| Forward|     8|      5|
|   B|   Guard|    22|      6|
|   A| Forward|    22|      7|
|   C|   Guard|    14|     12|
|   A|   Guard|    14|      8|
|   B| Forward|    13|      9|
|   B|  Center|     7|      9|
+----+--------+------+-------+

We can use the following syntax to check if the value ‘Guard’ exists in the position column:

#check if 'Guard' exists in position column
df.filter(df.position.contains('Guard')).count() > 0

True

The output is True, which indicates that at least one value in the position column contains ‘Guard’.

Note that we can also use similar syntax to check if a specific value exists in a numeric column.

For example, we can use the following syntax to check if the value 14 exists in the points column:

#check if 14 exists in points column
#use == for an exact numeric match; contains('14') would also match values such as 114
df.filter(df.points == 14).count() > 0

True

The output returns True, which indicates that the value 14 does exist in the points column.

Additional Resources

The following tutorials explain how to perform other common tasks in PySpark:

PySpark: How to Check Data Type of Columns in DataFrame
PySpark: How to Check if Column Exists in DataFrame
PySpark: How to Check if DataFrame is Empty
