PySpark: How to Check if Column Contains String


You can use the following methods to check if a column of a PySpark DataFrame contains a string:

Method 1: Check if Exact String Exists in Column

#check if 'conference' column contains exact string 'Eas' in any row
df.where(df.conference=='Eas').count()>0

Method 2: Check if Partial String Exists in Column

#check if 'conference' column contains partial string 'Eas' in any row
df.filter(df.conference.contains('Eas')).count()>0

Method 3: Count Occurrences of Partial String in Column

#count occurrences of partial string 'Eas' in 'conference' column
df.filter(df.conference.contains('Eas')).count()

The examples below show how to use each method in practice with this PySpark DataFrame:

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

#define data
data = [['A', 'East', 11], 
        ['A', 'East', 8], 
        ['A', 'East', 10], 
        ['B', 'West', 6], 
        ['B', 'West', 6], 
        ['C', 'East', 5]] 
  
#define column names
columns = ['team', 'conference', 'points'] 
  
#create dataframe using data and column names
df = spark.createDataFrame(data, columns) 
  
#view dataframe
df.show()

+----+----------+------+
|team|conference|points|
+----+----------+------+
|   A|      East|    11|
|   A|      East|     8|
|   A|      East|    10|
|   B|      West|     6|
|   B|      West|     6|
|   C|      East|     5|
+----+----------+------+

Example 1: Check if Exact String Exists in Column

The following code shows how to check if the exact string ‘Eas’ exists in the conference column of the DataFrame:

#check if 'conference' column contains exact string 'Eas' in any row
df.where(df.conference=='Eas').count()>0

False

The output is False, which tells us that no value in the conference column is exactly equal to the string ‘Eas’.

Example 2: Check if Partial String Exists in Column

The following code shows how to check if the partial string ‘Eas’ exists in the conference column of the DataFrame:

#check if 'conference' column contains partial string 'Eas' in any row
df.filter(df.conference.contains('Eas')).count()>0

True

The output is True, which tells us that at least one value in the conference column contains the partial string ‘Eas’.

Example 3: Count Occurrences of Partial String in Column

The following code shows how to count the number of rows in which the partial string ‘Eas’ occurs in the conference column of the DataFrame:

#count occurrences of partial string 'Eas' in 'conference' column
df.filter(df.conference.contains('Eas')).count()

4

The output is 4, which tells us that 4 rows have a value in the conference column that contains the partial string ‘Eas’. Note that this counts matching rows, not the number of times the substring appears within each value.

Additional Resources

The following tutorials explain how to perform other common tasks in PySpark:

PySpark: How to Select Rows Based on Column Values
PySpark: How to Select Columns by Index in DataFrame
PySpark: How to Select Rows by Index in DataFrame
