PySpark: How to Compare Strings Between Two Columns


You can use either of the following methods to compare strings between two columns in a PySpark DataFrame:

Method 1: Compare Strings Between Two Columns (Case-Sensitive)

df_new = df.withColumn('equal', df.team1==df.team2)

This particular example compares the strings in the team1 and team2 columns row by row and returns either True or False to indicate whether the strings match exactly, including case.

Method 2: Compare Strings Between Two Columns (Case-Insensitive)

from pyspark.sql.functions import lower

df_new = df.withColumn('equal', lower(df.team1)==lower(df.team2))

This particular example converts both columns to lowercase before comparing them, which makes the comparison between team1 and team2 case-insensitive.

The examples below show how to use each method in practice with a DataFrame that contains two columns of basketball team names:

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

#define data
data = [['Mavs', 'Mavs'], 
        ['Nets', 'nets'], 
        ['Lakers', 'Lakers'], 
        ['Kings', 'Jazz'], 
        ['Hawks', 'HAWKS'],
        ['Wizards', 'Wizards']]
  
#define column names
columns = ['team1', 'team2'] 
  
#create dataframe using data and column names
df = spark.createDataFrame(data, columns) 
  
#view dataframe
df.show()

+-------+-------+
|  team1|  team2|
+-------+-------+
|   Mavs|   Mavs|
|   Nets|   nets|
| Lakers| Lakers|
|  Kings|   Jazz|
|  Hawks|  HAWKS|
|Wizards|Wizards|
+-------+-------+

Example 1: Compare Strings Between Two Columns (Case-Sensitive)

We can use the following syntax to compare the strings (case-sensitive) between the team1 and team2 columns:

#compare strings between team1 and team2 columns
df_new = df.withColumn('equal', df.team1==df.team2)

#view new DataFrame
df_new.show()

+-------+-------+-----+
|  team1|  team2|equal|
+-------+-------+-----+
|   Mavs|   Mavs| true|
|   Nets|   nets|false|
| Lakers| Lakers| true|
|  Kings|   Jazz|false|
|  Hawks|  HAWKS|false|
|Wizards|Wizards| true|
+-------+-------+-----+

The new column named equal contains True if the strings in the two columns match exactly (including case) and False otherwise.

Example 2: Compare Strings Between Two Columns (Case-Insensitive)

We can use the following syntax to compare the strings (case-insensitive) between the team1 and team2 columns:

from pyspark.sql.functions import lower 

#compare strings between team1 and team2 columns
df_new = df.withColumn('equal', lower(df.team1)==lower(df.team2))

#view new DataFrame
df_new.show()

+-------+-------+-----+
|  team1|  team2|equal|
+-------+-------+-----+
|   Mavs|   Mavs| true|
|   Nets|   nets| true|
| Lakers| Lakers| true|
|  Kings|   Jazz|false|
|  Hawks|  HAWKS| true|
|Wizards|Wizards| true|
+-------+-------+-----+

The new column named equal contains True if the strings in the two columns match regardless of case and False otherwise.

Note #1: We used the withColumn function to return a new DataFrame with the equal column added and all original columns left the same.

Note #2: You can find the complete documentation for the PySpark withColumn function in the official PySpark API reference.

Additional Resources

The following tutorials explain how to perform other common tasks in PySpark:

PySpark: How to Select Multiple Columns
PySpark: How to Select Columns with Alias
PySpark: How to Select Columns by Index
