PySpark: How to Check if Column Exists in DataFrame


You can use the following methods in PySpark to check if a particular column exists in a DataFrame:

Method 1: Check if Column Exists (Case-Sensitive)

'points' in df.columns

Method 2: Check if Column Exists (Not Case-Sensitive)

'points'.upper() in (name.upper() for name in df.columns)

The following examples show how to use each method in practice with this PySpark DataFrame:

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

#define data
data = [['A', 'East', 11, 4], 
        ['A', None, 8, 9], 
        ['A', 'East', 10, 3], 
        ['B', 'West', None, 12], 
        ['B', 'West', None, 4], 
        ['C', 'East', 5, 2]] 
  
#define column names
columns = ['team', 'conference', 'points', 'assists'] 
  
#create dataframe using data and column names
df = spark.createDataFrame(data, columns) 
  
#view dataframe
df.show()

+----+----------+------+-------+
|team|conference|points|assists|
+----+----------+------+-------+
|   A|      East|    11|      4|
|   A|      null|     8|      9|
|   A|      East|    10|      3|
|   B|      West|  null|     12|
|   B|      West|  null|      4|
|   C|      East|     5|      2|
+----+----------+------+-------+

Example 1: Check if Column Exists (Case-Sensitive)

We can use the following syntax to check if the column name points exists in the DataFrame:

#check if column name 'points' exists in the DataFrame
'points' in df.columns

True

The output returns True since the column name points does indeed exist in the DataFrame.

Note that this syntax is case-sensitive. If we search instead for the column name Points, we receive an output of False, since the case of our search term doesn’t exactly match the case of the column name in the DataFrame:

#check if column name 'Points' exists in the DataFrame
'Points' in df.columns

False
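Since df.columns is an ordinary Python list of strings, the same membership test can be applied to several column names at once. The following is a minimal sketch, not a built-in PySpark feature: the helper name missing_columns is hypothetical, and a plain list stands in for df.columns so that the snippet runs without a Spark session.

```python
def missing_columns(existing, required):
    """Return the names in `required` that are not in `existing` (case-sensitive)."""
    existing_set = set(existing)
    return [name for name in required if name not in existing_set]

# stand-in for df.columns from the DataFrame above (assumption for illustration)
columns = ['team', 'conference', 'points', 'assists']

# 'points' exists; 'Points' differs in case; 'rebounds' is absent
print(missing_columns(columns, ['points', 'Points', 'rebounds']))
# ['Points', 'rebounds']
```

With a real DataFrame we would pass df.columns as the first argument. An empty list means every required column exists.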

Example 2: Check if Column Exists (Not Case-Sensitive)

We can use the following syntax to check if the column name Points exists in the DataFrame, regardless of case:

#check if column name 'Points' exists in the DataFrame
'Points'.upper() in (name.upper() for name in df.columns) 

True

The output returns True even though the case of the column name that we searched for didn’t precisely match the column name of points in the DataFrame.

Note: In this example we used the upper() string method to convert both our search term and every column name in the DataFrame to uppercase before comparing them.

This allowed us to perform a case-insensitive search.
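A case-insensitive check only tells us that some matching column exists, not how it is actually spelled. The sketch below returns the real column name so that it can be used in later operations. The helper name find_column is hypothetical, and a plain list again stands in for df.columns.

```python
def find_column(existing, name):
    """Return the column in `existing` that matches `name` ignoring case, or None."""
    target = name.upper()
    for col in existing:
        if col.upper() == target:
            return col
    return None

# stand-in for df.columns from the DataFrame above (assumption for illustration)
columns = ['team', 'conference', 'points', 'assists']

print(find_column(columns, 'Points'))    # points
print(find_column(columns, 'rebounds'))  # None
```

If two columns differ only by case, this sketch returns the first match, so it is best suited to DataFrames whose column names are unique when case is ignored.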

Additional Resources

The following tutorials explain how to perform other common tasks in PySpark:

PySpark: How to Select Columns by Index in DataFrame
PySpark: How to Select Rows Based on Column Values
PySpark: How to Print One Column of a DataFrame
