PySpark: How to Convert Column to Lowercase


You can use the following syntax to convert a column to lowercase in a PySpark DataFrame:

from pyspark.sql.functions import lower

df = df.withColumn('my_column', lower(df['my_column']))

The following example shows how to use this syntax in practice.

Example: How to Convert Column to Lowercase in PySpark

Suppose we create the following PySpark DataFrame that contains information about various basketball players:

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

#define data
data = [['A', 'East', 11, 4], 
        ['A', 'East', 8, 9], 
        ['A', 'East', 10, 3], 
        ['B', 'West', 6, 12], 
        ['B', 'West', 6, 4], 
        ['C', 'East', 5, 2]] 
  
#define column names
columns = ['team', 'conference', 'points', 'assists'] 
  
#create dataframe using data and column names
df = spark.createDataFrame(data, columns) 
  
#view dataframe
df.show()

+----+----------+------+-------+
|team|conference|points|assists|
+----+----------+------+-------+
|   A|      East|    11|      4|
|   A|      East|     8|      9|
|   A|      East|    10|      3|
|   B|      West|     6|     12|
|   B|      West|     6|      4|
|   C|      East|     5|      2|
+----+----------+------+-------+

Suppose we would like to convert all strings in the conference column to lowercase.

We can use the following syntax to do so:

from pyspark.sql.functions import lower

#convert 'conference' column to lowercase
df = df.withColumn('conference', lower(df['conference']))

#view updated DataFrame
df.show()

+----+----------+------+-------+
|team|conference|points|assists|
+----+----------+------+-------+
|   A|      east|    11|      4|
|   A|      east|     8|      9|
|   A|      east|    10|      3|
|   B|      west|     6|     12|
|   B|      west|     6|      4|
|   C|      east|     5|      2|
+----+----------+------+-------+

Notice that all strings in the conference column of the updated DataFrame are now lowercase.

Note #1: We used the withColumn function to return a new DataFrame with the conference column modified and all other columns left unchanged.

Note #2: You can find the complete documentation for the PySpark withColumn function in the official PySpark API reference.

Additional Resources

The following tutorials explain how to perform other common tasks in PySpark:

PySpark: How to Select Multiple Columns
PySpark: How to Select Columns with Alias
PySpark: How to Select Columns by Index
