PySpark: How to Remove Special Characters from Column


You can use the following syntax to remove special characters from a column in a PySpark DataFrame:

from pyspark.sql.functions import regexp_replace

#remove all special characters from each string in 'team' column
df_new = df.withColumn('team', regexp_replace('team', '[^a-zA-Z0-9]', ''))

The following example shows how to use this syntax in practice.

Example: How to Remove Special Characters from Column in PySpark

Suppose we have the following PySpark DataFrame that contains information about various basketball players:

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

#define data
data = [['Mavs^', 18], 
        ['Ne%ts', 33], 
        ['Hawk**s', 12], 
        ['Mavs@', 15], 
        ['Hawks!', 19],
        ['(Cavs)', 24],
        ['Magic', 28]] 
  
#define column names
columns = ['team', 'points'] 
  
#create dataframe using data and column names
df = spark.createDataFrame(data, columns) 
  
#view dataframe
df.show()

+-------+------+
|   team|points|
+-------+------+
|  Mavs^|    18|
|  Ne%ts|    33|
|Hawk**s|    12|
|  Mavs@|    15|
| Hawks!|    19|
| (Cavs)|    24|
|  Magic|    28|
+-------+------+

Notice that several of the team names contain special characters.
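
If you want to first identify which rows contain special characters, one quick check (shown here as a minimal sketch) is to filter with the rlike function, which matches each value against a regular expression:

from pyspark.sql.functions import col

#show only rows where 'team' contains at least one special character
df.filter(col('team').rlike('[^a-zA-Z0-9]')).show()

This uses the same character class as the replacement pattern below, so it flags exactly the rows that regexp_replace will modify.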

We can use the following syntax to remove all special characters from each string in the team column of the DataFrame:

from pyspark.sql.functions import regexp_replace

#remove all special characters from each string in 'team' column
df_new = df.withColumn('team', regexp_replace('team', '[^a-zA-Z0-9]', ''))

#view new DataFrame
df_new.show()

+-----+------+
| team|points|
+-----+------+
| Mavs|    18|
| Nets|    33|
|Hawks|    12|
| Mavs|    15|
|Hawks|    19|
| Cavs|    24|
|Magic|    28|
+-----+------+

Notice that all special characters from each team name have been removed.

Note that we used the PySpark regexp_replace function to search for a specific pattern and replace each match with an empty string.

In this particular example, the pattern [^a-zA-Z0-9] matches any character that is not a lowercase letter, an uppercase letter, or a digit, and each match is replaced with an empty string.

The end result is that all special characters are removed from each string.
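
The character class is also easy to adjust if you want to keep additional characters. For example, if the team names contained spaces that you wanted to preserve, you could add a space inside the brackets (a minimal sketch; df_spaces is just an illustrative name):

from pyspark.sql.functions import regexp_replace

#remove every character that is not a letter, a digit, or a space
df_spaces = df.withColumn('team', regexp_replace('team', '[^a-zA-Z0-9 ]', ''))

Any character listed after the ^ inside the brackets is preserved; everything else is matched and removed.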

Note: You can find the complete documentation for the PySpark regexp_replace function in the official PySpark API reference.

Additional Resources

The following tutorials explain how to perform other common tasks in PySpark:

PySpark: How to Count Values in Column with Condition
PySpark: How to Drop Rows that Contain a Specific Value
PySpark: How to Conditionally Replace Value in Column
