You can use the following methods to remove specific characters from strings in a PySpark DataFrame:
Method 1: Remove Specific Characters from String
from pyspark.sql.functions import regexp_replace

#remove 'avs' from each string in team column
df_new = df.withColumn('team', regexp_replace('team', 'avs', ''))
Method 2: Remove Multiple Groups of Specific Characters from String
from pyspark.sql.functions import regexp_replace

#remove 'avs' and 'awks' from each string in team column
df_new = df.withColumn('team', regexp_replace('team', 'avs|awks', ''))
The examples below apply each method to the following PySpark DataFrame, which contains the team and points for various basketball players:
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
#define data
data = [['Mavs', 18],
        ['Nets', 33],
        ['Hawks', 12],
        ['Mavs', 15],
        ['Hawks', 19],
        ['Cavs', 24],
        ['Magic', 28]]
#define column names
columns = ['team', 'points']
#create dataframe using data and column names
df = spark.createDataFrame(data, columns)
#view dataframe
df.show()
+-----+------+
| team|points|
+-----+------+
| Mavs|    18|
| Nets|    33|
|Hawks|    12|
| Mavs|    15|
|Hawks|    19|
| Cavs|    24|
|Magic|    28|
+-----+------+
Example 1: Remove Specific Characters from String
We can use the following syntax to remove “avs” from any string in the team column of the DataFrame:
from pyspark.sql.functions import regexp_replace

#remove 'avs' from each string in team column
df_new = df.withColumn('team', regexp_replace('team', 'avs', ''))

#view new DataFrame
df_new.show()

+-----+------+
| team|points|
+-----+------+
|    M|    18|
| Nets|    33|
|Hawks|    12|
|    M|    15|
|Hawks|    19|
|    C|    24|
|Magic|    28|
+-----+------+
Notice that the string “avs” has been removed from three team names in the team column of the DataFrame.
Example 2: Remove Multiple Groups of Specific Characters from String
We can use the following syntax to remove the strings “avs” and “awks” from any string in the team column of the DataFrame:
from pyspark.sql.functions import regexp_replace

#remove 'avs' and 'awks' from each string in team column
df_new = df.withColumn('team', regexp_replace('team', 'avs|awks', ''))

#view new DataFrame
df_new.show()

+-----+------+
| team|points|
+-----+------+
|    M|    18|
| Nets|    33|
|    H|    12|
|    M|    15|
|    H|    19|
|    C|    24|
|Magic|    28|
+-----+------+
Notice that the strings “avs” and “awks” have both been removed from the team names in the team column of the DataFrame.
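When the substrings to remove come from a list rather than being typed by hand, the 'avs|awks' pattern can be assembled programmatically. The sketch below is one possible way to do that, not part of the original tutorial: the helper names build_removal_pattern and remove_substrings are hypothetical, and it assumes that the escaping produced by Python's re.escape is also accepted by the regex engine Spark uses, which should hold for common punctuation.

```python
import re


def build_removal_pattern(substrings):
    """Join literal substrings into one regex alternation (hypothetical helper)."""
    # Put longer substrings first so a short one never pre-empts a longer one
    # that starts with the same characters.
    ordered = sorted(set(substrings), key=lambda s: (-len(s), s))
    # re.escape neutralises regex metacharacters such as '.' or '+'.
    return "|".join(re.escape(s) for s in ordered)


def remove_substrings(df, column, substrings):
    """Return a new DataFrame with each listed substring removed from `column`."""
    # Imported here so the pattern helper above can be used without Spark.
    from pyspark.sql.functions import regexp_replace

    pattern = build_removal_pattern(substrings)
    return df.withColumn(column, regexp_replace(column, pattern, ""))


pattern = build_removal_pattern(["avs", "awks"])
print(pattern)  # awks|avs
```

With the DataFrame from this tutorial, remove_substrings(df, 'team', ['avs', 'awks']).show() should give the same result as Example 2. Escaping matters because the second argument of regexp_replace is a regular expression, so an unescaped '.' would match any character.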
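Note #1: The regexp_replace function is case-sensitive by default, so the pattern 'avs' does not match 'AVS'. The pattern is also interpreted as a regular expression, so characters such as '.' or '|' have special meaning unless they are escaped.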
Note #2: You can find the complete documentation for the PySpark regexp_replace function in the official PySpark API reference.
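If the case-sensitive behavior described in Note #1 is not what you want, one common approach is to prefix the pattern with the inline flag (?i), which the regex engine behind regexp_replace is expected to honor. The sketch below is an illustration under that assumption; the helper names ignore_case and remove_ignoring_case are hypothetical and do not come from the PySpark API.

```python
import re


def ignore_case(pattern):
    """Prefix a regex with the inline (?i) flag so matching ignores case."""
    return "(?i)" + pattern


def remove_ignoring_case(df, column, pattern):
    """Return a new DataFrame with case-insensitive matches removed from `column`."""
    # Imported here so ignore_case can be checked locally without Spark.
    from pyspark.sql.functions import regexp_replace

    return df.withColumn(column, regexp_replace(column, ignore_case(pattern), ""))


# Python's re module accepts the same inline flag, which allows a quick local check.
print(re.sub(ignore_case("avs"), "", "MAVS"))  # M
```

With the DataFrame from this tutorial, remove_ignoring_case(df, 'team', 'avs') would be expected to strip 'avs', 'AVS', 'Avs', and so on from the team column.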
Additional Resources
The following tutorials explain how to perform other common tasks in PySpark:
PySpark: How to Count Values in Column with Condition
PySpark: How to Drop Rows that Contain a Specific Value
PySpark: How to Conditionally Replace Value in Column