You can use the following methods to extract certain substrings from a column in a PySpark DataFrame:
Method 1: Extract Substring from Beginning of String
from pyspark.sql import functions as F #extract first three characters from team column df_new = df.withColumn('first3', F.substring('team', 1, 3))
Method 2: Extract Substring from Middle of String
from pyspark.sql import functions as F #extract four characters starting from position two in team column df_new = df.withColumn('mid4', F.substring('team', 2, 4))
Method 3: Extract Substring from End of String
from pyspark.sql import functions as F #extract last three characters from team column df_new = df.withColumn('last3', F.substring('team', -3, 3))
Method 4: Extract Substring Before Specific Character
from pyspark.sql import functions as F #extract all characters before space in team column df_new = df.withColumn('beforespace', F.substring_index('team', ' ', 1))
Method 5: Extract Substring After Specific Character
from pyspark.sql import functions as F #extract all characters after space in team column df_new = df.withColumn('afterspace', F.substring_index('team', ' ', -1))
The following examples show how to use each method in practice with the following PySpark DataFrame:
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
#define data
data = [['Dallas Mavs', 18],
['Brooklyn Nets', 33],
['Atlanta Hawks', 12],
['Boston Celtics', 15],
['Miami Heat', 19],
['Cleveland Cavs', 24],
['Orlando Magic', 28]]
#define column names
columns = ['team', 'points']
#create dataframe using data and column names
df = spark.createDataFrame(data, columns)
#view dataframe
df.show()
+--------------+------+
| team|points|
+--------------+------+
| Dallas Mavs| 18|
| Brooklyn Nets| 33|
| Atlanta Hawks| 12|
|Boston Celtics| 15|
| Miami Heat| 19|
|Cleveland Cavs| 24|
| Orlando Magic| 28|
+--------------+------+
Example 1: Extract Substring from Beginning of String
We can use the following syntax to extract the first 3 characters from each string in the team column:
from pyspark.sql import functions as F #extract first three characters from team column df_new = df.withColumn('first3', F.substring('team', 1, 3)) #view updated DataFrame df_new.show() +--------------+------+------+ | team|points|first3| +--------------+------+------+ | Dallas Mavs| 18| Dal| | Brooklyn Nets| 33| Bro| | Atlanta Hawks| 12| Atl| |Boston Celtics| 15| Bos| | Miami Heat| 19| Mia| |Cleveland Cavs| 24| Cle| | Orlando Magic| 28| Orl| +--------------+------+------+
Example 2: Extract Substring from Middle of String
We can use the following syntax to extract the 4 characters starting from position 2 from each string in the team column:
from pyspark.sql import functions as F #extract four characters starting from position two in team column df_new = df.withColumn('mid4', F.substring('team', 2, 4)) #view updated DataFrame df_new.show() +--------------+------+----+ | team|points|mid4| +--------------+------+----+ | Dallas Mavs| 18|alla| | Brooklyn Nets| 33|rook| | Atlanta Hawks| 12|tlan| |Boston Celtics| 15|osto| | Miami Heat| 19|iami| |Cleveland Cavs| 24|leve| | Orlando Magic| 28|rlan| +--------------+------+----+
Example 3: Extract Substring from End of String
We can use the following syntax to extract the last 3 characters from each string in the team column:
from pyspark.sql import functions as F #extract last three characters from team column df_new = df.withColumn('last3', F.substring('team', -3, 3)) #view updated DataFrame df_new.show() +--------------+------+-----+ | team|points|last3| +--------------+------+-----+ | Dallas Mavs| 18| avs| | Brooklyn Nets| 33| ets| | Atlanta Hawks| 12| wks| |Boston Celtics| 15| ics| | Miami Heat| 19| eat| |Cleveland Cavs| 24| avs| | Orlando Magic| 28| gic| +--------------+------+-----+
Example 4: Extract Substring Before Specific Character
We can use the following syntax to extract all of the characters before the space from each string in the team column:
from pyspark.sql import functions as F #extract all characters before space in team column df_new = df.withColumn('beforespace', F.substring_index('team', ' ', 1)) #view updated DataFrame df_new.show() +--------------+------+-----------+ | team|points|beforespace| +--------------+------+-----------+ | Dallas Mavs| 18| Dallas| | Brooklyn Nets| 33| Brooklyn| | Atlanta Hawks| 12| Atlanta| |Boston Celtics| 15| Boston| | Miami Heat| 19| Miami| |Cleveland Cavs| 24| Cleveland| | Orlando Magic| 28| Orlando| +--------------+------+-----------+
Example 5: Extract Substring After Specific Character
We can use the following syntax to extract all of the characters after the space from each string in the team column:
from pyspark.sql import functions as F #extract all characters after space in team column df_new = df.withColumn('afterspace', F.substring_index('team', ' ', -1)) #view updated DataFrame df_new.show() +--------------+------+----------+ | team|points|afterspace| +--------------+------+----------+ | Dallas Mavs| 18| Mavs| | Brooklyn Nets| 33| Nets| | Atlanta Hawks| 12| Hawks| |Boston Celtics| 15| Celtics| | Miami Heat| 19| Heat| |Cleveland Cavs| 24| Cavs| | Orlando Magic| 28| Magic| +--------------+------+----------+
Additional Resources
The following tutorials explain how to perform other common tasks in PySpark:
PySpark: How to Check if Column Contains String
PySpark: How to Replace String in Column
PySpark: How to Convert String to Integer