You can use the following syntax to remove leading zeros from a column in a PySpark DataFrame:
from pyspark.sql import functions as F
#remove leading zeros from values in 'employee_ID' column
df_new = df.withColumn('employee_ID', F.regexp_replace('employee_ID', r'^[0]*', ''))
This particular example removes all leading zeros from values in the employee_ID column and leaves all other zeros untouched.
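Before applying this to a DataFrame, it can help to preview what the pattern matches. Note that regexp_replace uses Java regular expressions, but this simple pattern behaves identically in Python's built-in re module, so a quick sketch:

```python
import re

# '^[0]*' anchors at the start of the string and matches zero or more
# literal '0' characters, so only leading zeros are removed; interior
# zeros (as in '009230') are left untouched
for value in ['000501', '009230', '011543']:
    print(re.sub(r'^[0]*', '', value))
```

This prints 501, 9230, and 11543 on separate lines, confirming that only the leading zeros are stripped.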
The following example shows how to use this syntax in practice.
Example: How to Remove Leading Zeros from Column in PySpark
Suppose we have the following PySpark DataFrame that contains information about sales made by various employees at some company:
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
#define data
data = [['000501', 18],
['000034', 33],
['009230', 12],
['000451', 15],
['000239', 19],
['002295', 24],
['011543', 28]]
#define column names
columns = ['employee_ID', 'sales']
#create dataframe using data and column names
df = spark.createDataFrame(data, columns)
#view dataframe
df.show()
+-----------+-----+
|employee_ID|sales|
+-----------+-----+
| 000501| 18|
| 000034| 33|
| 009230| 12|
| 000451| 15|
| 000239| 19|
| 002295| 24|
| 011543| 28|
+-----------+-----+
Notice that each string in the employee_ID column contains leading zeros.
We can use the following syntax to remove the leading zeros from each string in this column:
from pyspark.sql import functions as F
#remove leading zeros from values in 'employee_ID' column
df_new = df.withColumn('employee_ID', F.regexp_replace('employee_ID', r'^[0]*', ''))
#view updated DataFrame
df_new.show()
+-----------+-----+
|employee_ID|sales|
+-----------+-----+
| 501| 18|
| 34| 33|
| 9230| 12|
| 451| 15|
| 239| 19|
| 2295| 24|
| 11543| 28|
+-----------+-----+
Notice that the leading zeros have been removed from each string in the employee_ID column.
Note that we used the PySpark regexp_replace function to replace the leading zeros in each string with an empty string.
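One edge case worth noting: an ID made up entirely of zeros (e.g. '0000') would become an empty string under this pattern. If you would rather keep a single zero in that case, a negative lookahead works. The following is a minimal sketch using Python's re module (the pattern is also valid Java regex, which regexp_replace uses, so the same pattern string could be passed to F.regexp_replace):

```python
import re

# '^0+(?!$)' matches leading zeros only when they are NOT followed by
# the end of the string, so an all-zero value keeps its final '0'
# instead of collapsing to an empty string
pattern = r'^0+(?!$)'

print(re.sub(pattern, '', '0000'))    # a single '0' is kept
print(re.sub(pattern, '', '000501'))  # '501', same as before
```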
You can find the complete documentation for regexp_replace in the official PySpark API reference.
Additional Resources
The following tutorials explain how to perform other common tasks in PySpark:
PySpark: How to Remove Special Characters from Column
PySpark: How to Remove Specific Characters from Strings
PySpark: How to Remove Spaces from Column Names