You can use the following syntax to replace zeros with null values in a PySpark DataFrame:
df_new = df.replace(0, None)
The following example shows how to use this syntax in practice.
Example: Replace Zero with Null in PySpark DataFrame
Suppose we have the following PySpark DataFrame that contains information about various basketball players:
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
#define data
data = [['A', 'Guard', 11],
['A', 'Guard', 0],
['A', 'Forward', 22],
['A', 'Forward', 22],
['B', 'Guard', 14],
['B', 'Guard', 0],
['B', 'Forward', 13],
['B', 'Forward', 7]]
#define column names
columns = ['team', 'position', 'points']
#create dataframe using data and column names
df = spark.createDataFrame(data, columns)
#view dataframe
df.show()
+----+--------+------+
|team|position|points|
+----+--------+------+
|   A|   Guard|    11|
|   A|   Guard|     0|
|   A| Forward|    22|
|   A| Forward|    22|
|   B|   Guard|    14|
|   B|   Guard|     0|
|   B| Forward|    13|
|   B| Forward|     7|
+----+--------+------+
We can use the following syntax to replace each zero with a null in the DataFrame:
#create new DataFrame that replaces all zeros with null
df_new = df.replace(0, None)
#view new DataFrame
df_new.show()
+----+--------+------+
|team|position|points|
+----+--------+------+
|   A|   Guard|    11|
|   A|   Guard|  null|
|   A| Forward|    22|
|   A| Forward|    22|
|   B|   Guard|    14|
|   B|   Guard|  null|
|   B| Forward|    13|
|   B| Forward|     7|
+----+--------+------+
Notice that each zero in the points column has been replaced with a value of null.
If we’d like, we can use the following syntax to count the number of null values present in the points column of the new DataFrame:
#count number of null values in 'points' column
df_new.where(df_new.points.isNull()).count()
2
From the output we can see that there are 2 null values in the points column of the new DataFrame.