You can use the following methods in PySpark to filter a DataFrame for rows where the value in one specific column, or in every column, is not null:
Method 1: Filter for Rows where Value is Not Null in Specific Column
#filter for rows where value is not null in 'points' column
df.filter(df.points.isNotNull()).show()
Method 2: Filter for Rows with No Null Values in Any Column
#keep only rows that have no null values in any column
df.dropna().show()
The following examples show how to use each method in practice with the following PySpark DataFrame:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

#define data
data = [['A', 'East', 11, 4],
        ['A', None, 8, 9],
        ['A', 'East', 10, 3],
        ['B', 'West', None, 12],
        ['B', 'West', None, 4],
        ['C', 'East', 5, 2]]

#define column names
columns = ['team', 'conference', 'points', 'assists']

#create dataframe using data and column names
df = spark.createDataFrame(data, columns)

#view dataframe
df.show()

+----+----------+------+-------+
|team|conference|points|assists|
+----+----------+------+-------+
|   A|      East|    11|      4|
|   A|      null|     8|      9|
|   A|      East|    10|      3|
|   B|      West|  null|     12|
|   B|      West|  null|      4|
|   C|      East|     5|      2|
+----+----------+------+-------+
Example 1: Filter for Rows where Value is Not Null in Specific Column
We can use the following syntax to filter the DataFrame to only show rows where the value in the points column is not null:
#filter for rows where value is not null in 'points' column
df.filter(df.points.isNotNull()).show()
+----+----------+------+-------+
|team|conference|points|assists|
+----+----------+------+-------+
|   A|      East|    11|      4|
|   A|      null|     8|      9|
|   A|      East|    10|      3|
|   C|      East|     5|      2|
+----+----------+------+-------+
The resulting DataFrame only contains rows where the value in the points column is not null.
Example 2: Filter for Rows with No Null Values in Any Column
We can use the following syntax to filter the DataFrame to only show rows that have no null values in any column:
#keep only rows that have no null values in any column
df.dropna().show()
+----+----------+------+-------+
|team|conference|points|assists|
+----+----------+------+-------+
|   A|      East|    11|      4|
|   A|      East|    10|      3|
|   C|      East|     5|      2|
+----+----------+------+-------+
The resulting DataFrame only contains rows where there are no null values in any column.