You can use the following methods to select rows based on column values in a PySpark DataFrame:
Method 1: Select Rows where Column is Equal to Specific Value
#select rows where 'team' column is equal to 'B'
df.where(df.team=='B').show()
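Note that where() and filter() are aliases in PySpark, so either method name works with either style of condition. As a minimal sketch, the same filter can also be written as a SQL-style expression string:

#equivalent ways to select rows where 'team' column is equal to 'B'
df.filter(df.team=='B').show()
df.where("team = 'B'").show()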
Method 2: Select Rows where Column Value is in List of Values
#select rows where 'team' column is equal to 'A' or 'B'
df.filter(df.team.isin('A','B')).show()
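To select the rows whose value is not in the list, you can negate the condition with the ~ operator. A minimal sketch using the same column:

#select rows where 'team' column is NOT equal to 'A' or 'B'
df.filter(~df.team.isin('A','B')).show()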
Method 3: Select Rows Based on Multiple Column Conditions
#select rows where 'team' column is 'A' and 'points' column is greater than 9
df.where((df.team=='A') & (df.points>9)).show()
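The same pattern handles "or" logic: wrap each condition in parentheses and combine them with | instead of &. For example, a sketch using the same columns:

#select rows where 'team' column is 'A' or 'points' column is greater than 9
df.where((df.team=='A') | (df.points>9)).show()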
The following examples show how to use each of these methods in practice with the PySpark DataFrame below:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

#define data
data = [['A', 'East', 11],
        ['A', 'East', 8],
        ['A', 'East', 10],
        ['B', 'West', 6],
        ['B', 'West', 6],
        ['C', 'East', 5]]

#define column names
columns = ['team', 'conference', 'points']

#create DataFrame using data and column names
df = spark.createDataFrame(data, columns)

#view DataFrame
df.show()

+----+----------+------+
|team|conference|points|
+----+----------+------+
|   A|      East|    11|
|   A|      East|     8|
|   A|      East|    10|
|   B|      West|     6|
|   B|      West|     6|
|   C|      East|     5|
+----+----------+------+
Example 1: Select Rows where Column is Equal to Specific Value
We can use the following syntax to select only the rows where the team column is equal to B:
#select rows where 'team' column is equal to 'B'
df.where(df.team=='B').show()

+----+----------+------+
|team|conference|points|
+----+----------+------+
|   B|      West|     6|
|   B|      West|     6|
+----+----------+------+
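If you prefer not to reference columns as attributes of the DataFrame, the same condition can be written with the col() function from pyspark.sql.functions, which is handy when the column name contains spaces or is stored in a variable. A quick sketch:

from pyspark.sql.functions import col

#select rows where 'team' column is equal to 'B'
df.where(col('team')=='B').show()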
Notice that only the rows where the team column is equal to B are returned.
Example 2: Select Rows where Column Value is in List of Values
We can use the following syntax to select only the rows where the team column is equal to A or B:
#select rows where 'team' column is equal to 'A' or 'B'
df.filter(df.team.isin('A','B')).show()

+----+----------+------+
|team|conference|points|
+----+----------+------+
|   A|      East|    11|
|   A|      East|     8|
|   A|      East|    10|
|   B|      West|     6|
|   B|      West|     6|
+----+----------+------+
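The values passed to isin() don't have to be hard-coded; the function also accepts a Python list, which is useful when the list of teams is built elsewhere in your code. A sketch (the valid_teams variable name is just for illustration):

#define list of values, then filter on it
valid_teams = ['A', 'B']
df.filter(df.team.isin(valid_teams)).show()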
Notice that only the rows where the team column is equal to either A or B are returned.
Example 3: Select Rows Based on Multiple Column Conditions
We can use the following syntax to select only the rows where the team column is equal to A and the points column is greater than 9:
#select rows where 'team' column is 'A' and 'points' column is greater than 9
df.where((df.team=='A') & (df.points>9)).show()

+----+----------+------+
|team|conference|points|
+----+----------+------+
|   A|      East|    11|
|   A|      East|    10|
+----+----------+------+
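If you only need to know how many rows satisfy both conditions rather than display them, you can swap .show() for .count(). A minimal sketch:

#count rows where 'team' is 'A' and 'points' is greater than 9
df.where((df.team=='A') & (df.points>9)).count()

This returns 2 for the example DataFrame above.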
Notice that only the two rows that met both conditions are returned.
Additional Resources
The following tutorials explain how to perform other common tasks in PySpark:
PySpark: How to Select Columns by Index in DataFrame
PySpark: How to Select Rows by Index in DataFrame
PySpark: How to Find Unique Values in a Column