You can use the following syntax to select the first row by group in a PySpark DataFrame:
from pyspark.sql.functions import col, lit, row_number
from pyspark.sql.window import Window

#partition DataFrame by team column
w = Window.partitionBy('team').orderBy(lit('A'))

#filter DataFrame to only show first row for each team
df.withColumn('row', row_number().over(w)).filter(col('row') == 1).drop('row').show()
This particular example selects the first row for each unique team value in the DataFrame.
The following example shows how to use this syntax in practice.
Example: How to Select First Row of Each Group in PySpark DataFrame
Suppose we have the following PySpark DataFrame that contains information about basketball players on various teams:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

#define data
data = [['A', 'Guard', 11],
        ['A', 'Guard', 8],
        ['A', 'Forward', 22],
        ['A', 'Forward', 22],
        ['B', 'Guard', 14],
        ['B', 'Guard', 14],
        ['B', 'Forward', 13],
        ['B', 'Forward', 14],
        ['C', 'Forward', 23],
        ['C', 'Guard', 30]]

#define column names
columns = ['team', 'position', 'points']

#create dataframe using data and column names
df = spark.createDataFrame(data, columns)

#view dataframe
df.show()

+----+--------+------+
|team|position|points|
+----+--------+------+
|   A|   Guard|    11|
|   A|   Guard|     8|
|   A| Forward|    22|
|   A| Forward|    22|
|   B|   Guard|    14|
|   B|   Guard|    14|
|   B| Forward|    13|
|   B| Forward|    14|
|   C| Forward|    23|
|   C|   Guard|    30|
+----+--------+------+
Suppose we would like to select the first row for each unique team.
We can use the following syntax to do so:
from pyspark.sql.functions import col, lit, row_number
from pyspark.sql.window import Window

#partition DataFrame by team column
w = Window.partitionBy('team').orderBy(lit('A'))

#filter DataFrame to only show first row for each team
df.withColumn('row', row_number().over(w)).filter(col('row') == 1).drop('row').show()

+----+--------+------+
|team|position|points|
+----+--------+------+
|   A|   Guard|    11|
|   B|   Guard|    14|
|   C| Forward|    23|
+----+--------+------+
The resulting DataFrame shows only the first row for each unique team value.
Note #1: If you would like to group by multiple columns, simply include multiple column names in the partitionBy function.
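For example, here is a minimal sketch that partitions by both the team and position columns, keeping the first row for each unique team and position combination. It assumes the same df defined above; the window name w2 is just an arbitrary choice:

from pyspark.sql.functions import col, lit, row_number
from pyspark.sql.window import Window

#partition DataFrame by both team and position columns
w2 = Window.partitionBy('team', 'position').orderBy(lit('A'))

#filter DataFrame to only show first row for each team/position pair
df.withColumn('row', row_number().over(w2)).filter(col('row') == 1).drop('row').show()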
Note #2: The syntax lit('A') simply supplies an arbitrary constant for the window's orderBy clause, since row_number() requires an ordered window. You can replace 'A' with any other value and the code will still work.
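Relatedly, if you do care which row comes back as the "first" one, you can order the window by a real column instead of an arbitrary literal. The following sketch is a variation not shown in the example above (the window name w3 is arbitrary) that keeps the row with the highest points value for each team:

from pyspark.sql.functions import col, desc, row_number
from pyspark.sql.window import Window

#order each team's rows by points in descending order
w3 = Window.partitionBy('team').orderBy(desc('points'))

#filter DataFrame to only show the highest-scoring row for each team
df.withColumn('row', row_number().over(w3)).filter(col('row') == 1).drop('row').show()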
Additional Resources
The following tutorials explain how to perform other common tasks in PySpark:
PySpark: How to Select Columns by Index in DataFrame
PySpark: How to Select Rows Based on Column Values
PySpark: How to Find Unique Values in a Column