You can use the following syntax to get the last row from a PySpark DataFrame:
from pyspark.sql.functions import *

#get last row of DataFrame
last_row = df.withColumn('id', monotonically_increasing_id())\
             .select(max(struct('id', *df.columns)).alias('x'))\
             .select(col('x.*')).drop('id')
The following example shows how to use this syntax in practice.
Example: How to Get Last Row from PySpark DataFrame
Suppose we have the following PySpark DataFrame that contains information about basketball players on various teams:
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

#define data
data = [['A', 'East', 11, 4],
        ['A', 'East', 8, 9],
        ['A', 'East', 10, 3],
        ['B', 'West', 6, 12],
        ['B', 'West', 6, 4],
        ['C', 'East', 5, 2]]

#define column names
columns = ['team', 'conference', 'points', 'assists']

#create dataframe using data and column names
df = spark.createDataFrame(data, columns)

#view dataframe
df.show()

+----+----------+------+-------+
|team|conference|points|assists|
+----+----------+------+-------+
|   A|      East|    11|      4|
|   A|      East|     8|      9|
|   A|      East|    10|      3|
|   B|      West|     6|     12|
|   B|      West|     6|      4|
|   C|      East|     5|      2|
+----+----------+------+-------+
Suppose we would like to get the last row from the DataFrame.
We can use the following syntax to do so:
from pyspark.sql.functions import *

#get last row of DataFrame
last_row = df.withColumn('id', monotonically_increasing_id())\
             .select(max(struct('id', *df.columns)).alias('x'))\
             .select(col('x.*')).drop('id')

#view last row
last_row.show()

+----+----------+------+-------+
|team|conference|points|assists|
+----+----------+------+-------+
|   C|      East|     5|      2|
+----+----------+------+-------+
We have successfully extracted only the last row from the DataFrame.
Here is how this syntax worked in a nutshell:
- First, we used the monotonically_increasing_id function to add a new column called id that contains monotonically increasing values.
- Next, we used the max function on a struct of the id column and the original columns to select the row with the largest id value, which corresponds to the last row of the DataFrame.
- Lastly, we dropped the id column from the DataFrame.
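The reason the max of a struct picks out the last row is that Spark compares structs field by field, so the id field (listed first) decides the ordering. The same idea can be sketched in plain Python, where tuples also compare element by element; this is only an illustration of the logic, not PySpark code:

```python
#plain-Python sketch of the max-of-struct idea: tuples, like Spark structs,
#compare element by element, so the first field (the id) decides the ordering
rows = [('A', 'East', 11, 4),
        ('A', 'East', 8, 9),
        ('A', 'East', 10, 3),
        ('B', 'West', 6, 12),
        ('B', 'West', 6, 4),
        ('C', 'East', 5, 2)]

#attach an increasing id as the first tuple field,
#playing the role of monotonically_increasing_id
with_id = [(i, *row) for i, row in enumerate(rows)]

#max picks the tuple with the largest id, i.e. the last row
last = max(with_id)

#drop the id field again, mirroring the final drop('id') step
last_row = last[1:]
print(last_row)  #('C', 'East', 5, 2)
```

In the real DataFrame version, Spark performs this same comparison in a distributed way, which is why the id column must be attached before taking the max.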
The end result is that we were able to get only the last row from the DataFrame.
Note: You can find the complete documentation for the monotonically_increasing_id function here.
Additional Resources
The following tutorials explain how to perform other common tasks in PySpark:
PySpark: How to Select First Row of Each Group
PySpark: How to Select Rows Based on Column Values
PySpark: How to Find Unique Values in a Column