You can use the following methods to find the max date (i.e. the latest date) in a column of a PySpark DataFrame:
Method 1: Find Max Date in One Column
from pyspark.sql import functions as F
#find max date in sales_date column
df.select(F.max('sales_date').alias('max_date')).show()
Method 2: Find Max Date in One Column, Grouped by Another Column
from pyspark.sql import functions as F
#find max date in sales_date column, grouped by employee column
df.groupBy('employee').agg(F.max('sales_date').alias('max_date')).show()
The following examples show how to use each method in practice with the following PySpark DataFrame that contains information about sales made by various employees at some company:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

#define data
data = [['A', '2020-10-25', 15],
        ['A', '2013-10-11', 24],
        ['A', '2015-10-17', 31],
        ['B', '2022-12-21', 27],
        ['B', '2021-04-14', 40],
        ['B', '2021-06-26', 34]]

#define column names
columns = ['employee', 'sales_date', 'total_sales']

#create dataframe using data and column names
df = spark.createDataFrame(data, columns)

#view dataframe
df.show()

+--------+----------+-----------+
|employee|sales_date|total_sales|
+--------+----------+-----------+
|       A|2020-10-25|         15|
|       A|2013-10-11|         24|
|       A|2015-10-17|         31|
|       B|2022-12-21|         27|
|       B|2021-04-14|         40|
|       B|2021-06-26|         34|
+--------+----------+-----------+
Example 1: Find Max Date in One Column
We can use the following syntax to find the max date (i.e. the latest date) in the sales_date column:
from pyspark.sql import functions as F
#find max date in sales_date column
df.select(F.max('sales_date').alias('max_date')).show()
+----------+
| max_date|
+----------+
|2022-12-21|
+----------+
We can see that the max date in the sales_date column is 2022-12-21.
Note: We used the alias function to rename the column to max_date in the resulting DataFrame.
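Note that sales_date is stored as a string in this DataFrame, and F.max happens to return the latest date only because 'yyyy-MM-dd' strings sort chronologically. If your dates use a different format, or you want the result as a plain Python value instead of a DataFrame, one option is the following sketch, which first casts the column to a true date type (the 'yyyy-MM-dd' format passed to to_date is an assumption about how your dates are stored):

from pyspark.sql import functions as F

#convert sales_date from string to a true date type
#(assumes dates are formatted as 'yyyy-MM-dd')
df_dates = df.withColumn('sales_date', F.to_date('sales_date', 'yyyy-MM-dd'))

#pull the max date out as a Python datetime.date value
max_date = df_dates.select(F.max('sales_date')).first()[0]

print(max_date)  #2022-12-21

Once the column is a true date type, comparisons and date arithmetic behave correctly regardless of string formatting.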
Example 2: Find Max Date in One Column, Grouped by Another Column
We can use the following syntax to find the max date in the sales_date column, grouped by the values in the employee column:
from pyspark.sql import functions as F
#find max date in sales_date column, grouped by employee column
df.groupBy('employee').agg(F.max('sales_date').alias('max_date')).show()
+--------+----------+
|employee| max_date|
+--------+----------+
| A|2020-10-25|
| B|2022-12-21|
+--------+----------+
The resulting DataFrame shows the max sales date (i.e. latest date) for each unique employee in the DataFrame.
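If you also want the other columns from the row that contains each employee's max date (not just the date itself), one common approach is a window function. Here is a minimal sketch; it keeps exactly one row per employee, so if two rows tie on the max date only one of them is returned:

from pyspark.sql import functions as F
from pyspark.sql.window import Window

#rank rows within each employee, latest sales_date first
w = Window.partitionBy('employee').orderBy(F.desc('sales_date'))

#keep only the most recent row for each employee
df.withColumn('rn', F.row_number().over(w)) \
  .filter(F.col('rn') == 1) \
  .drop('rn') \
  .show()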
Note: You can also include multiple column names in the groupBy function to group by multiple columns if you’d like.
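For instance, if the DataFrame also contained a hypothetical region column (not present in the example data above), grouping by both columns might look like this:

from pyspark.sql import functions as F

#find max date in sales_date column, grouped by employee and region
df.groupBy('employee', 'region').agg(F.max('sales_date').alias('max_date')).show()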
Additional Resources
The following tutorials explain how to perform other common tasks in PySpark:
How to Add Days to a Date Column in PySpark
How to Convert String to Date in PySpark
How to Convert Timestamp to Date in PySpark