How to Find Minimum Date in PySpark (With Examples)


You can use the following methods to find the minimum date (i.e. the earliest date) in a column of a PySpark DataFrame:

Method 1: Find Minimum Date in One Column

from pyspark.sql import functions as F

#find minimum date in sales_date column
df.select(F.min('sales_date').alias('min_date')).show()

Method 2: Find Minimum Date in One Column, Grouped by Another Column

from pyspark.sql import functions as F

#find minimum date in sales_date column, grouped by employee column
df.groupBy('employee').agg(F.min('sales_date').alias('min_date')).show()

The following examples show how to use each method in practice with the following PySpark DataFrame, which contains information about sales made by various employees at a company:

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

#define data
data = [['A', '2020-10-25', 15],
        ['A', '2013-10-11', 24],
        ['A', '2015-10-17', 31],
        ['B', '2022-12-21', 27],
        ['B', '2021-04-14', 40],
        ['B', '2021-06-26', 34]] 
  
#define column names
columns = ['employee', 'sales_date', 'total_sales'] 
  
#create dataframe using data and column names
df = spark.createDataFrame(data, columns) 
  
#view dataframe
df.show()

+--------+----------+-----------+
|employee|sales_date|total_sales|
+--------+----------+-----------+
|       A|2020-10-25|         15|
|       A|2013-10-11|         24|
|       A|2015-10-17|         31|
|       B|2022-12-21|         27|
|       B|2021-04-14|         40|
|       B|2021-06-26|         34|
+--------+----------+-----------+
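
Note: The sales_date column in this DataFrame is stored as a string rather than a date. Because the dates use the yyyy-MM-dd format, they sort alphabetically the same way they sort chronologically, so F.min still returns the earliest date. If you would prefer to work with a true date column, one option (shown here as a sketch) is to cast it with the to_date function:

from pyspark.sql import functions as F

#convert sales_date from string to a true date column
df = df.withColumn('sales_date', F.to_date('sales_date'))

The examples below work the same way whether sales_date is a string in this format or a date.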

Example 1: Find Minimum Date in One Column

We can use the following syntax to find the minimum date (i.e. the earliest date) in the sales_date column:

from pyspark.sql import functions as F

#find minimum date in sales_date column
df.select(F.min('sales_date').alias('min_date')).show()

+----------+
|  min_date|
+----------+
|2013-10-11|
+----------+

We can see that the minimum date in the sales_date column is 2013-10-11.

Note: We used the alias function to rename the column to min_date in the resulting DataFrame.
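
If you need the minimum date as a plain Python value rather than a DataFrame (for example, to use it in a later filter), one way to get it (a minimal sketch) is to grab the first row of the aggregation:

from pyspark.sql import functions as F

#extract the minimum date as a Python value
min_date = df.select(F.min('sales_date').alias('min_date')).first()['min_date']

#use the value, e.g. to filter for the row with the earliest sale
df.filter(df.sales_date == min_date).show()

Here first() returns a Row object, and indexing it by the alias pulls out the value itself.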

Example 2: Find Minimum Date in One Column, Grouped by Another Column

We can use the following syntax to find the minimum date in the sales_date column, grouped by the values in the employee column:

from pyspark.sql import functions as F

#find minimum date in sales_date column, grouped by employee column
df.groupBy('employee').agg(F.min('sales_date').alias('min_date')).show()

+--------+----------+
|employee|  min_date|
+--------+----------+
|       A|2013-10-11|
|       B|2021-04-14|
+--------+----------+

The resulting DataFrame shows the minimum (i.e. earliest) sales date for each unique employee.
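
If you also need the other columns from each employee's earliest sale (not just the date itself), one common approach is a window function. The following is a sketch that ranks each employee's rows by date and keeps only the earliest one:

from pyspark.sql import functions as F
from pyspark.sql.window import Window

#rank each employee's sales by date, earliest first
w = Window.partitionBy('employee').orderBy('sales_date')

#keep only the earliest row per employee
df.withColumn('rn', F.row_number().over(w)).filter('rn = 1').drop('rn').show()

This returns one full row per employee: the row whose sales_date matches the minimum date found above.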

Additional Resources

The following tutorials explain how to perform other common tasks in PySpark:

How to Add Days to a Date Column in PySpark
How to Convert String to Date in PySpark
How to Convert Timestamp to Date in PySpark
