Pandas: How to Drop Duplicates and Keep Latest


You can use the following basic syntax to drop duplicates from a pandas DataFrame but keep the row with the latest timestamp:

df = df.sort_values('time').drop_duplicates(['item'], keep='last')

This particular example drops rows with duplicate values in the item column, but keeps the row with the latest timestamp in the time column.

The following example shows how to use this syntax in practice.

Example: Drop Duplicates and Keep Latest in Pandas

Suppose we have the following pandas DataFrame that contains information about the sales of various fruits at some grocery store:

import pandas as pd

#create DataFrame
df = pd.DataFrame({'time': ['2022-10-25 04:00:00', '2022-10-25 11:55:12',
                            '2022-10-26 02:00:00', '2022-10-27 10:30:00',
                            '2022-10-27 14:25:00', '2022-10-28 01:15:27'],
                   'item': ['apple', 'orange', 'apple', 'mango', 'mango', 'kiwi'],
                   'sales': [18, 22, 19, 14, 14, 11]})

#convert time column to datetime dtype
df['time'] = pd.to_datetime(df['time'])

#view DataFrame
print(df)

                 time    item  sales
0 2022-10-25 04:00:00   apple     18
1 2022-10-25 11:55:12  orange     22
2 2022-10-26 02:00:00   apple     19
3 2022-10-27 10:30:00   mango     14
4 2022-10-27 14:25:00   mango     14
5 2022-10-28 01:15:27    kiwi     11

Suppose we would like to remove all rows with duplicate values in the item column but keep the row with the latest timestamp in the time column.

We can use the following syntax to do so:

#drop duplicate rows based on value in 'item' column but keep latest timestamp
df = df.sort_values('time').drop_duplicates(['item'], keep='last')

#view updated DataFrame
print(df)

                 time    item  sales
1 2022-10-25 11:55:12  orange     22
2 2022-10-26 02:00:00   apple     19
4 2022-10-27 14:25:00   mango     14
5 2022-10-28 01:15:27    kiwi     11

Notice that the item column had multiple rows with ‘apple’ and ‘mango’ as values.

Each of these duplicate rows were removed but the row with the latest timestamp in the time column was kept.

Note: If you would like to remove rows based on duplicate values in multiple columns, simply include multiple column names in the first argument of the drop_duplicates() function.

Additional Resources

The following tutorials explain how to perform other common tasks in pandas:

How to Convert Datetime to Date in Pandas
How to Convert Columns to DateTime in Pandas
How to Sort a Pandas DataFrame by Date

Leave a Reply

Your email address will not be published. Required fields are marked *