Pandas: How to Remove Duplicates but Keep Row with Max Value


You can use the following methods to remove duplicates in a pandas DataFrame but keep the row that contains the max value in a particular column:

Method 1: Remove Duplicates in One Column and Keep Row with Max

df.sort_values('var2', ascending=False).drop_duplicates('var1').sort_index()

Method 2: Remove Duplicates in Multiple Columns and Keep Row with Max

df.sort_values('var3', ascending=False).drop_duplicates(['var1', 'var2']).sort_index()

The following examples show how to use each method in practice.

Example 1: Remove Duplicates in One Column and Keep Row with Max

Suppose we have the following pandas DataFrame that contains information about points scored by basketball players on various teams:

import pandas as pd

#create DataFrame
df = pd.DataFrame({'team': ['A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'C'],
                   'points': [20, 24, 28, 30, 14, 19, 29, 40, 22]})

#view DataFrame
print(df)

  team  points
0    A      20
1    A      24
2    A      28
3    B      30
4    B      14
5    B      19
6    C      29
7    C      40
8    C      22

We can use the following syntax to drop rows with duplicate team names but keep the rows with the max values for points:

#drop duplicate teams but keeps row with max points
df_new = df.sort_values('points', ascending=False).drop_duplicates('team').sort_index()

#view DataFrame
print(df_new)

  team  points
2    A      28
3    B      30
7    C      40

Each row with a duplicate team name has been dropped, but the rows with the max value for points have been kept for each team.

Example 2: Remove Duplicates in Multiple Columns and Keep Row with Max

Suppose we have the following pandas DataFrame:

import pandas as pd

#create DataFrame
df = pd.DataFrame({'team': ['A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'C'],
                   'position': ['G', 'G', 'F', 'G', 'F', 'F', 'G', 'G', 'F'],
                   'points': [20, 24, 28, 30, 14, 19, 29, 40, 22]})

#view DataFrame
print(df)

  team position  points
0    A        G      20
1    A        G      24
2    A        F      28
3    B        G      30
4    B        F      14
5    B        F      19
6    C        G      29
7    C        G      40
8    C        F      22

We can use the following syntax to drop rows with duplicate team and position names but keep the rows with the max values for points:

#drop rows with duplicate team and positions but keeps row with max points
df_new = df.sort_values('points', ascending=False).drop_duplicates(['team', 'position']).sort_index()

#view DataFrame
print(df_new)

  team position  points
1    A        G      24
2    A        F      28
3    B        G      30
5    B        F      19
7    C        G      40
8    C        F      22

Each row with a duplicate team and position name has been dropped, but the rows with the max value for points have been kept for each combination of team and position.

Additional Resources

The following tutorials explain how to perform other common operations in pandas:

How to Drop Duplicate Rows in Pandas
How to Drop Duplicate Columns in Pandas
How to Count Duplicates in Pandas

Leave a Reply

Your email address will not be published.