How to Remove Duplicate Rows in R


Often you may want to remove duplicate rows from a data frame in R. Fortunately this is easy to do using the distinct() function from the dplyr package.

library(dplyr)

This tutorial shows several examples of how to use this function in practice, based on the following data frame:

#create data frame
df <- data.frame(x = c('a', 'b', 'b', 'b', 'c', 'c', 'c', 'd', 'd', 'e'),
                 y = c(1, 2, 2, 4, 4, 5, 9, 17, 17, 25))

#view data frame
df

   x  y
1  a  1
2  b  2
3  b  2
4  b  4
5  c  4
6  c  5
7  c  9
8  d 17
9  d 17
10 e 25

Example 1: Remove Completely Duplicated Rows

The following code shows how to remove rows that are complete duplicates of other rows:

#display only unique rows
distinct(df)

  x  y
1 a  1
2 b  2
3 b  4
4 c  4
5 c  5
6 c  9
7 d 17
8 e 25

#find total number of rows in original data frame
nrow(df)

[1] 10

#find total number of unique rows
nrow(distinct(df))

[1] 8

#find total number of duplicate rows
nrow(df) - nrow(distinct(df)) 

[1] 2

We can see that 2 duplicate rows were removed from the data frame.
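If you want to see which rows are the duplicates, rather than only counting them, base R's duplicated() function can help. The following is a minimal sketch that does not use dplyr; the names dup_flags, dup_rows, and n_dups are illustrative only:

```r
#re-create the example data frame from this tutorial
df <- data.frame(x = c('a', 'b', 'b', 'b', 'c', 'c', 'c', 'd', 'd', 'e'),
                 y = c(1, 2, 2, 4, 4, 5, 9, 17, 17, 25))

#flag every row that repeats an earlier row (first occurrences are FALSE)
dup_flags <- duplicated(df)

#display only the rows that distinct(df) would drop
dup_rows <- df[dup_flags, ]
dup_rows

#count the duplicate rows
n_dups <- sum(dup_flags)
n_dups
```

Here the flagged rows are rows 3 and 9 of the original data frame, which matches the 2 duplicate rows found above.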

Example 2: Remove Duplicates in One Column

The following code shows how to return only the unique values in one specific column of a data frame. Note that only the named column appears in the result:

#display only unique values in column x
distinct(df, x)

  x
1 a
2 b
3 c
4 d
5 e

#display only unique values in column y
distinct(df, y)

   y
1  1
2  2
3  4
4  5
5  9
6 17
7 25

You can also remove duplicate values in one column and still retain all other columns in the data frame by setting .keep_all = TRUE. For each unique value, only the first row in which it appears is kept:

#display only unique values in column x and retain other columns
distinct(df, x, .keep_all = TRUE)

  x  y
1 a  1
2 b  2
3 c  4
4 d 17
5 e 25

#display only unique values in column y and retain other columns
distinct(df, y, .keep_all = TRUE)

  x  y
1 a  1
2 b  2
3 b  4
4 c  5
5 c  9
6 d 17
7 e 25
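Because distinct() keeps the first row it encounters for each unique value, one way to control which row survives is to sort the data frame first. The sketch below assumes you want the row with the largest y for each value of x; the name largest_y_per_x is illustrative only:

```r
library(dplyr)

#re-create the example data frame from this tutorial
df <- data.frame(x = c('a', 'b', 'b', 'b', 'c', 'c', 'c', 'd', 'd', 'e'),
                 y = c(1, 2, 2, 4, 4, 5, 9, 17, 17, 25))

#sort by y in descending order so the largest y for each x comes first,
#then keep the first row for each unique value of x
largest_y_per_x <- df %>%
  arrange(desc(y)) %>%
  distinct(x, .keep_all = TRUE)

largest_y_per_x
```

Note that the result follows the sorted order rather than the original row order, so 'e' appears first and 'a' last. For 'b' the kept value of y is 4 rather than 2, and for 'c' it is 9 rather than 4.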

You can find the complete documentation for the distinct() function in the dplyr reference documentation.
