Often you may want to remove duplicate rows from a data frame in R. Fortunately, this is easy to do using the distinct() function from the dplyr package.
library(dplyr)
This tutorial explains several examples of how to use this function in practice using the following data frame:
#create data frame
df <- data.frame(x = c('a', 'b', 'b', 'b', 'c', 'c', 'c', 'd', 'd', 'e'),
                 y = c(1, 2, 2, 4, 4, 5, 9, 17, 17, 25))

#view data frame
df

   x  y
1  a  1
2  b  2
3  b  2
4  b  4
5  c  4
6  c  5
7  c  9
8  d 17
9  d 17
10 e 25
Example 1: Remove Completely Duplicated Rows
The following code shows how to remove rows that are complete duplicates of other rows:
#display only unique rows
distinct(df)

  x  y
1 a  1
2 b  2
3 b  4
4 c  4
5 c  5
6 c  9
7 d 17
8 e 25

#find total number of rows in original data frame
nrow(df)

[1] 10

#find total number of unique rows
nrow(distinct(df))

[1] 8

#find total number of duplicate rows
nrow(df) - nrow(distinct(df))

[1] 2
We can see that 2 duplicate rows were removed from the data frame.
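Note that distinct() returns a new data frame and leaves the original untouched. The short sketch below illustrates this by saving the result under a new name; df_unique is a name chosen only for this illustration.

```r
library(dplyr)

# same example data frame used throughout this tutorial
df <- data.frame(x = c('a', 'b', 'b', 'b', 'c', 'c', 'c', 'd', 'd', 'e'),
                 y = c(1, 2, 2, 4, 4, 5, 9, 17, 17, 25))

# save the de-duplicated rows under a new (hypothetical) name
df_unique <- df %>% distinct()

# the original data frame still has all 10 rows
nrow(df)

# the new data frame has the 8 unique rows
nrow(df_unique)
```

To overwrite the original instead, assign the result back to df.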
Example 2: Remove Duplicates in One Column
The following code shows how to remove rows that have duplicates in one specific column of a data frame:
#display only unique values in column x
distinct(df, x)

  x
1 a
2 b
3 c
4 d
5 e

#display only unique values in column y
distinct(df, y)

   y
1  1
2  2
3  4
4  5
5  9
6 17
7 25
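If you only need the number of unique values in a column rather than the values themselves, one option is dplyr's n_distinct() function. The following is a minimal sketch using the same data frame:

```r
library(dplyr)

# same example data frame used throughout this tutorial
df <- data.frame(x = c('a', 'b', 'b', 'b', 'c', 'c', 'c', 'd', 'd', 'e'),
                 y = c(1, 2, 2, 4, 4, 5, 9, 17, 17, 25))

# count unique values in column x (5 unique values)
n_distinct(df$x)

# count unique values in column y (7 unique values)
n_distinct(df$y)
```

These counts match the number of rows returned by distinct(df, x) and distinct(df, y).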
You can also remove duplicate values in one column and still retain all other columns in the data frame:
#display only unique values in column x and retain other columns
distinct(df, x, .keep_all = TRUE)

  x  y
1 a  1
2 b  2
3 c  4
4 d 17
5 e 25

#display only unique values in column y and retain other columns
distinct(df, y, .keep_all = TRUE)

  x  y
1 a  1
2 b  2
3 b  4
4 c  5
5 c  9
6 d 17
7 e 25
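When .keep_all = TRUE is used, the row kept for each value is the first one in which that value appears. One way to check this, sketched below, is to compare the result with a base R subset built from duplicated(); the names kept and first_rows are chosen only for this illustration.

```r
library(dplyr)

# same example data frame used throughout this tutorial
df <- data.frame(x = c('a', 'b', 'b', 'b', 'c', 'c', 'c', 'd', 'd', 'e'),
                 y = c(1, 2, 2, 4, 4, 5, 9, 17, 17, 25))

# dplyr: one row per unique value of x, all columns retained
kept <- distinct(df, x, .keep_all = TRUE)

# base R: keep only rows where the value of x has not appeared earlier
first_rows <- df[!duplicated(df$x), ]

# both approaches select the same x and y values
identical(kept$x, first_rows$x)
identical(kept$y, first_rows$y)
```

Both comparisons return TRUE for this data frame, which means the y value shown for each x is the one from its first occurrence.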
You can find the complete documentation for the distinct() function on the dplyr reference site.