How to Find and Remove Duplicate Data in R

Duplicate data is one of the most common problems you will encounter when working with datasets. This tutorial explains how to find and remove duplicate rows in R using three approaches: base R, dplyr, and data.table.

Finding & Removing Duplicate Data Using Base R

Suppose we have the following data frame with some duplicate rows:

#create fake dataset with two columns: "team" and "score"
team <- c(rep("A", 3), rep("B", 3), rep("C",2))
score <- c(1,1,2,4,1,1,2,2)
data <- data.frame(team, score)

#view data
data

# team score
#1  A    1
#2  A    1
#3  A    2
#4  B    4
#5  B    1
#6  B    1
#7  C    2
#8  C    2 

Using the built-in R function duplicated(), we can check whether each row is identical to an earlier row. The first occurrence of a row is marked FALSE and every later copy is marked TRUE:

#find if each row is a duplicate of a previous row in the dataset
duplicated(data)

#[1] FALSE TRUE FALSE FALSE FALSE TRUE FALSE TRUE

To return only the duplicate rows (the second and later copies of each repeated row), we can use that logical vector to subset the data frame:

#return data frame with only duplicate rows included
data[duplicated(data), ]

# team score
#2 A    1
#6 B    1
#8 C    2
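Because duplicated() skips the first occurrence, the subset above shows only the later copies. If you want every row that takes part in a duplication, first occurrences included, one common approach is to combine a forward scan with a backward scan via the fromLast argument of duplicated(). The sketch below rebuilds the same example data; the name all_dups is just an illustrative choice.

```r
#same example data as above
team <- c(rep("A", 3), rep("B", 3), rep("C", 2))
score <- c(1, 1, 2, 4, 1, 1, 2, 2)
data <- data.frame(team, score)

#TRUE for every row that has at least one identical copy elsewhere:
#the forward scan flags later copies, the backward scan flags earlier ones
all_dups <- duplicated(data) | duplicated(data, fromLast = TRUE)

#return every row involved in a duplication (rows 1, 2, 5, 6, 7 and 8 here)
data[all_dups, ]
```

This is useful when you need to inspect the full set of repeated records before deciding which copy to keep.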

If we would like to return the data frame with only the non-duplicate rows (i.e. just the unique rows), we can use the following syntax:

#return data frame with only non-duplicate rows included
data[!duplicated(data), ]

# team score
#1 A    1
#3 A    2
#4 B    4
#5 B    1
#7 C    2
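Two related base R idioms are worth knowing. unique() applied to a data frame drops the later copies in one call, and passing a single column to duplicated() checks for duplicates on that column only. The following is a minimal sketch on the same example data; the names unique_rows and first_per_team are illustrative.

```r
#same example data as above
team <- c(rep("A", 3), rep("B", 3), rep("C", 2))
score <- c(1, 1, 2, 4, 1, 1, 2, 2)
data <- data.frame(team, score)

#unique() on a data frame keeps the first occurrence of each row,
#giving the same result as data[!duplicated(data), ]
unique_rows <- unique(data)

#check for duplicates on the "team" column only,
#keeping the first row seen for each team
first_per_team <- data[!duplicated(data$team), ]
```

Note that the original row names are preserved in both results, so they will not run consecutively from 1.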

Finding & Removing Duplicate Data Using dplyr

We can also find and remove duplicate data using the distinct() function in the dplyr package in R.

#load dplyr package (provides distinct() and the %>% pipe)
library(dplyr)

#create fake dataset with two columns: "team" and "score"
team <- c(rep("A", 3), rep("B", 3), rep("C", 2))
score <- c(1, 1, 2, 4, 1, 1, 2, 2)
data <- data.frame(team, score)

#find all distinct rows in data frame
data %>% distinct()

# team score
#1 A    1
#2 A    2
#3 B    4
#4 B    1
#5 C    2

#find all distinct values of one specific column
data %>% distinct(team)

# team
#1 A
#2 B
#3 C

#keep the first row for each distinct value of one specific column, along with all other columns
data %>% distinct(team, .keep_all = TRUE)

# team score
#1 A    1
#2 B    4
#3 C    2
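distinct() removes duplicates but does not report how often each row occurred. One way to find the repeated combinations with dplyr is to count each combination with count() and keep those that appear more than once. This is a sketch on the same example data, assuming dplyr is installed; dup_counts is an illustrative name.

```r
library(dplyr)

#same example data as above
team <- c(rep("A", 3), rep("B", 3), rep("C", 2))
score <- c(1, 1, 2, 4, 1, 1, 2, 2)
data <- data.frame(team, score)

#count how many times each team/score combination occurs,
#then keep only the combinations that occur more than once
dup_counts <- data %>%
  count(team, score) %>%
  filter(n > 1)

dup_counts
```

count() stores the tally in a column named n by default, which is why the filter refers to n.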

Finding & Removing Duplicate Data Using data.table

Yet another way to find and remove duplicate data is to use the duplicated() and unique() methods that the data.table package provides for data tables. Both accept a by argument that specifies which columns to compare.

#install data.table (if not already installed), then load it
if(!requireNamespace("data.table", quietly = TRUE)){install.packages("data.table")}
library(data.table)

#create fake data table with two columns: "id" and "val"
DT <- data.table(id = c(1,1,1,2,2,2),
                 val = c(10,20,30,10,20,30))
DT

#  id val
#1: 1 10
#2: 1 20
#3: 1 30
#4: 2 10
#5: 2 20
#6: 2 30

#find if each row is a duplicate of a previous row in the dataset by column "id"
duplicated(DT, by = "id")

#[1] FALSE TRUE TRUE FALSE TRUE TRUE

#keep only the first row for each value of "id"
unique(DT, by = "id")

#  id val
#1: 1 10
#2: 2 10
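The data.table method for unique() also accepts fromLast, so the last row per group can be kept instead of the first, and the .N symbol gives a quick count of rows per group. The sketch below reuses the same example table and assumes data.table is installed; last_per_id and rows_per_id are illustrative names.

```r
library(data.table)

#same example data table as above
DT <- data.table(id = c(1, 1, 1, 2, 2, 2),
                 val = c(10, 20, 30, 10, 20, 30))

#keep the last row for each id instead of the first
last_per_id <- unique(DT, by = "id", fromLast = TRUE)

#count how many rows share each id; a count above 1 means the id is repeated
rows_per_id <- DT[, .N, by = id]
```

Keeping the last row is often what you want when later records supersede earlier ones, for example when rows are appended in chronological order.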
