How to Use the setdiff() Function in dplyr


Often you may want to find all rows in one data frame that do not occur in another data frame in R.

Fortunately, this is easy to do with the setdiff() function from the dplyr package, which is designed for this exact task.

The setdiff() function uses the following basic syntax:

setdiff(x, y)

where:

  • x: The name of the first data frame
  • y: The name of the second data frame

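Note that this function returns a data frame that contains the unique rows of x that do not occur in y. The two data frames must have compatible columns (the same column names and types).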

Also note that a similar function is the union() function, which uses the same syntax and returns all unique rows that occur in either data frame.
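As a quick sketch of how these set operations relate, the code below applies setdiff() and union() to two tiny data frames. The data frames a and b and their values are hypothetical and exist only for illustration. The intersect() function, which is also part of dplyr and returns the rows that occur in both data frames, is included for comparison.

```r
library(dplyr)

# Two small illustrative data frames (hypothetical values)
a <- data.frame(id = c(1, 2, 3))
b <- data.frame(id = c(3, 4))

# Rows in a that do not occur in b
diff_ab <- setdiff(a, b)

# Unique rows that occur in either data frame
union_ab <- union(a, b)

# Rows that occur in both data frames
intersect_ab <- intersect(a, b)

diff_ab
union_ab
intersect_ab
```

All three functions take the same two arguments, so they can be swapped in and out without changing the rest of the code.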

The following example shows how to use the setdiff() function from the dplyr package in practice.

Note: Before using the setdiff() function, you may need to first install the dplyr package by running the following command:

install.packages('dplyr')

Once the dplyr package is installed, you can use the setdiff() function.

Example: How to Use the setdiff() Function in dplyr

Suppose we create the following two data frames named df1 and df2:

#create first data frame
df1 <- data.frame(team=c('A', 'A', 'A', 'A', 'B', 'B', 'B', 'B'),
                  points=c(14, 14, 19, 25, 40, 34, 38, 17))

df1

  team points
1    A     14
2    A     14
3    A     19
4    A     25
5    B     40
6    B     34
7    B     38
8    B     17

#create second data frame
df2 <- data.frame(team=c('A', 'A', 'A', 'A', 'B', 'B', 'B', 'B'),
                  points=c(14, 10, 11, 15, 10, 32, 38, 27))

df2

  team points
1    A     14
2    A     10
3    A     11
4    A     15
5    B     10
6    B     32
7    B     38
8    B     27

Suppose that we would like to find all rows in df1 that do not occur in df2.

We can use the setdiff() function from the dplyr package to do so:

library(dplyr)

#find all rows in df1 that do not occur in df2
df_diff <- setdiff(df1, df2)

#view resulting data frame
df_diff

  team points
1    A     19
2    A     25
3    B     40
4    B     34
5    B     17

Notice that the new data frame named df_diff contains all rows that occur in df1 but do not occur in df2.

From the output we can see that a total of five rows occur in df1 that do not occur in df2.
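Keep in mind that the order of the arguments matters: setdiff(df1, df2) and setdiff(df2, df1) answer different questions. The sketch below re-creates the same two data frames and reverses the arguments to find the rows in df2 that do not occur in df1. The name df_diff_rev is only an illustrative choice.

```r
library(dplyr)

# Re-create the two data frames from the example
df1 <- data.frame(team=c('A', 'A', 'A', 'A', 'B', 'B', 'B', 'B'),
                  points=c(14, 14, 19, 25, 40, 34, 38, 17))

df2 <- data.frame(team=c('A', 'A', 'A', 'A', 'B', 'B', 'B', 'B'),
                  points=c(14, 10, 11, 15, 10, 32, 38, 27))

# Find all rows in df2 that do not occur in df1 (arguments reversed)
df_diff_rev <- setdiff(df2, df1)

# View resulting data frame
df_diff_rev
```

With this data, the reversed call should return six rows rather than five, since only the rows (A, 14) and (B, 38) in df2 also occur in df1.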

If you would simply like to know the number of rows that occur in df1 and not df2, then you can wrap the nrow() function around the setdiff() function to return the number of resulting rows.

Note that the nrow() function is used to return the number of rows in a given data frame.

We can use the following syntax to return the number of rows that occur in df1 and not df2:

library(dplyr)

#return number of rows that occur in df1 and not df2
df_diff_num <- nrow(setdiff(df1, df2))

#view results
df_diff_num

[1] 5

This returns a value of 5, which tells us that there are five rows that occur in df1 that do not occur in df2. This matches the result from the previous example.
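One caveat when counting rows this way: setdiff() treats each data frame as a set, so duplicate rows are collapsed in the result. If duplicates should be counted, one possible alternative is the anti_join() function from dplyr, which keeps every row of the first data frame that has no match in the second. The sketch below uses hypothetical data frames x and y to show the difference. It assumes a reasonably recent version of dplyr.

```r
library(dplyr)

# Hypothetical data: the row (A, 19) appears twice in x
x <- data.frame(team = c('A', 'A', 'B'), points = c(19, 19, 40))
y <- data.frame(team = 'B', points = 40)

# setdiff() collapses the duplicate (A, 19) rows into one
n_setdiff <- nrow(setdiff(x, y))

# anti_join() keeps every non-matching row of x, duplicates included
n_anti <- nrow(anti_join(x, y, by = c('team', 'points')))

n_setdiff
n_anti
```

In the example with df1 and df2, none of the five resulting rows are duplicated, so both approaches would give the same count there.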

Note: You can find the complete documentation for the setdiff() function in the dplyr package reference.

Additional Resources

The following tutorials explain how to perform other common tasks in R:

How to Use slice_min() in dplyr
How to Use the pull() Function in dplyr
How to Use top_n() in dplyr
How to Rename Columns Using dplyr
