How to Use the union() Function in dplyr


Often you may want to return all rows that exist in either one of two data frames in R.

The easiest way to do so is to use the union() function from the dplyr package, which is designed for exactly this task.

The union() function uses the following basic syntax:

union(x, y)

where:

  • x: The name of the first data frame
  • y: The name of the second data frame
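One practical point about x and y: dplyr's set operations expect both data frames to have the same columns. The sketch below uses two hypothetical data frames (x and y, with made-up values) whose column names differ, and assumes a reasonably recent version of dplyr, where union() raises an error in that situation rather than combining the rows.

```r
library(dplyr)

#hypothetical data frames whose second column names differ (points vs. score)
x <- data.frame(team=c('A', 'B'), points=c(10, 20))
y <- data.frame(team=c('C', 'D'), score=c(30, 40))

#tryCatch() captures the error object instead of stopping the script
result <- tryCatch(union(x, y), error=function(e) e)

#check whether union() raised an error
inherits(result, 'error')
```

If the columns do not line up, renaming them so that both data frames match before calling union() avoids the error.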

Note that this function returns a data frame.

Also note that duplicate rows are removed, so each distinct row appears only once in the result.

If you would like to keep duplicate rows, you can use the union_all() function instead, which uses the same syntax but returns every row from both data frames, duplicates included.

Choose whichever function matches your end goal.
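The difference between the two functions can be seen in a minimal sketch. The data frames a and b below are hypothetical toy data, not part of the example that follows; the row (B, 20) appears in both, and the row (A, 10) appears twice in a.

```r
library(dplyr)

#hypothetical toy data frames with overlapping rows
a <- data.frame(team=c('A', 'A', 'B'), points=c(10, 10, 20))
b <- data.frame(team=c('B', 'C'), points=c(20, 30))

#union() keeps each distinct row once
n_union <- nrow(union(a, b))

#union_all() keeps every row, duplicates included
n_union_all <- nrow(union_all(a, b))

n_union      #3
n_union_all  #5

#both functions return a data frame
is.data.frame(union(a, b))  #TRUE
```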

The following example shows how to use the union() function from the dplyr package in practice.

Note: Before using the union() function, you may need to first install the dplyr package by using the following syntax:

install.packages('dplyr')

Once the dplyr package is installed, you can use the union() function.
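If you are not sure whether dplyr is already installed, one common pattern is to install it only when it is missing. This is a sketch of that pattern using base R's requireNamespace(), not a required step:

```r
#install dplyr only if it is not already available
if (!requireNamespace('dplyr', quietly=TRUE)) {
  install.packages('dplyr')
}

#load the package
library(dplyr)
```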

Example: How to Use the union() Function in dplyr

Suppose we create the following two data frames named df1 and df2:

#create first data frame
df1 <- data.frame(team=c('A', 'A', 'A', 'A', 'B', 'B', 'B', 'B'),
                  points=c(14, 14, 19, 25, 40, 34, 38, 17))

df1

  team points
1    A     14
2    A     14
3    A     19
4    A     25
5    B     40
6    B     34
7    B     38
8    B     17

#create second data frame
df2 <- data.frame(team=c('C', 'C', 'D', 'D', 'D', 'E', 'E', 'E'),
                  points=c(14, 10, 11, 15, 10, 32, 28, 27))

df2

  team points
1    C     14
2    C     10
3    D     11
4    D     15
5    D     10
6    E     32
7    E     28
8    E     27

Suppose that we would like to return a single data frame that contains all rows that occur in either data frame.

We can use the union() function from the dplyr package to do so:

library(dplyr)

#return all rows that occur in either data frame
df_all <- union(df1, df2)

#view resulting data frame
df_all

   team points
1     A     14
2     A     19
3     A     25
4     B     40
5     B     34
6     B     38
7     B     17
8     C     14
9     C     10
10    D     11
11    D     15
12    D     10
13    E     32
14    E     28
15    E     27

Notice that the new data frame named df_all contains all rows that occur in either data frame.

Note that the first data frame contained 8 total rows, the second data frame contained 8 total rows, and the final resulting data frame contains 15 total rows.

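Every row from both data frames is returned except the second row of the first data frame (team A, 14 points), which duplicates the first row and is therefore dropped. The row for team C with 14 points is kept, because a row counts as a duplicate only when every column matches.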
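These counts can be checked directly. The sketch below recreates df1 and df2 from the example and uses base R's duplicated() to locate the repeated row; stacking the rows with rbind() first is just one way to do the check.

```r
library(dplyr)

#recreate the two data frames from the example
df1 <- data.frame(team=c('A', 'A', 'A', 'A', 'B', 'B', 'B', 'B'),
                  points=c(14, 14, 19, 25, 40, 34, 38, 17))
df2 <- data.frame(team=c('C', 'C', 'D', 'D', 'D', 'E', 'E', 'E'),
                  points=c(14, 10, 11, 15, 10, 32, 28, 27))

df_all <- union(df1, df2)

#total input rows vs. rows after duplicates are dropped
nrow(df1) + nrow(df2)  #16
nrow(df_all)           #15

#position of the row that repeats an earlier row in the stacked data
dup_position <- which(duplicated(rbind(df1, df2)))
dup_position  #2
```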

If you would like to keep duplicate rows in the resulting data frame, you could use the union_all() function instead. With these two data frames it would return all 16 rows, including both copies of the row for team A with 14 points.
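As a sketch of what union_all() returns for the same data, the code below recreates df1 and df2 and then counts how often each row occurs. The names df_all_dup and dups are chosen here for illustration only.

```r
library(dplyr)

#recreate the two data frames from the example
df1 <- data.frame(team=c('A', 'A', 'A', 'A', 'B', 'B', 'B', 'B'),
                  points=c(14, 14, 19, 25, 40, 34, 38, 17))
df2 <- data.frame(team=c('C', 'C', 'D', 'D', 'D', 'E', 'E', 'E'),
                  points=c(14, 10, 11, 15, 10, 32, 28, 27))

#return all rows from both data frames, including duplicates
df_all_dup <- union_all(df1, df2)

nrow(df_all_dup)  #16

#list the rows that occur more than once
dups <- df_all_dup %>%
  count(team, points) %>%
  filter(n > 1)

dups
```

Here dups contains a single row, for team A with 14 points, with a count of 2.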

Note: You can find the complete documentation for the union() function in the official dplyr package reference.

Additional Resources

The following tutorials explain how to perform other common tasks in R:

How to Use slice_min() in dplyr
How to Use the pull() Function in dplyr
How to Use top_n() in dplyr
How to Rename Columns Using dplyr
