How to Use Merge() to Easily Join Two Data Frames in R

R Guides

One of the most common functions you may want to perform when working with data is to merge two data frames together into one data frame. This is often necessary because two data frames may contain similar, but not identical, information and you would like all of the information to be contained within one data frame.

One of the easiest ways to merge two data frames in R is by using the merge() function, which uses the following syntax:

merge(x, y, by = column_name(s), by.x = by, by.y = by, all.x = all, all.y = all)

  • x, y: the two data frames to be coerced into one.
  • by, by.x, by.y: the column names to be used for merging.
  • all.x: If TRUE, a left join will be performed; all rows from data frame will be kept.
  • all.y: If TRUE, a right join will be performed; all rows from data frame will be kept.

The following examples illustrate how to use the merge() function to perform the four most common types of joins:

  • Inner join: return only the rows in which the left data frame have matching keys in the right data frame.
  • Left join: return all rows from the left data frame and any rows with matching keys from the right data frame.
  • Right join: return all rows from the right data frame and any rows with matching keys from the left data frame.
  • Outer join: return all rows from both data frames, join records from the left data frame which have matching keys in the right data frame.

Using Merge() to Perform Joins in R

For each of the following examples, we will use two data frames that contain the following information:

data1: basketball team name (team), average points per game (points), average assists per game (assists)

data2: basketball team name (team), average rebounds per game (rebounds), average turnovers per game (turnovers)

#create and view first data frame
data1 <- data.frame(team = LETTERS[1:5],
                    points = runif(5, 90, 105),
                    assists = runif(5, 20, 30))
data1

#  team    points  assists
#1    A 101.57688 23.68043
#2    B  91.79446 24.32832
#3    C  90.84957 28.39453
#4    D  94.09115 22.39954
#5    E  94.33483 27.63217
#create and view second data frame
data2 <- data.frame(team = LETTERS[3:7],
                    rebounds = runif(5, 30, 45),
                    turnovers = runif(5, 10, 20))
data2

#  team rebounds turnovers
#1    C 40.00582  19.02525
#2    D 37.97121  12.17256
#3    E 36.93537  10.60345
#4    F 30.87085  15.71121
#5    G 30.35308  14.54386

Inner Join

An inner join returns only the rows in which the left data frame have matching keys in the right data frame. The following code illustrates how to perform an inner join on the two data frames using the “team” column:

merge(data1, data2, by = 'team')

#  team   points  assists rebounds turnovers
#1    C 90.84957 28.39453 40.67105  12.93467
#2    D 94.09115 22.39954 39.86002  12.60839
#3    E 94.33483 27.63217 41.07858  19.19623

Left Join

left join returns all rows from the left data frame and any rows with matching keys from the right data frame. The following code illustrates how to perform a left join on the two data frames using the “team” column:

merge(data1, data2, by = 'team', all.x = TRUE)

#  team    points  assists rebounds turnovers
#1    A  96.58371 24.37455       NA        NA
#2    B  92.17977 25.27406       NA        NA
#3    C 102.89507 24.33644 44.48888  16.60444
#4    D  94.98907 24.88746 35.93058  10.71856
#5    E  95.15227 21.96998 42.70577  14.49243

Right Join

right join returns all rows from the right data frame and any rows with matching keys from the left data frame. The following code illustrates how to perform a right join on the two data frames using the “team” column:

merge(data1, data2, by = 'team', all.y = TRUE)

#  team   points  assists rebounds turnovers
#1    C  91.6296 28.59538 41.69953  12.89011
#2    D 101.8981 20.09634 43.39780  11.79409
#3    E 102.9434 24.69419 30.94638  17.91971
#4    F       NA       NA 30.47061  13.84032
#5    G       NA       NA 41.78819  16.31007

Outer Join

An outer join returns all rows from both data frames, joining records from the left data frame which have matching keys in the right data frame.. The following code illustrates how to perform an outer join on the two data frames:

merge(data1, data2, all = TRUE)

#  team    points  assists rebounds turnovers
#1    A  91.17328 24.55748       NA        NA
#2    B 103.38219 28.83283       NA        NA
#3    C  95.03470 24.85125 32.48333  19.50307
#4    D 100.93490 20.29873 42.66343  18.99066
#5    E 103.43068 24.30264 42.55551  19.44984
#6    F        NA       NA 38.24036  18.53101
#7    G        NA       NA 32.42657  12.44713

Performing a Merge on Data Frames with Different Columns Names

The merge() function can also join two data frames that don’t share a common column name. The following code illustrates how to join two data frames that share a similar column team, but have slightly different column names:

#create first data frame
data1 <- data.frame(team = LETTERS[1:5],
                    points = runif(5, 90, 105),
                    assists = runif(5, 20, 30))

#create second data frame
data2 <- data.frame(team_name = LETTERS[3:7],
                    rebounds = runif(5, 30, 45),
                    turnovers = runif(5, 10, 20))

#inner join the two data frames, specifying which column name to join on for each
#data frame
merge(data1, data2, by.x = 'team', by.y = 'team_name')

#  team    points  assists rebounds turnovers
#1    C  91.65900 25.11648 44.73535  19.10941
#2    D 103.32704 21.20736 33.43049  18.22008
#3    E  90.06115 25.73976 35.92924  10.50431

Performing a Merge on Data Frames Using Multiple Columns

You can also use the merge() function to join two data frames based on multiple columns.The following code illustrates how to join two data frames based on team and city:

#create data frame 1
data1 <- data.frame(team = LETTERS[1:5],
                    city = LETTERS[1:5],
                    points = runif(5, 90, 105),
                    assists = runif(5, 20, 30))

#create data frame 2
data2 <- data.frame(TEAM = LETTERS[3:7],
                    CITY = LETTERS[3:7],
                    rebounds = runif(5, 30, 45),
                    turnovers = runif(5, 10, 20))

#inner join the two data frames, specifying multiple column names to join
#on for each data frame
merge(data1, data2, by.x = c('team', 'city'), by.y = c('TEAM', 'CITY'))

#  team city   points  assists rebounds turnovers
#1    C    C 95.34430 21.75860 37.74131  12.53848
#2    D    D 93.68527 28.28567 38.52356  14.18106
#3    E    E 91.53751 25.21560 30.47553  11.73712

 

Leave a Reply

Your email address will not be published. Required fields are marked *