dplyr: How to Perform Left Join and Define NA Values


When you perform a left join in dplyr, you are able to return all rows from one particular data frame and then only the rows from a second data frame that have matching values in a particular column.

When matching values cannot be found in the second data frame, NA values are returned in the final joined data frame.

Often you may want to change these NA values to a different value.

You can use the coalesce() function with the following syntax to do so:

library(dplyr)

final_df <- left_join(df_A, df_B, by=c('team')) %>%
  mutate_if(is.numeric,coalesce,0)

This particular example will perform a left join on the data frames called df_A and df_B, joining on matching values in the team column.

If any NA values are returned from the join, then they are replaced by the value 0 instead.

The following example shows how to use this syntax in practice.

Note: Before using the left_join() function you will need to first make sure that the dplyr package is installed on your computer. You can use the following syntax to install the package:

install.package('dplyr')

Once dplyr has successfully been installed, you can use the left_join() function without encountering any package errors.

Example: Perform Left Join and Define NA Values in dplyr

Suppose we have the following two data frames in R:

#create first data frame
df_A <- data.frame(team=c('A', 'B', 'C', 'D', 'E'),
                   points=c(22, 25, 19, 14, 38))

df_A

  team points
1    A     22
2    B     25
3    C     19
4    D     14
5    E     38

#create second data frame
df_B <- data.frame(team=c('A', 'C', 'D', 'F', 'G'),
                   rebounds=c(14, 8, 8, 6, 9))

df_B

  team rebounds
1    A       14
2    C        8
3    D        8
4    F        6
5    G        9

We can use the following syntax in dplyr to perform a left join based on matching values in the team columns:

library(dplyr)

#perform left join
final_df <- left_join(df_A, df_B, by=c('team'))

#view final data frame
final_df

  team points rebounds
1    A     22       14
2    B     25       NA
3    C     19        8
4    D     14        8
5    E     38       NA

The resulting data frame contains all rows from df_A and only the rows in df_B where the team values matched.

Note that there were two rows in the df_A data frame that did not have matching team values in the df_B data frame. Each of these rows in the final data frame have a value of NA in the rebounds column.

Suppose that we would like to perform the same left join but return a value of 0 instead of NA in the final joined data frame.

We can use the following syntax to do so:

library(dplyr)

#perform left join and replace any NA values with 0
final_df <- left_join(df_A, df_B, by=c('team')) %>%
  mutate_if(is.numeric,coalesce,0)

  team points rebounds
1    A     22       14
2    B     25        0
3    C     19        8
4    D     14        8
5    E     38        0

Notice that this returns the same results as the previous example except any NA values in the rebounds column have now been replaced with 0.

Feel free to replace 0 with any other number that you would like to use.

Note: You can find the complete documentation for the left_join() function in dplyr here.

Additional Resources

The following tutorials explain how to perform other common operations in R:

How to Do a Left Join in R
How to Do a Right Join in R
How to Do an Inner Join in R
How to Do an Outer Join in R

Leave a Reply

Your email address will not be published. Required fields are marked *