dplyr: Select Columns Based on Multiple Strings


Often you may want to select columns in a data frame in R whose name contains one of several strings.

One of the easiest ways to do this is to use the matches() function from the dplyr package in R, which selects columns whose names match a regular expression.

The matches() function uses the following basic syntax:

library(dplyr)

df %>%
  select(matches('pattern1|pattern2'))

This particular example will return all columns from the data frame named df that contain either the string pattern1 or pattern2 in the column name.

Note that the | operator acts as “OR” logic inside the pattern, and we can chain as many of these operators as we’d like within the matches() function to search for multiple strings in the column names.
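
When the list of strings is long, typing every | by hand becomes tedious. The following is a minimal sketch of one way to build the pattern from a character vector with paste(). The small data frame and the keywords vector are hypothetical and used only for illustration:

```r
library(dplyr)

#hypothetical data frame used only for this illustration
df <- data.frame(team=c('A', 'B'),
                 total_points=c(99, 68),
                 total_assists=c(22, 28),
                 total_rebounds=c(30, 28))

#hypothetical vector of strings to search for in the column names
keywords <- c('point', 'reb', 'assist')

#join the strings with | to form a single regular expression
pattern <- paste(keywords, collapse='|')

#select any columns whose name contains at least one of the strings
result <- df %>%
  select(matches(pattern))

#view the names of the selected columns
names(result)
```

Here pattern is the string 'point|reb|assist', so the call is equivalent to writing select(matches('point|reb|assist')) directly.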

The following example shows how to use the matches() function from the dplyr package in practice.

Note: If you don’t already have the dplyr package installed then you can use the following syntax to do so:

install.packages('dplyr')

Once you’ve installed the dplyr package, you will be able to use the matches() function.

Example: How to Select Columns Based on Multiple Strings in dplyr

Suppose we create the following data frame that contains information about various basketball players:

#create data frame
df <- data.frame(team=c('A', 'A', 'A', 'A', 'B', 'B', 'B', 'B'),
                 total_points=c(99, 68, 86, 88, 95, 74, 78, 93),
                 total_assists=c(22, 28, 45, 35, 34, 45, 28, 31),
                 total_rebounds=c(30, 28, 24, 24, 30, 36, 30, 29))

#view data frame
df

  team total_points total_assists total_rebounds
1    A           99            22             30
2    A           68            28             28
3    A           86            45             24
4    A           88            35             24
5    B           95            34             30
6    B           74            45             36
7    B           78            28             30
8    B           93            31             29

Suppose that we would like to select any columns that contain either “point” or “reb” in the column name.

We can use the matches() function from the dplyr package to do so:

library(dplyr)

#select any columns that contain 'point' or 'reb' in the name
df %>%
  select(matches('point|reb'))

  total_points total_rebounds
1           99             30
2           68             28
3           86             24
4           88             24
5           95             30
6           74             36
7           78             30
8           93             29

Notice that this returns the total_points and total_rebounds columns, which are the two columns that contain at least one of the strings that we specified in the matches() function.

Note that the select() function in dplyr is used to select specific columns from a data frame. By using the matches() function inside select(), we are able to select columns whose names contain specific strings.
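
One detail the example above does not show is case sensitivity. By default the matches() function ignores case, and this can be changed with its ignore.case argument. The following is a minimal sketch that uses a cut-down, two-row copy of the data frame purely for illustration:

```r
library(dplyr)

#two-row copy of the example data frame, used only for this illustration
df <- data.frame(team=c('A', 'B'),
                 total_points=c(99, 68),
                 total_assists=c(22, 28),
                 total_rebounds=c(30, 28))

#default behavior: case is ignored, so an upper-case pattern still matches
insensitive <- df %>%
  select(matches('POINT|REB'))

names(insensitive)

#ignore.case=FALSE: the pattern must match the case of the column names
sensitive <- df %>%
  select(matches('POINT|REB', ignore.case=FALSE))

ncol(sensitive)
```

The first call selects total_points and total_rebounds, while the second returns a data frame with zero columns because no column name contains the upper-case strings.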

Feel free to use any regex operators that you would like within the matches() function as well.

For example, we could use the ^ operator to select columns that “start with” a specific string.

We can use the following syntax to select all columns that either start with the string “total” or include the string “tea”:

library(dplyr)

#select any columns that start with 'total' or contain 'tea' in the name
df %>%
  select(matches('^total|tea'))

  team total_points total_assists total_rebounds
1    A           99            22             30
2    A           68            28             28
3    A           86            45             24
4    A           88            35             24
5    B           95            34             30
6    B           74            45             36
7    B           78            28             30
8    B           93            31             29

This returns all columns from the data frame because each of the columns in the data frame either starts with “total” or contains the string “tea” in the column name.
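
To see why every column is selected, it can help to test the pattern against the column names directly. The following sketch uses grepl() from base R as a rough stand-in for what matches() does internally. This is an assumption made for illustration, with ignore.case=TRUE set to mirror the default behavior of matches():

```r
#column names from the example data frame
col_names <- c('team', 'total_points', 'total_assists', 'total_rebounds')

#test each name against the same pattern used in the example
hits <- grepl('^total|tea', col_names, ignore.case=TRUE)

#show which names matched the pattern
setNames(hits, col_names)
```

Each of the four names returns TRUE: team matches because it contains “tea”, and the other three match because they start with “total”.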

The same approach works with whatever other regex operators you’d like to use to select columns that match specific patterns.

Note: You can find the complete documentation for the matches() function in the official dplyr documentation.

Additional Resources

The following tutorials explain how to perform other common tasks in R:

How to Use the pull() Function in dplyr
How to Use slice_max() in dplyr
How to Use top_n() in dplyr
How to Add Row to Data Frame Using dplyr
