Regular Expressions in R

When working with datasets, you often need to search for or replace text within strings. Fortunately, R has several built-in functions for both tasks.

This tutorial explains how to use the following functions to search for matches:

  • grep()
  • grepl()

This tutorial also explains how to use the following functions for performing replacements:

  • sub()
  • gsub()

Using grep() and grepl() to search for strings

The function grep() returns an integer vector of indices of the elements of a vector that match some pattern. The basic syntax for grep() is as follows:

grep(pattern, x, value = FALSE)

  • pattern: character string containing the regular expression to be matched in a given vector
  • x: a vector where matches are sought
  • value: if FALSE, a vector containing the indices of the matches is returned, and if TRUE, a vector containing the matching elements themselves is returned.

The following code illustrates an example of using grep() to find the indices of elements in a vector that match some pattern:

#create a character vector x with five names
x <- c('bob', 'adam', 'doug', 'larry', 'harry')

#return a vector of indices that match the pattern 'arry'
grep('arry', x)

#[1] 4 5

#return the actual elements that match the pattern 'arry'
grep('arry', x, value = TRUE)

#[1] "larry" "harry"
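
Because grep() interprets pattern as a regular expression by default, characters such as ^, $ and . have a special meaning. The following sketch reuses the vector x from above to show this; ignore.case and fixed are standard grep() arguments, while the second vector c('a.b', 'ab') is made up purely for illustration:

```r
#same character vector as above
x <- c('bob', 'adam', 'doug', 'larry', 'harry')

#'^' anchors the match to the start of each element
grep('^a', x, value = TRUE)

#[1] "adam"

#'$' anchors the match to the end; ignore.case = TRUE ignores letter case
grep('ARRY$', x, ignore.case = TRUE, value = TRUE)

#[1] "larry" "harry"

#fixed = TRUE treats the pattern as literal text, so '.' is not a wildcard
grep('.', c('a.b', 'ab'), fixed = TRUE)

#[1] 1
```

Without fixed = TRUE, the pattern '.' would match any character and both elements of the second vector would be returned.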

The function grepl() returns a logical vector with one value per element of the input vector: TRUE if the element matches the pattern and FALSE if it does not. The basic syntax for grepl() is as follows:

grepl(pattern, x)

  • pattern: character string containing the regular expression to be matched in a given vector
  • x: a vector where matches are sought

The following code illustrates an example of using grepl() to test whether each element in a vector matches some pattern:

#create a character vector x with five names
x <- c('bob', 'adam', 'doug', 'larry', 'harry')

#return a logical vector indicating which elements match the pattern 'arry'
grepl('arry', x)

#[1] FALSE FALSE FALSE  TRUE  TRUE
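
A logical vector is most useful as a filter. The following is a minimal sketch of one common way to use grepl() for subsetting and counting, again with the same vector x:

```r
#same character vector as above
x <- c('bob', 'adam', 'doug', 'larry', 'harry')

#keep only the elements that match the pattern 'arry'
x[grepl('arry', x)]

#[1] "larry" "harry"

#keep only the elements that do not match the pattern 'arry'
x[!grepl('arry', x)]

#[1] "bob"  "adam" "doug"

#count how many elements match (TRUE counts as 1)
sum(grepl('arry', x))

#[1] 2
```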

Using sub() and gsub() to replace strings

The function sub() replaces the first occurrence of a pattern in each element of a vector with a user-specified replacement string. The basic syntax for sub() is as follows:

sub(pattern, replacement, x)

  • pattern: character string to be searched for in a given vector
  • replacement: a replacement string for a matched pattern
  • x: a vector where matches are sought

The following code illustrates an example of using sub() to replace the first occurrence of a substring with another user-specified substring: 

#create a sentence
sentence <- 'Jessica likes Hawaii. She would like to live in Hawaii some day.'

#replace the first occurrence of 'Hawaii' with 'HI'
sub('Hawaii', 'HI', sentence)

#[1] "Jessica likes HI. She would like to live in Hawaii some day."
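
Like most R functions, sub() is vectorized: when x has several elements, the first match in each element is replaced. The following sketch shows this with a made-up vector named sentences, which is not part of the example above:

```r
#create a vector of two sentences (illustrative data)
sentences <- c('Hawaii is warm. Hawaii is sunny.', 'I miss Hawaii.')

#replace the first occurrence of 'Hawaii' in each element with 'HI'
sub('Hawaii', 'HI', sentences)

#[1] "HI is warm. Hawaii is sunny." "I miss HI."
```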

The function gsub() replaces all occurrences of a substring with another user-specified substring. The basic syntax for gsub() is as follows:

gsub(pattern, replacement, x)

  • pattern: character string to be searched for in a given vector
  • replacement: a replacement string for a matched pattern
  • x: a vector where matches are sought

The following code illustrates an example of using gsub() to replace all occurrences of a substring with another user-specified substring: 

#create a sentence
sentence <- 'Jessica likes Hawaii. She would like to live in Hawaii some day.'

#replace all occurrences of 'Hawaii' with 'HI'
gsub('Hawaii', 'HI', sentence)

#[1] "Jessica likes HI. She would like to live in HI some day."

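The pattern argument of gsub() is interpreted as a regular expression, so it can match more than literal text. The following code illustrates how to remove all digits from a sentence using the regular expression \\d, which matches any single digit. The backslash is doubled because \ is also the escape character in R string literals.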

#create a sentence
sentence <- 'Jessica likes Hawaii in 2019. She wants to live in Hawaii in 2025.'

#remove all digits by replacing each one with an empty string
gsub('\\d', '', sentence)

#[1] "Jessica likes Hawaii in . She wants to live in Hawaii in ."
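
Removing only the digits leaves a dangling "in ." behind. As a sketch of how a slightly richer pattern can help, the following code uses the quantifier + to match a whole run of digits and a capture group with the backreference \\1 to reuse the matched text in the replacement; the specific patterns are just one possible choice for this sentence:

```r
#same sentence as above
sentence <- 'Jessica likes Hawaii in 2019. She wants to live in Hawaii in 2025.'

#remove each year together with the ' in ' that precedes it
gsub(' in \\d+', '', sentence)

#[1] "Jessica likes Hawaii. She wants to live in Hawaii."

#wrap each four-digit number in angle brackets; \\1 refers to the captured group
gsub('(\\d{4})', '<\\1>', sentence)

#[1] "Jessica likes Hawaii in <2019>. She wants to live in Hawaii in <2025>."
```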

Check out this Regular Expression Cheat Sheet to find several different regular expressions you can use with the gsub() function. 
