How to Use the gregexpr() Function in R


Often you may want to match strings in a character vector in R using regular expression patterns.

One way to do so is with the gregexpr() function from base R, which finds every match of a pattern within each string of a character vector.

The gregexpr() function uses the following syntax:

gregexpr(pattern, text, ignore.case = FALSE, …)

where:

  • pattern: The regular expression pattern to match
  • text: The character vector to search for matches
  • ignore.case: Whether to ignore case when matching (FALSE means the match is case-sensitive)

Note that the default option for the ignore.case argument is FALSE, which means that the function will perform case-sensitive matching by default.

To perform case-insensitive matching instead, specify ignore.case=TRUE.
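To make the return value concrete, here is a minimal sketch on a single hypothetical string (the value "banana" and the variable names are illustrative assumptions, not part of the example that follows). It suggests that gregexpr() reports the starting position of every match in a string, and -1 when there is no match.

```r
#hypothetical input string, used only for illustration
x <- "banana"

#find every occurrence of "an" in x
m <- gregexpr("an", x)

#starting positions of the matches (one list element per input string)
pos <- as.integer(m[[1]])
pos
#[1] 2 4

#length of each match
len <- attr(m[[1]], "match.length")
len
#[1] 2 2

#a pattern that does not occur yields a single -1
no_match <- as.integer(gregexpr("z", x)[[1]])
no_match
#[1] -1
```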

The following example shows how to use the gregexpr() function in practice, first with case-sensitive matching and then with case-insensitive matching.

Example: How to Use gregexpr() Function in R

Suppose we create the following data frame in R that contains information about various basketball players including their team name and total number of points scored:

#create data frame
df <- data.frame(team=c('Mavs', 'mavs', 'Heat', 'heat', 'Magic', 'magic', 'Nets'),
                 points=c(22, 26, 40, 23, 19, 16, 35))

#view data frame
df

   team points
1  Mavs     22
2  mavs     26
3  Heat     40
4  heat     23
5 Magic     19
6 magic     16
7  Nets     35

Suppose that we would like to determine if each string in the team column contains the substring “Ma” or not.

We can use the following syntax to do so:

#check if each value in team column contains "Ma"
result <- gregexpr("Ma", df$team)

#extract match.length attribute from gregexpr function
sapply(result, function(y) attr(y, "match.length")>0)

[1]  TRUE FALSE FALSE FALSE  TRUE FALSE FALSE

Note that the gregexpr() function returns a list with one element per input string. Each element is an integer vector of match positions with several attributes attached.

For this example, we are only interested in the match.length attribute, which gives the number of characters matched by the pattern in each string, or -1 if the string contains no match. This is why comparing it to 0 yields TRUE only for strings that contain the pattern.

Here is how to interpret the output:

  • The first value is TRUE because the team name “Mavs” does contain “Ma” in the string.
  • The second value is FALSE because the team name “mavs” does not contain “Ma” in the string.
  • The third value is FALSE because the team name “Heat” does not contain “Ma” in the string.
  • The fourth value is FALSE because the team name “heat” does not contain “Ma” in the string.

And so on.
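The sapply() call above returns a tidy logical vector because each team name contains the pattern at most once. If a string could contain several matches, the anonymous function would return several values for that string. A hedged variation, using a hypothetical vector rather than the data frame above, collapses each list element to a single value with any() and sum():

```r
#hypothetical strings; "banana" contains the pattern three times
x <- c("Mavs", "Nets", "banana")
result <- gregexpr("a", x)

#TRUE if the string has at least one match (the position is -1 when there is none)
has_match <- vapply(result, function(y) any(y > 0), logical(1))
has_match
#[1]  TRUE FALSE  TRUE

#number of matches in each string
n_matches <- vapply(result, function(y) sum(y > 0), integer(1))
n_matches
#[1] 1 0 3
```

Using vapply() with a declared return type is a design choice here: it fails loudly if an element ever produces something other than a single value.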

Note that we could specify ignore.case=TRUE to perform case-insensitive string matching instead:

#check if each value in team column contains "Ma", case-insensitive
result <- gregexpr("Ma", df$team, ignore.case=TRUE)

#extract match.length attribute from gregexpr function
sapply(result, function(y) attr(y, "match.length")>0)

[1]  TRUE  TRUE FALSE FALSE  TRUE  TRUE FALSE

Here is how to interpret the output:

  • The first value is TRUE because the team name “Mavs” does contain “Ma” in the string.
  • The second value is TRUE because the team name “mavs” does contain “Ma” (case-insensitive) in the string.
  • The third value is FALSE because the team name “Heat” does not contain “Ma” in the string.
  • The fourth value is FALSE because the team name “heat” does not contain “Ma” in the string.

And so on.
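If the matched text itself is of interest, and not only whether a match exists, the base R function regmatches() can be combined with gregexpr(). The sketch below is one possible approach, not part of the original example; it uses a plain character vector with the same team names, and the variable names are assumptions made for illustration.

```r
#same team names as in the data frame, stored as a plain character vector
teams <- c("Mavs", "mavs", "Heat", "heat", "Magic", "magic", "Nets")

#locate "Ma" in each team name, ignoring case
m <- gregexpr("Ma", teams, ignore.case = TRUE)

#extract the matched substrings (a list with one element per team name)
matched <- regmatches(teams, m)

#number of matches found in each team name
n_found <- lengths(matched)
n_found
#[1] 1 1 0 0 1 1 0

#all matched substrings, as they appear in the original strings
all_found <- unlist(matched)
all_found
#[1] "Ma" "ma" "Ma" "ma"
```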

Additional Resources

The following tutorials explain how to perform other common tasks in R:

How to Create a Frequency Table by Group in R
How to Create a Frequency Polygon in R
How to Create Relative Frequency Tables in R
