How to Sample by Group Using dplyr


Often you may want to select a random sample of rows by group in R.

Fortunately, this is easy to do by combining the group_by() and sample_n() functions from the dplyr package: when a data frame is grouped, sample_n() draws the requested number of rows from each group.

The sample_n() function uses the following basic syntax:

sample_n(tbl, size, replace=FALSE, …)

where:

  • tbl: The name of the data frame
  • size: The number of rows to select
  • replace: Whether to sample with replacement

Note that in most cases you will want to leave the replace argument set to FALSE, since sampling with replacement allows the same row to be included in the sample multiple times.
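To illustrate what the replace argument changes, here is a minimal sketch. It uses a small hypothetical data frame named toy (an illustrative name that is not part of the example below) and assumes the dplyr package is installed:

```r
library(dplyr)

#hypothetical toy data: two groups with two rows each
toy <- data.frame(group=c('X', 'X', 'Y', 'Y'),
                  value=c(1, 2, 3, 4))

#replace=TRUE lets us draw more rows per group than the group contains,
#so at least one row in each group must appear more than once
with_repl <- toy %>%
  group_by(group) %>%
  sample_n(size=3, replace=TRUE)

#view the sample (3 rows per group, 6 rows total)
with_repl
```

With replace=FALSE, the same call would produce an error, since each group only has two rows to draw from.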

The following example shows how to use the sample_n() function along with the group_by() function from the dplyr package to select a random sample of rows by group.

Example: How to Sample by Group Using dplyr

Suppose we create the following data frame that contains information about various basketball players:

#create data frame
df <- data.frame(team=c('A', 'A', 'A', 'A', 'B', 'B', 'B', 'B'),
                 points=c(99, 68, 86, 88, 95, 74, 78, 93),
                 assists=c(22, 28, 45, 35, 34, 45, 28, 31),
                 rebounds=c(30, 28, 24, 24, 30, 36, 30, 29))

#view data frame
df

  team points assists rebounds
1    A     99      22       30
2    A     68      28       28
3    A     86      45       24
4    A     88      35       24
5    B     95      34       30
6    B     74      45       36
7    B     78      28       30
8    B     93      31       29

Notice that there are two unique teams in this data frame: A and B.

Suppose that we would like to select three random basketball players from each of these teams.

We can use the following syntax to do so:

library(dplyr)

#select three random players from each team
df %>%
  group_by(team) %>%
  sample_n(size=3)

# A tibble: 6 x 4
# Groups:   team [2]
  team  points assists rebounds
  <chr>  <dbl>   <dbl>    <dbl>
1 A         86      45       24
2 A         99      22       30
3 A         88      35       24
4 B         78      28       30
5 B         93      31       29
6 B         95      34       30

This returns three random players from each team.

Note that we can change the value of the size argument in sample_n() to return a different number of players per team.

For example, we can use the following syntax to return two random players from each team instead:

library(dplyr)

#select two random players from each team
df %>%
  group_by(team) %>%
  sample_n(size=2)

# A tibble: 4 x 4
# Groups:   team [2]
  team  points assists rebounds
  <chr>  <dbl>   <dbl>    <dbl>
1 A         88      35       24
2 A         99      22       30
3 B         78      28       30
4 B         93      31       29

This returns two random players from each team, just as we specified.
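One way to confirm the number of sampled rows per team is to count them. The following is a minimal sketch that stores the sample in a variable named sampled (a name chosen here for illustration) and then counts the rows for each team:

```r
library(dplyr)

#same data frame as in the example above
df <- data.frame(team=c('A', 'A', 'A', 'A', 'B', 'B', 'B', 'B'),
                 points=c(99, 68, 86, 88, 95, 74, 78, 93),
                 assists=c(22, 28, 45, 35, 34, 45, 28, 31),
                 rebounds=c(30, 28, 24, 24, 30, 36, 30, 29))

#select two random players from each team and store the result
sampled <- df %>%
  group_by(team) %>%
  sample_n(size=2)

#count sampled rows per team
team_counts <- sampled %>%
  ungroup() %>%
  count(team)

team_counts
```

Each team should show a count of 2, regardless of which specific rows were drawn.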

Note that each time we run this code, the rows selected for each group may differ, since the sample_n() function selects rows randomly.

If you would like to make the code reproducible, you can use the set.seed() function to set a random “seed” so that the same random rows are selected each time.

For example, we could use the following code to do so:

#make this example reproducible
set.seed(1)

library(dplyr)

#select two random players from each team
df %>%
  group_by(team) %>%
  sample_n(size=2)

Now each time we run this code, starting from the set.seed(1) line, the same random sample of rows will be selected.
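As a quick check of this behavior, the sketch below draws the same grouped sample twice, resetting the seed before each draw, and compares the results. The names first_draw and second_draw are chosen here for illustration:

```r
library(dplyr)

#same data frame as in the example above
df <- data.frame(team=c('A', 'A', 'A', 'A', 'B', 'B', 'B', 'B'),
                 points=c(99, 68, 86, 88, 95, 74, 78, 93),
                 assists=c(22, 28, 45, 35, 34, 45, 28, 31),
                 rebounds=c(30, 28, 24, 24, 30, 36, 30, 29))

#first draw with seed set to 1
set.seed(1)
first_draw <- df %>%
  group_by(team) %>%
  sample_n(size=2)

#second draw after resetting the seed to the same value
set.seed(1)
second_draw <- df %>%
  group_by(team) %>%
  sample_n(size=2)

#compare the two samples as plain data frames
same_rows <- identical(as.data.frame(first_draw), as.data.frame(second_draw))
same_rows
```

The comparison should return TRUE because the seed is reset before each draw. If set.seed() is called only once and the sampling code is then run twice, the second draw continues from the updated random state and may select different rows.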

Note: You can find the complete documentation for the sample_n() function in the dplyr package reference, or by running ?sample_n in R after loading dplyr.
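The dplyr documentation marks sample_n() as superseded by slice_sample() as of dplyr 1.0.0. The following is a minimal sketch of the same grouped sampling with the newer function, assuming dplyr 1.0.0 or later is installed. The variable name sampled_new is chosen here for illustration:

```r
library(dplyr)

#same data frame as in the example above
df <- data.frame(team=c('A', 'A', 'A', 'A', 'B', 'B', 'B', 'B'),
                 points=c(99, 68, 86, 88, 95, 74, 78, 93),
                 assists=c(22, 28, 45, 35, 34, 45, 28, 31),
                 rebounds=c(30, 28, 24, 24, 30, 36, 30, 29))

#make this example reproducible
set.seed(1)

#select two random players from each team using slice_sample()
sampled_new <- df %>%
  group_by(team) %>%
  slice_sample(n=2)

sampled_new
```

Note that slice_sample() takes the number of rows through the n argument rather than size. The sample_n() function still works, so existing code does not need to change.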

Additional Resources

The following tutorials explain how to perform other common tasks in R:

How to Rename Columns Using dplyr
How to Add Row to Data Frame Using dplyr
How to Use the pull() Function in dplyr
