How to Group data.table by Multiple Columns in R


Often you may want to group the rows of a data.table in R based on multiple columns.

You can use the following basic syntax to do so:

dt[, mean_points:=mean(points), by=c('team', 'position')]

This particular example will group the rows of the data.table named dt by the team and position columns and then create a new column named mean_points that calculates the mean value of the points column.

Note that we chose to use the mean() function to calculate the mean value of one column, grouped by two other columns, but you can use whatever function you’d like when summarizing your own data.table in R.

The following example shows how to group a data.table by multiple columns in R in practice.

Example: How to Group data.table by Multiple Columns in R

Suppose we create the following data.table in R that contains information about various basketball players:

library(data.table)

#create data table
dt <- data.table(team=c('A', 'A', 'A', 'A', 'B', 'B', 'B', 'B'),
                 position=c('G', 'G', 'F', 'F', 'G', 'G', 'F', 'F'),
                 points=c(99, 68, 86, 88, 95, 74, 78, 93),
                 assists=c(22, 28, 45, 35, 34, 45, 28, 31))

#view data table
dt

   team position points assists
1:    A        G     99      22
2:    A        G     68      28
3:    A        F     86      45
4:    A        F     88      35
5:    B        G     95      34
6:    B        G     74      45
7:    B        F     78      28
8:    B        F     93      31

The data.table contains the following columns:

  • team: The team name a player belongs to
  • position: The position of the player
  • points: The total points scored by the player
  • assists: The total assists made by the player

Suppose that we would like to calculate the mean value of the points column, grouped by the values in the team and position columns.

We can use the following syntax to do so:

library(data.table)

#calculate mean of points column, grouped by team and position column
dt[, mean_points:=mean(points), by=c('team', 'position')]

#view updated data table
dt

   team position points assists mean_points
1:    A        G     99      22        83.5
2:    A        G     68      28        83.5
3:    A        F     86      45        87.0
4:    A        F     88      35        87.0
5:    B        G     95      34        84.5
6:    B        G     74      45        84.5
7:    B        F     78      28        85.5
8:    B        F     93      31        85.5

The new column named mean_points contains the mean value of the points column, grouped by the values in the team and position columns.

For example, we can see:

  • The mean points scored by players on team A in position G is 83.5.
  • The mean points scored by players on team A in position F is 87.0.
  • The mean points scored by players on team B in position G is 84.5.
  • The mean points scored by players on team B in position F is 85.5.

Note that you can replace the mean() function with any function that you would like to calculate a different descriptive statistic.

For example, we could use the following syntax to instead calculate the max of the points column, grouped by the team and position columns:

library(data.table)

#calculate max of points column, grouped by team and position column
dt[, max_points:=max(points), by=c('team', 'position')]

#view updated data table
dt

   team position points assists max_points
1:    A        G     99      22         99
2:    A        G     68      28         99
3:    A        F     86      45         88
4:    A        F     88      35         88
5:    B        G     95      34         95
6:    B        G     74      45         95
7:    B        F     78      28         93
8:    B        F     93      31         93

The new column named max_points contains the max value of the points column, grouped by the values in the team and position columns.

Additional Resource

The following tutorials explain how to perform other common tasks in R:

How to Filter a data.table in R
How to Sort a data.table in R
How to Remove Empty Rows from Data Frame in R

Featured Posts

Leave a Reply

Your email address will not be published. Required fields are marked *