How to Use the summarize() Function in R


One of the most common tasks you’ll perform in data science and machine learning is summarizing values in a dataset.

Arguably the most common way to do so in the R programming language is by using the summarize() function from the dplyr package.

Note: The summarize() and summarise() functions are equivalent in dplyr.
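As a quick check of this equivalence, the following sketch runs the same summary with both spellings and compares the results. It assumes dplyr is already installed; the object names us_spelling and uk_spelling are made up for illustration.

```r
library(dplyr)

#run the same summary with both spellings of the function
us_spelling <- mtcars %>% summarize(mean_mpg = mean(mpg))
uk_spelling <- mtcars %>% summarise(mean_mpg = mean(mpg))

#check that the two results are exactly the same
identical(us_spelling, uk_spelling)
```

If the two spellings are equivalent, the last line should return TRUE.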

This tutorial provides several examples of how to use the summarize() function in practice with the built-in mtcars dataset in R:

#view first six rows of mtcars
head(mtcars)

                   mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

Load the dplyr Package

Before you can use the summarize() function, you must install the dplyr package (only needed once) and then load it:

#install dplyr package
install.packages('dplyr')

#load dplyr package
library(dplyr)

Example 1: Use summarize() with One Variable

We can use the following syntax to calculate the mean value of the mpg column in the dataset:

#calculate mean of mpg column
mtcars %>%
  summarize(mean_mpg = mean(mpg, na.rm = TRUE))

  mean_mpg
1 20.09062

From the output we can see that the mean value for the mpg column is 20.09062.

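Note that we used the argument na.rm = TRUE to specify that any missing values should be removed before calculating the mean value. The mtcars dataset contains no missing values, so the argument does not change the result here, but it matters for data that does contain them.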
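To see what na.rm = TRUE changes, here is a minimal sketch using a small made-up data frame. The data frame df and its points column are hypothetical and are not part of mtcars.

```r
library(dplyr)

#hypothetical data frame with one missing value (not part of mtcars)
df <- data.frame(points = c(10, NA, 30))

#without na.rm, the missing value propagates and the mean is NA
with_na <- df %>%
  summarize(mean_points = mean(points))

#with na.rm = TRUE, the NA is dropped before the mean is calculated
without_na <- df %>%
  summarize(mean_points = mean(points, na.rm = TRUE))

with_na
without_na
```

The first summary returns NA, while the second returns the mean of the two remaining values, 10 and 30.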

Also note that we can use similar syntax to calculate a variety of descriptive statistics.

For example, we could use the following syntax to calculate the median value for the mpg column instead:

#calculate median of mpg column
mtcars %>%
  summarize(median_mpg = median(mpg, na.rm = TRUE))

  median_mpg
1       19.2

From the output we can see that the median value for the mpg column is 19.2.

Or we could use the following syntax to calculate the 90th percentile of values for the mpg column:

#find 90th percentile of mpg column
mtcars %>%
  summarize(quant90 = quantile(mpg, probs = .9))

  quant90
1   30.09

From the output we can see that the 90th percentile of values for the mpg column is 30.09.
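Several statistics for the same column can also be requested in a single summarize() call, with one named argument per statistic. The sketch below is one way to do this; the output column names (mean_mpg, sd_mpg, and so on) are arbitrary labels chosen for illustration.

```r
library(dplyr)

#calculate several descriptive statistics for mpg in one call
mpg_stats <- mtcars %>%
  summarize(mean_mpg = mean(mpg),
            median_mpg = median(mpg),
            sd_mpg = sd(mpg),
            min_mpg = min(mpg),
            max_mpg = max(mpg))

mpg_stats
```

The result is a data frame with a single row and one column per statistic.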

Example 2: Use summarize() with Multiple Variables

We can also use the summarize() function to summarize multiple variables at once.

For example, we can use the following syntax to calculate the 90th percentile of values for three variables all at once:

#find 90th percentile of multiple columns
mtcars %>%
  summarize(quant90mpg = quantile(mpg, probs = .9),
            quant90qsec = quantile(qsec, probs = .9),
            quant90disp = quantile(disp, probs = .9))

  quant90mpg quant90qsec quant90disp
1      30.09       19.99         396

From the output we can see:

  • The 90th percentile of values for the mpg column is 30.09.
  • The 90th percentile of values for the qsec column is 19.99.
  • The 90th percentile of values for the disp column is 396.

Feel free to include as many variables as you would like within the summarize() function.
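When the same statistic is applied to many columns, writing one argument per column becomes repetitive. A possible alternative, assuming dplyr version 1.0.0 or later, is the across() helper. The sketch below applies the 90th percentile calculation to the same three columns; the .names pattern is chosen here only to reproduce the column names used in this example.

```r
library(dplyr)

#find 90th percentile of several columns using across()
quant90_all <- mtcars %>%
  summarize(across(c(mpg, qsec, disp),
                   ~ quantile(.x, probs = .9),
                   .names = "quant90{.col}"))

quant90_all
```

This should produce the same three columns as the call that lists each variable by hand.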

Example 3: Use summarize() by Group

The following code shows how to use the summarize() function to calculate the mean value of the mpg variable, grouped by the cyl variable:

#find mean of mpg grouped by cyl
mtcars %>%
  group_by(cyl) %>%
  summarize(mean_mpg = mean(mpg, na.rm = TRUE))

# A tibble: 3 x 2
    cyl mean_mpg
  <dbl>    <dbl>
1     4     26.7
2     6     19.7
3     8     15.1

The output displays the mean value of the mpg variable for each unique value of the cyl variable.
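A grouped summary is often easier to interpret when it also reports how many rows fall into each group. The following sketch adds a count with the dplyr n() function; the column name n_cars is an arbitrary label chosen for illustration.

```r
library(dplyr)

#find number of cars and mean of mpg, grouped by cyl
cyl_summary <- mtcars %>%
  group_by(cyl) %>%
  summarize(n_cars = n(),
            mean_mpg = mean(mpg, na.rm = TRUE))

cyl_summary
```

The result has one row per unique value of cyl, with the group size shown next to the group mean.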

Additional Resources

The following tutorials explain how to perform other common tasks in R:

How to Calculate Conditional Mean in R
How to Calculate a Trimmed Mean in R
How to Calculate a Weighted Mean in R
