One of the most common tasks you’ll perform in data science and machine learning is summarizing values in a dataset.

Arguably the most common way to do so in the R programming language is by using the **summarize()** function from the **dplyr** package.

**Note:** The **summarize()** and **summarise()** functions are equivalent in dplyr.

This tutorial provides several examples of how to use the **summarize()** function in practice with the built-in mtcars dataset in R:

    #view first six rows of mtcars
    head(mtcars)

                       mpg cyl disp  hp drat    wt  qsec vs am gear carb
    Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
    Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
    Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
    Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
    Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
    Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

**Load the dplyr Package**

Before you can use the **summarize()** function, you must first load the **dplyr** package:

    #install dplyr package
    install.packages('dplyr')

    #load dplyr package
    library(dplyr)
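Calling **install.packages()** on every run reinstalls the package each time. As a minimal sketch of a common alternative (not something this tutorial requires), the snippet below installs dplyr only when it is not already available, using base R's `requireNamespace()`:

```r
# install dplyr only if it is not already installed
if (!requireNamespace("dplyr", quietly = TRUE)) {
  install.packages("dplyr")
}

# load dplyr package
library(dplyr)
```

This keeps a script safe to re-run without triggering a fresh download.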

**Example 1: Use summarize() with One Variable**

We can use the following syntax to summarize the mean value of the **mpg** column in the dataset:

    #calculate mean of mpg column
    mtcars %>%
      summarize(mean_mpg = mean(mpg, na.rm = TRUE))

      mean_mpg
    1 20.09062

From the output we can see that the mean value for the **mpg** column is **20.09062**.

Note that we used the argument **na.rm = TRUE** to specify that any missing values should be removed before calculating the mean value.
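The mtcars dataset has no missing values, so the effect of **na.rm = TRUE** is not visible in the output above. To illustrate it, here is a minimal sketch using a small made-up data frame (the `scores` data frame and its `points` column are hypothetical and not part of mtcars):

```r
library(dplyr)

# hypothetical data frame with one missing value (not part of mtcars)
scores <- data.frame(points = c(10, 20, NA, 30))

# without na.rm, a single NA makes the mean NA
with_na <- scores %>%
  summarize(mean_points = mean(points))

# with na.rm = TRUE, the NA is dropped before averaging
without_na <- scores %>%
  summarize(mean_points = mean(points, na.rm = TRUE))

with_na$mean_points     # NA
without_na$mean_points  # 20
```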

Also note that we can use similar syntax to calculate a variety of descriptive statistics.

For example, we could use the following syntax to calculate the median value for the **mpg** column instead:

    #calculate median of mpg column
    mtcars %>%
      summarize(median_mpg = median(mpg, na.rm = TRUE))

      median_mpg
    1       19.2

From the output we can see that the median value for the **mpg** column is **19.2**.

Or we could use the following syntax to calculate the 90th percentile of values for the **mpg** column:

    #find 90th percentile of mpg column
    mtcars %>%
      summarize(quant90 = quantile(mpg, probs = .9))

      quant90
    1   30.09

From the output we can see that the 90th percentile of values for the **mpg** column is **30.09**.
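The three statistics from this example do not have to be computed in separate calls. As a small sketch (the result name `mpg_stats` is just an illustrative choice), they can be requested in a single **summarize()** call, which returns one row with one column per statistic:

```r
library(dplyr)

# mean, median, and 90th percentile of mpg in one call
mpg_stats <- mtcars %>%
  summarize(mean_mpg = mean(mpg, na.rm = TRUE),
            median_mpg = median(mpg, na.rm = TRUE),
            quant90 = quantile(mpg, probs = .9, na.rm = TRUE))

mpg_stats
```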

**Example 2: Use summarize() with Multiple Variables**

We can also use the **summarize()** function to summarize multiple variables at once.

For example, we can use the following syntax to calculate the 90th percentile of values for three variables all at once:

    #find 90th percentile of multiple columns
    mtcars %>%
      summarize(quant90mpg = quantile(mpg, probs = .9),
                quant90qsec = quantile(qsec, probs = .9),
                quant90disp = quantile(disp, probs = .9))

      quant90mpg quant90qsec quant90disp
    1      30.09       19.99         396

From the output we can see:

- The 90th percentile of values for the **mpg** column is **30.09**.
- The 90th percentile of values for the **qsec** column is **19.99**.
- The 90th percentile of values for the **disp** column is **396**.

Feel free to include as many variables as you would like within the **summarize()** function.
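When the same calculation is applied to many columns, writing one argument per column gets repetitive. A possible shortcut, sketched here under the assumption that dplyr 1.0.0 or later is installed, is the `across()` helper. The `.names` pattern below is chosen only to reproduce the column names used above:

```r
library(dplyr)

# apply the same 90th-percentile calculation to three columns via across()
quant90_all <- mtcars %>%
  summarize(across(c(mpg, qsec, disp),
                   ~ quantile(.x, probs = .9),
                   .names = "quant90{.col}"))

quant90_all
```

This should give the same three values as the explicit version, with the columns named quant90mpg, quant90qsec, and quant90disp.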

**Example 3: Use summarize() by Group**

The following code shows how to use the **summarize()** function to calculate the mean value of the **mpg** variable, grouped by the **cyl** variable:

    #find mean of mpg grouped by cyl
    mtcars %>%
      group_by(cyl) %>%
      summarize(mean_mpg = mean(mpg, na.rm = TRUE))

    # A tibble: 3 x 2
        cyl mean_mpg
    1     4     26.7
    2     6     19.7
    3     8     15.1

The output displays the mean value of the **mpg** variable for each unique value of the **cyl** variable.
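A group mean is easier to interpret alongside the number of rows behind it. As a minimal sketch (the names `by_cyl` and `n_cars` are illustrative, and the `.groups` argument assumes dplyr 1.0.0 or later), the grouped summary can also report a count per group with `n()`:

```r
library(dplyr)

# count of cars and mean mpg for each value of cyl
by_cyl <- mtcars %>%
  group_by(cyl) %>%
  summarize(n_cars = n(),
            mean_mpg = mean(mpg, na.rm = TRUE),
            .groups = "drop")  # return an ungrouped result

by_cyl
```

In mtcars, the 4-, 6-, and 8-cylinder groups contain 11, 7, and 14 cars respectively.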

**Additional Resources**

The following tutorials explain how to perform other common tasks in R:

How to Calculate Conditional Mean in R

How to Calculate a Trimmed Mean in R

How to Calculate a Weighted Mean in R