Two of the most common tasks that you’ll perform in data analysis are grouping and summarizing data.

Fortunately the dplyr package in R allows you to quickly group and summarize data.

This tutorial provides a quick guide to getting started with dplyr.

**Install & Load the dplyr Package**

Before you can use the functions in the dplyr package, you must first load the package:

#install dplyr (if not already installed) install.packages('dplyr') #load dplyr library(dplyr)

Next, we’ll illustrate several examples of how to use the functions in dplyr to group and summarize data using the built-in R dataset called **mtcars**:

#obtain rows and columns ofmtcarsdim(mtcars) [1] 32 11 #view first six rows ofmtcarshead(mtcars) mpg cyl disp hp drat wt qsec vs am gear carb Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4 Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4 Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1 Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2 Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1

The basic syntax that we’ll use to group and summarize data is as follows:

data %>% group_by(col_name) %>% summarize(summary_name = summary_function)

**Note: **The functions summarize() and summarise() are equivalent.

**Example 1: Find Mean & Median by Group**

The following code shows how to calculate measures of central tendency by group including the mean and the median:

#find mean mpg by cylinder mtcars %>% group_by(cyl) %>% summarize(mean_mpg = mean(mpg, na.rm = TRUE)) # A tibble: 3 x 2 cyl mean_mpg 1 4 26.7 2 6 19.7 3 8 15.1 #find median mpg by cylinder mtcars %>% group_by(cyl) %>% summarize(median_mpg = median(mpg, na.rm = TRUE)) # A tibble: 3 x 2 cyl median_mpg 1 4 26 2 6 19.7 3 8 15.2

**Example 2: Find Measures of Spread by Group**

The following code shows how to calculate measures of dispersion by group including the standard deviation, interquartile range, and median absolute deviation:

#find sd, IQR, and mad by cylinder mtcars %>% group_by(cyl) %>% summarize(sd_mpg = sd(mpg, na.rm = TRUE), iqr_mpg = IQR(mpg, na.rm = TRUE), mad_mpg = mad(mpg, na.rm = TRUE)) # A tibble: 3 x 4 cyl sd_mpg iqr_mpg mad_mpg 1 4 4.51 7.60 6.52 2 6 1.45 2.35 1.93 3 8 2.56 1.85 1.56

**Example 3: Find Count by Group**

The following code shows how to find the count and the unique count by group in R:

#find row count and unique row count by cylinder mtcars %>% group_by(cyl) %>% summarize(count_mpg = n(), u_count_mpg = n_distinct(mpg)) # A tibble: 3 x 3 cyl count_mpg u_count_mpg 1 4 11 9 2 6 7 6 3 8 14 12

**Example 4: Find Percentile by Group**

The following code shows how to find the 90th percentile of values for mpg by cylinder group:

#find 90th percentile of mpg for each cylinder group mtcars %>% group_by(cyl) %>% summarize(quant90 = quantile(mpg, probs = .9)) # A tibble: 3 x 2 cyl quant90 1 4 32.4 2 6 21.2 3 8 18.3

**Additional Resources**

You can find the complete documentation for the dplyr package along with helpful visualize cheat sheets here.

Other useful functions that you can use along with **group_by()** and **summarize()** include functions for filtering data frame rows and arranging rows in certain orders.