How to Manipulate, Aggregate, and Analyze Data in R Using dplyr

How to recode variables in R using dplyr

The R package dplyr makes it easy to manipulate, aggregate, and analyze data. This tutorial explains the basic functions of the dplyr package as well as several examples of how to use them.

The Basic Functions of dplyr

The package dplyr offers the following functions for working with data: 

select() – subset data by columns

filter() – subset data by rows

mutate() – add columns

group_by() – split data into groups

summarize() – summarize groups or columns

arrange() – sort rows by column values

Loading dplyr

The first step to using the dplyr package is to install it (if not already installed) and load it. This single line of code will check if dplyr is installed, download it if it’s not already installed, and then load it:

if(!require(dplyr)){install.packages('dplyr')}

Working with Data Using dplyr

The following examples illustrate how to use the functions outlined above to work with data using dplyr in a variety of ways. Each of the following examples will use the built-in R dataset iris.

#view first six rows of iris dataset
head(iris)

#  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#1          5.1         3.5          1.4         0.2  setosa
#2          4.9         3.0          1.4         0.2  setosa
#3          4.7         3.2          1.3         0.2  setosa
#4          4.6         3.1          1.5         0.2  setosa
#5          5.0         3.6          1.4         0.2  setosa
#6          5.4         3.9          1.7         0.4  setosa

select()

The following code illustrates how to select specific columns of the iris dataset using the select() function:

#select Sepal.Length and Species columns
specific_cols <- iris %>%
  select(Sepal.Length, Species)

#view first six rows
head(specific_cols)

#  Sepal.Length Species
#1          5.1  setosa
#2          4.9  setosa
#3          4.7  setosa
#4          4.6  setosa
#5          5.0  setosa
#6          5.4  setosa

filter()

The following code illustrates how to filter on specific rows of the iris dataset using the filter() function: 

#filter on rows where Sepal.Length is greater than 5
specific_rows <- iris %>%
  filter(Sepal.Length > 5)

#view first six rows
head(specific_rows)

#  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#1          5.1         3.5          1.4         0.2  setosa
#2          5.4         3.9          1.7         0.4  setosa
#3          5.4         3.7          1.5         0.2  setosa
#4          5.8         4.0          1.2         0.2  setosa
#5          5.7         4.4          1.5         0.4  setosa
#6          5.4         3.9          1.3         0.4  setosa

mutate()

The following code illustrates how to add new columns to the iris dataset using the mutate() function:

#add two new columns to iris dataset
new_col <- iris %>%
  mutate(double_sep_length = Sepal.Length * 2,
         half_petal_width = Petal.Width / 2) 

#view first six rows
head(new_col)

#  Sepal.Length Sepal.Width Petal.Length Petal.Width Species double_sep_length
#1          5.1         3.5          1.4         0.2  setosa              10.2
#2          4.9         3.0          1.4         0.2  setosa               9.8
#3          4.7         3.2          1.3         0.2  setosa               9.4
#4          4.6         3.1          1.5         0.2  setosa               9.2
#5          5.0         3.6          1.4         0.2  setosa              10.0
#6          5.4         3.9          1.7         0.4  setosa              10.8

#  half_petal_width
#1              0.1
#2              0.1
#3              0.1
#4              0.1
#5              0.1
#6              0.2

group_by() & summarize()

The following code illustrates how to use the group_by() and summarize() functions to find the summary statistics of different groups in the iris dataset:

#group by Species and find min and max petal width for each Species
group_sum <- iris %>% 
  group_by(Species) %>%
  summarize(min = min(Petal.Width),
            max = max(Petal.Width))

#view results
group_sum

#  Species      min   max         
#1 setosa       0.1   0.6
#2 versicolor   1     1.8
#3 virginica    1.4   2.5

#----------------------------------------------------------------------------#

#group by Species and find mean and standard dev. of Sepal.Length for each Species
group_sum <- iris %>% 
  group_by(Species) %>%
  summarize(mean = mean(Sepal.Length),
            sd = sd(Sepal.Length))

#view results
group_sum

#  Species     mean    sd        
#1 setosa      5.01 0.352
#2 versicolor  5.94 0.516
#3 virginica   6.59 0.636

#----------------------------------------------------------------------------#

#group by Species and find count for each Species
group_sum <- iris %>% 
  group_by(Species) %>%
  summarize(count = n())

#view results
group_sum

#  Species    count       
#1 setosa        50
#2 versicolor    50
#3 virginica     50

arrange()

The following code illustrates how to sort the rows of the iris dataset based on column values using the arrange() function.

Note: By default, arrange() sorts by ascending order

#sort rows by Sepal.Length ascending
sorted <- iris %>%
  arrange(Sepal.Length)

#view first six rows
head(sorted)

#  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#1          4.3         3.0          1.1         0.1  setosa
#2          4.4         2.9          1.4         0.2  setosa
#3          4.4         3.0          1.3         0.2  setosa
#4          4.4         3.2          1.3         0.2  setosa
#5          4.5         2.3          1.3         0.3  setosa
#6          4.6         3.1          1.5         0.2  setosa

#----------------------------------------------------------------------------# 

#sort rows by Sepal.Length descending
sorted_desc <- iris %>%
  arrange(desc(Sepal.Length))

#view first six rows
head(sorted_desc)

#  Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
#1          7.9         3.8          6.4         2.0 virginica
#2          7.7         3.8          6.7         2.2 virginica
#3          7.7         2.6          6.9         2.3 virginica
#4          7.7         2.8          6.7         2.0 virginica
#5          7.7         3.0          6.1         2.3 virginica
#6          7.6         3.0          6.6         2.1 virginica

#----------------------------------------------------------------------------# 
#sort rows by Sepal.Length descending, then by Sepal.Width descending
sorted_multiple <- iris %>% 
  arrange(desc(Sepal.Length),
          desc(Sepal.Width))

#view first six rows
head(sorted_multiple)

#  Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
#1          7.9         3.8          6.4         2.0 virginica
#2          7.7         3.8          6.7         2.2 virginica
#3          7.7         3.0          6.1         2.3 virginica
#4          7.7         2.8          6.7         2.0 virginica
#5          7.7         2.6          6.9         2.3 virginica
#6          7.6         3.0          6.6         2.1 virginica

Using Several Functions Together

The following examples illustrate how to use several of the dplyr functions together to manipulate, aggregate, and analyze the iris dataset in a variety of different ways.

#find the mean Petal.Length and Petal.Width of the 'virginica' and 'setosa'
#Species, then sort by mean Petal.Length descending
two_means <- iris %>%
  filter(Species %in% c('virginica', 'setosa')) %>%
  group_by(Species) %>%
  summarize(mean_petal_length = mean(Petal.Length),
            mean_petal_width = mean(Petal.Width)) %>%
  arrange(desc(mean_petal_length))

#view first six rows
head(two_means)

#  Species   mean_petal_length mean_petal_width                               
#1 virginica              5.55            2.03 
#2 setosa                 1.46            0.246

#----------------------------------------------------------------------------# 
#create a new column that finds the average of the Petal.Width and Petal.Length
#for each 'virginica' Species where the Sepal.Length is less than 7
new_measure <- iris %>%
  filter(Species == 'virginica',
         Sepal.Length < 7) %>%
  mutate(mean_petal_measure = (Petal.Width+Petal.Length) / 2)

#view first six rows
head(new_measure)

#  Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
#1          6.3         3.3          6.0         2.5 virginica
#2          5.8         2.7          5.1         1.9 virginica
#3          6.3         2.9          5.6         1.8 virginica
#4          6.5         3.0          5.8         2.2 virginica
#5          4.9         2.5          4.5         1.7 virginica
#6          6.7         2.5          5.8         1.8 virginica

#  mean_petal_measure
#1               4.25
#2               3.50
#3               3.70
#4               4.00
#5               3.10
#6               3.80

Check out the dplyr Cheat Sheet for a visual guide to all of the dplyr functions.

Leave a Reply

Your email address will not be published. Required fields are marked *