How to Deal with Missing Values in R

A guide to dealing with missing values in R

When working with datasets, you’ll often encounter missing values. In R, missing values are represented as NA, although some datasets instead code them with a placeholder value such as 999.

Fortunately, it’s easy to identify, modify, and exclude missing values in R. This tutorial explains how to do each.

How to Identify Missing Values

The most straightforward way to identify missing values in a vector, list, matrix, or data frame is by using is.na(), which returns TRUE or FALSE for each value in the data structure.

The following code shows how to test whether each value in a vector is missing or not:

#create a vector x with some missing values
x <- c(4, NA, 12, NA, 19, 34, 22)

#test whether each value is missing or not
is.na(x)

#[1] FALSE  TRUE FALSE  TRUE FALSE FALSE FALSE

#identify the position of the missing values in the vector
which(is.na(x))

#[1] 2 4

#identify total number of missing values in the vector
sum(is.na(x))

#[1] 2

This next chunk of code shows how to test whether each value in a data frame is missing or not:

#create a data frame with three columns and five rows
data <- data.frame(a = c(2, 5, 12, NA, NA),
                   b = c(15, NA, 4, 3, 1),
                   c = c(21, 23, NA, 9, 4))
data

#   a  b  c
#1  2 15 21
#2  5 NA 23
#3 12  4 NA
#4 NA  3  9
#5 NA  1  4

#identify all missing values in entire data frame
is.na(data)

#       a     b     c
#[1,] FALSE FALSE FALSE
#[2,] FALSE  TRUE FALSE
#[3,] FALSE FALSE  TRUE
#[4,]  TRUE FALSE FALSE
#[5,]  TRUE FALSE FALSE

#identify total number of missing values in entire data frame
sum(is.na(data))

#[1] 4

#identify total number of missing values in column named 'c'
sum(is.na(data$c))

#[1] 1

#identify total number of missing values in each column
colSums(is.na(data))

#a b c 
#2 1 1 
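The same idea can be applied across rows. The following is a minimal sketch using the base R functions anyNA() and rowSums() on the data frame from above; the variable names has_missing and missing_per_row are made up for illustration:

```r
#re-create the data frame from above
data <- data.frame(a = c(2, 5, 12, NA, NA),
                   b = c(15, NA, 4, 3, 1),
                   c = c(21, 23, NA, 9, 4))

#check whether the data frame contains any missing values at all
#(has_missing is an illustrative name)
has_missing <- anyNA(data)
has_missing

#[1] TRUE

#count the missing values in each row
#(missing_per_row is an illustrative name)
missing_per_row <- rowSums(is.na(data))
missing_per_row

#[1] 0 1 1 1 1
```

A row count like this can help you decide whether dropping incomplete rows would discard too much data.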

How to Modify Missing Values

Once you’ve identified missing values in a vector or data frame, you may then want to replace the missing values with some other value like the median or the mean. 

The following code shows how to replace missing values in a vector:

#create a vector x with some missing values
x <- c(4, NA, 12, NA, 19, 34, 22)

#replace missing values with the mean value of the vector
x[is.na(x)] <- mean(x, na.rm = TRUE)
x

#[1]  4.0 18.2 12.0 18.2 19.0 34.0 22.0

#alternatively, re-create the vector and replace missing values with the median
x <- c(4, NA, 12, NA, 19, 34, 22)
x[is.na(x)] <- median(x, na.rm = TRUE)
x

#[1]  4 19 12 19 19 34 22

This next chunk of code shows how to replace missing values in a data frame:

#create a data frame with missing values coded as 999
data <- data.frame(a = c(2, 5, 12, 999, 999),
                   b = c(15, 999, 4, 3, 1), 
                   c = c(21, 23, 999, 9, 4))
data

#    a   b   c
#1   2  15  21
#2   5 999  23
#3  12   4 999
#4 999   3   9
#5 999   1   4

#replace the 999 codes with NA
data[data == 999] <- NA
data

#   a  b  c
#1  2 15 21
#2  5 NA 23
#3 12  4 NA
#4 NA  3  9
#5 NA  1  4

#replace missing values in column 'c' with the average value of column 'c'
data$c[is.na(data$c)] <- mean(data$c, na.rm = TRUE)
data

#   a  b     c
#1  2 15 21.00
#2  5 NA 23.00
#3 12  4 14.25
#4 NA  3  9.00
#5 NA  1  4.00
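If you would like to fill in every column at once, one possible approach is to loop over the columns with lapply(). The sketch below assumes that every column is numeric and that the column mean is a sensible replacement, which may not hold for your data; the argument name col is just an illustrative choice:

```r
#re-create the data frame with missing values
data <- data.frame(a = c(2, 5, 12, NA, NA),
                   b = c(15, NA, 4, 3, 1),
                   c = c(21, 23, NA, 9, 4))

#replace missing values in each column with that column's mean
#(assumes every column is numeric)
data[] <- lapply(data, function(col) {
  col[is.na(col)] <- mean(col, na.rm = TRUE)
  col
})

#confirm that no missing values remain
sum(is.na(data))

#[1] 0

#the missing value in column 'b' is now the mean of 15, 4, 3, and 1
data$b[2]

#[1] 5.75
```

Assigning to data[] rather than data keeps the result as a data frame instead of a plain list.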

How to Exclude Missing Values

If you would instead like to exclude missing values, rather than modify or replace them, then you can do so fairly easily in R.

When performing operations on vectors (like finding the sum, mean, median, max, min, etc.), the easiest way to exclude missing values is to use the argument na.rm = TRUE. If missing values are present and you omit this argument, these functions return NA.

The following code illustrates a couple of examples of using na.rm = TRUE to find the mean and the median of a vector:

#create a vector x with some missing values
x <- c(4, NA, 12, NA, 19, 34, 22) 

#find mean of vector without using na.rm = TRUE
mean(x)

#[1] NA

#find mean of vector using na.rm = TRUE to exclude missing values
mean(x, na.rm = TRUE)

#[1] 18.2

#find median of vector without using na.rm = TRUE
median(x)

#[1] NA

#find median of vector using na.rm = TRUE to exclude missing values
median(x, na.rm = TRUE)

#[1] 19
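The same argument works for the other summary functions mentioned above. As a brief sketch, here it is applied to sum(), max(), and min(); the variable names total, largest, and smallest are chosen only for illustration:

```r
#create a vector x with some missing values
x <- c(4, NA, 12, NA, 19, 34, 22)

#find sum of vector using na.rm = TRUE to exclude missing values
total <- sum(x, na.rm = TRUE)
total

#[1] 91

#find max of vector using na.rm = TRUE to exclude missing values
largest <- max(x, na.rm = TRUE)
largest

#[1] 34

#find min of vector using na.rm = TRUE to exclude missing values
smallest <- min(x, na.rm = TRUE)
smallest

#[1] 4
```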

When dealing with data frames, you may only want to look at the rows where there are no missing values. There are two common methods for excluding rows that contain missing values:

complete.cases() – returns a logical vector indicating which rows of a data frame are “complete”, i.e. have no missing values. Use it to subset the data frame as follows:

data[complete.cases(data), ]

na.omit() – omits any rows in a data frame that have missing values. The syntax for using this is as follows:

na.omit(data)

Notice that both approaches keep the same rows.

The following code illustrates these two functions in action:

#create a data frame with some missing values
data <- data.frame(a = c(2, 5, 12, NA, NA),
                   b = c(15, NA, 4, 3, 1),
                   c = c(21, 23, NA, 9, 4))
data

#   a  b  c
#1  2 15 21
#2  5 NA 23
#3 12  4 NA
#4 NA  3  9
#5 NA  1  4

#only look at rows where there is no missing data
data[complete.cases(data), ]

#  a  b  c
#1 2 15 21

#omit all rows with missing data
na.omit(data)

#  a  b  c
#1 2 15 21
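Because complete.cases() returns a logical vector, it can also be applied to only some of the columns. The following sketch keeps rows that are complete in columns 'a' and 'b' while ignoring missing values in column 'c'; the name subset_complete is made up for this example:

```r
#re-create the data frame with some missing values
data <- data.frame(a = c(2, 5, 12, NA, NA),
                   b = c(15, NA, 4, 3, 1),
                   c = c(21, 23, NA, 9, 4))

#keep rows with no missing values in columns 'a' and 'b' only
#(subset_complete is an illustrative name)
subset_complete <- data[complete.cases(data[, c("a", "b")]), ]
subset_complete

#   a  b  c
#1  2 15 21
#3 12  4 NA
```

This can be useful when an analysis only depends on a few columns and you don’t want to discard rows because of missing values elsewhere.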
