You can use the following methods to calculate correlation coefficients in R when one or more variables have missing values:
Method 1: Calculate Correlation Coefficient with Missing Values Present
cor(x, y, use='complete.obs')
Method 2: Calculate Correlation Matrix with Missing Values Present
cor(df, use='pairwise.complete.obs')
The following examples show how to use each method in practice.
Example 1: Calculate Correlation Coefficient with Missing Values Present
Suppose we attempt to use the cor() function to calculate the Pearson correlation coefficient between two variables when missing values are present:
#create two variables
x <- c(70, 78, 90, 87, 84, NA, 91, 74, 83, 85)
y <- c(90, NA, 79, 86, 84, 83, 88, 92, 76, 75)
#attempt to calculate correlation coefficient between x and y
cor(x, y)
[1] NA
The cor() function returns NA because its default setting, use='everything', propagates missing values: if either variable contains an NA, the result is NA.
To avoid this issue, we can use the argument use='complete.obs' so that R only uses the observations where both x and y are present:
#create two variables
x <- c(70, 78, 90, 87, 84, NA, 91, 74, 83, 85)
y <- c(90, NA, 79, 86, 84, 83, 88, 92, 76, 75)
#calculate correlation coefficient between x and y
cor(x, y, use='complete.obs')
[1] -0.4888749
The correlation coefficient between the two variables turns out to be -0.4888749.
Note that the cor() function only used the 8 of the 10 observations where both values were present when calculating the correlation coefficient.
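As a quick check, the same result can be reproduced by removing the incomplete pairs by hand. The sketch below uses base R's complete.cases() function; the names ok, r_builtin, and r_manual are introduced here only for illustration.

```r
#create two variables
x <- c(70, 78, 90, 87, 84, NA, 91, 74, 83, 85)
y <- c(90, NA, 79, 86, 84, 83, 88, 92, 76, 75)

#flag the observations where both x and y are present
ok <- complete.cases(x, y)

#count how many observations are kept
sum(ok)

#correlation from cor() with use='complete.obs'
r_builtin <- cor(x, y, use='complete.obs')

#correlation after removing the incomplete pairs manually
r_manual <- cor(x[ok], y[ok])

#check that the two approaches agree
all.equal(r_builtin, r_manual)
```

If the two values agree, it confirms that use='complete.obs' simply discards any observation with a missing value before the correlation is computed.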
Example 2: Calculate Correlation Matrix with Missing Values Present
Suppose we attempt to use the cor() function to create a correlation matrix for a data frame with three variables when missing values are present:
#create data frame with some missing values
df <- data.frame(x=c(70, 78, 90, 87, 84, NA, 91, 74, 83, 85),
                 y=c(90, NA, 79, 86, 84, 83, 88, 92, 76, 75),
                 z=c(57, 57, 58, 59, 60, 78, 81, 83, NA, 90))
#attempt to create correlation matrix for variables in data frame
cor(df)
   x  y  z
x  1 NA NA
y NA  1 NA
z NA NA  1
The cor() function returns NA for every pair of distinct variables since each variable contains a missing value and we didn't specify how to handle missing values.
To avoid this issue, we can use the argument use='pairwise.complete.obs' so that, for each pair of variables, R only uses the observations where both values are present:
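Before choosing how to handle the missing values, it can help to see where they are. The following is a minimal sketch using base R's is.na() and complete.cases() functions; the names na_counts and na_rows are hypothetical and used only for illustration.

```r
#create data frame with some missing values
df <- data.frame(x=c(70, 78, 90, 87, 84, NA, 91, 74, 83, 85),
                 y=c(90, NA, 79, 86, 84, 83, 88, 92, 76, 75),
                 z=c(57, 57, 58, 59, 60, 78, 81, 83, NA, 90))

#count the missing values in each column
na_counts <- colSums(is.na(df))
na_counts

#find the rows that contain at least one missing value
na_rows <- which(!complete.cases(df))
na_rows
```

Any column with a count above zero will produce NA entries in the matrix returned by cor(df) under the default settings.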
#create data frame with some missing values
df <- data.frame(x=c(70, 78, 90, 87, 84, NA, 91, 74, 83, 85),
                 y=c(90, NA, 79, 86, 84, 83, 88, 92, 76, 75),
                 z=c(57, 57, 58, 59, 60, 78, 81, 83, NA, 90))
#create correlation matrix for variables using only pairwise complete observations
cor(df, use='pairwise.complete.obs')
           x          y          z
x  1.0000000 -0.4888749  0.1311651
y -0.4888749  1.0000000 -0.1562371
z  0.1311651 -0.1562371  1.0000000
The correlation coefficients for each pairwise combination of variables in the data frame are now shown.
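For a data frame, use='complete.obs' and use='pairwise.complete.obs' are not interchangeable. The first drops every row that has a missing value in any column, while the second drops rows separately for each pair of variables. The sketch below illustrates the difference with base R functions only; the names n_pairwise, r_listwise, r_na_omit, and r_pairwise are introduced here for illustration.

```r
#create data frame with some missing values
df <- data.frame(x=c(70, 78, 90, 87, 84, NA, 91, 74, 83, 85),
                 y=c(90, NA, 79, 86, 84, 83, 88, 92, 76, 75),
                 z=c(57, 57, 58, 59, 60, 78, 81, 83, NA, 90))

#count the observations available for each pair of variables
n_pairwise <- crossprod(!is.na(df))
n_pairwise

#count the rows with no missing values in any column
nrow(na.omit(df))

#use='complete.obs' only uses rows with no missing values in any column
r_listwise <- cor(df, use='complete.obs')

#this is the same as removing incomplete rows first
r_na_omit <- cor(na.omit(df))
all.equal(r_listwise, r_na_omit)

#use='pairwise.complete.obs' keeps more observations for each pair
r_pairwise <- cor(df, use='pairwise.complete.obs')

#compare the two matrices
r_listwise
r_pairwise
```

With this data, each pair of variables has 8 usable observations, but only 7 rows are complete across all three columns, so the two matrices differ. The pairwise option uses more of the data, while the complete-case option guarantees that every coefficient is based on the same set of rows.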
Additional Resources
The following tutorials explain how to perform other common tasks in R:
How to Find the P-value of Correlation Coefficient in R
How to Calculate Spearman Correlation in R
How to Calculate Rolling Correlation in R