Often you may want to create a correlation matrix in R and determine which variables to remove in order to reduce the pairwise correlation among the remaining variables.

One way to do so is with the **findCorrelation()** function from the **caret** package in R, which searches a correlation matrix and returns the columns to remove to reduce pairwise correlations.

The **findCorrelation()** function uses the following syntax:

**findCorrelation(x, cutoff = 0.9, verbose = FALSE, names = FALSE, exact = ncol(x) < 100)**

where:

**x**: The correlation matrix

**cutoff**: The absolute pairwise correlation above which a column is considered for removal

**verbose**: Whether to print details or not

**names**: Whether to return column names (TRUE) or column positions (FALSE)

**exact**: Whether the average correlations should be recomputed at each step
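As a quick illustration of these arguments, here is a minimal sketch that uses R's built-in mtcars dataset. The dataset is an assumption made only so the snippet runs on its own, and the object names cor_mat, by_index and by_name are hypothetical.

```r
library(caret)

# correlation matrix for the built-in mtcars dataset (illustrative data only)
cor_mat <- cor(mtcars)

# default: names = FALSE returns the positions of the flagged columns
by_index <- findCorrelation(cor_mat, cutoff = 0.9)

# names = TRUE returns the corresponding column names instead
by_name <- findCorrelation(cor_mat, cutoff = 0.9, names = TRUE)

# both calls describe the same columns, just reported differently
identical(colnames(cor_mat)[by_index], by_name)

# a lower cutoff treats more pairs as "highly correlated"
findCorrelation(cor_mat, cutoff = 0.75, names = TRUE)
```

Returning names is usually easier to read, while positions can be convenient when the result is used directly for column indexing.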

The following example shows how to use the **findCorrelation()** function in practice in R.

**Example: How to Use the findCorrelation() Function in R**

Suppose that we create a data frame in R that contains information about various basketball players including their total points, assists, rebounds and steals:

**#create data frame
df <- data.frame(points=c(8, 10, 14, 14, 13, 28, 20, 24, 28, 30, 34, 40),
                 assists=c(3, 8, 8, 6, 10, 14, 8, 17, 13, 9, 10, 11),
                 rebounds=c(10, 8, 8, 7, 9, 5, 8, 6, 5, 4, 3, 3),
                 steals=c(2, 4, 4, 5, 3, 6, 7, 5, 7, 7, 9, 12))

#view data frame
df

   points assists rebounds steals
1       8       3       10      2
2      10       8        8      4
3      14       8        8      4
4      14       6        7      5
5      13      10        9      3
6      28      14        5      6
7      20       8        8      7
8      24      17        6      5
9      28      13        5      7
10     30       9        4      7
11     34      10        3      9
12     40      11        3     12**

Suppose that we would like to create a correlation matrix to view the correlation coefficient between each pairwise combination of variables in the data frame.

We can use the **cor()** function to do so:

**#calculate correlation matrix for data frame
my_cor <- cor(df)

#view correlation matrix
my_cor

             points    assists   rebounds     steals
points    1.0000000  0.5702686 -0.9520776  0.9183250
assists   0.5702686  1.0000000 -0.5306601  0.3448833
rebounds -0.9520776 -0.5306601  1.0000000 -0.8694698
steals    0.9183250  0.3448833 -0.8694698  1.0000000**

Each cell of this matrix holds the correlation coefficient for one pair of variables. For example, points and rebounds have a strong negative correlation of -0.952, while assists and steals are only weakly correlated at 0.345.
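Before turning to caret, it can help to see which pairs exceed a given threshold. The following is a minimal base R sketch of that check. The 0.9 threshold is chosen to match the default cutoff of findCorrelation(), and the names cutoff, high and high_pairs are illustrative rather than taken from any package.

```r
# recreate the data frame and correlation matrix from above
df <- data.frame(points=c(8, 10, 14, 14, 13, 28, 20, 24, 28, 30, 34, 40),
                 assists=c(3, 8, 8, 6, 10, 14, 8, 17, 13, 9, 10, 11),
                 rebounds=c(10, 8, 8, 7, 9, 5, 8, 6, 5, 4, 3, 3),
                 steals=c(2, 4, 4, 5, 3, 6, 7, 5, 7, 7, 9, 12))
my_cor <- cor(df)

# threshold on the absolute correlation (assumed value for illustration)
cutoff <- 0.9

# positions in the upper triangle whose absolute correlation exceeds the cutoff
high <- which(abs(my_cor) > cutoff & upper.tri(my_cor), arr.ind = TRUE)

# report each pair by name along with its correlation
high_pairs <- data.frame(var1 = rownames(my_cor)[high[, "row"]],
                         var2 = colnames(my_cor)[high[, "col"]],
                         corr = my_cor[high])
high_pairs
```

For this data, two pairs exceed the threshold: points with rebounds, and points with steals. Restricting the search to the upper triangle avoids listing each pair twice and skips the diagonal.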

Suppose that we would like to find which column(s) would need to be removed to reduce pairwise correlations between variables in the data frame.

We can use the **findCorrelation()** function from the **caret** package to do so:

**library(caret)

#find columns to remove to reduce pairwise correlations
findCorrelation(my_cor, verbose=TRUE, names=TRUE)

Compare row 1 and column 3 with corr 0.952
Means: 0.814 vs 0.669 so flagging column 1
All correlations <= 0.9
[1] "points"**

This returns the column “points”, which tells us that we could remove this variable from the data frame to reduce overall pairwise correlation between the variables in the data frame.

Note that we specified the argument **verbose=TRUE** so that we could receive some explanation of why the “points” column is returned from this function.

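In simple terms, the verbose output shows that “points” and “rebounds” (column 3) have an absolute correlation of 0.952, which exceeds the default cutoff of 0.9. Of the two, “points” was flagged because its mean absolute correlation with the other variables (0.814) is higher than the comparison mean of 0.669. Once “points” is removed, no remaining pair has an absolute correlation above 0.9, so removing this one column reduces the overall pairwise correlation between variables in the data frame.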
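The first mean in the verbose output can be reproduced by hand. The following base R sketch assumes that this figure is the average absolute correlation between a column and every other column. That reading is consistent with the 0.814 printed for “points”, but it is an interpretation of the output rather than a restatement of caret's internals. The names abs_cor and mean_abs_cor are illustrative.

```r
# recreate the data frame and correlation matrix from above
df <- data.frame(points=c(8, 10, 14, 14, 13, 28, 20, 24, 28, 30, 34, 40),
                 assists=c(3, 8, 8, 6, 10, 14, 8, 17, 13, 9, 10, 11),
                 rebounds=c(10, 8, 8, 7, 9, 5, 8, 6, 5, 4, 3, 3),
                 steals=c(2, 4, 4, 5, 3, 6, 7, 5, 7, 7, 9, 12))
my_cor <- cor(df)

# work with absolute values and ignore each variable's correlation with itself
abs_cor <- abs(my_cor)
diag(abs_cor) <- NA

# average absolute correlation of each column with the remaining columns
mean_abs_cor <- round(colMeans(abs_cor, na.rm = TRUE), 3)
mean_abs_cor
#   points  assists rebounds   steals
#    0.814    0.482    0.784    0.711
```

Under this reading, “points” has the highest average absolute correlation of the four variables, which fits with it being the column that was flagged.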

In practice, there are many scenarios where having highly correlated variables in a data frame is not desirable, especially when building linear regression models, where highly correlated predictors (multicollinearity) make coefficient estimates unstable.

Thus, this function is often used to identify the variables that we could remove from a data frame to avoid problems that are associated with high correlation between variables.
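As a minimal sketch of that last step, the flagged columns can be dropped with ordinary base R subsetting. The names to_drop, df_reduced and reduced_cor are illustrative.

```r
library(caret)

# recreate the data frame and correlation matrix from above
df <- data.frame(points=c(8, 10, 14, 14, 13, 28, 20, 24, 28, 30, 34, 40),
                 assists=c(3, 8, 8, 6, 10, 14, 8, 17, 13, 9, 10, 11),
                 rebounds=c(10, 8, 8, 7, 9, 5, 8, 6, 5, 4, 3, 3),
                 steals=c(2, 4, 4, 5, 3, 6, 7, 5, 7, 7, 9, 12))
my_cor <- cor(df)

# names of the columns flagged at the default cutoff of 0.9
to_drop <- findCorrelation(my_cor, cutoff = 0.9, names = TRUE)

# keep every column that was not flagged
df_reduced <- df[, !(names(df) %in% to_drop), drop = FALSE]

# largest remaining absolute pairwise correlation
reduced_cor <- cor(df_reduced)
max(abs(reduced_cor[upper.tri(reduced_cor)]))
```

Here the remaining columns are assists, rebounds and steals, and the largest remaining absolute correlation is about 0.869 (rebounds and steals), which is below the cutoff. Subsetting by name with %in% is used instead of negative indices because negative indexing with an empty result would drop every column if nothing were flagged.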
