# How to Use the findCorrelation() Function in R

Often you may want to create a correlation matrix in R and determine which variables need to be removed to reduce pairwise correlation between the variables.

One way to do so is with the findCorrelation() function from the caret package, which takes a correlation matrix and returns the columns to remove so that the remaining pairwise correlations do not exceed a chosen cutoff.

The findCorrelation() function uses the following syntax:

findCorrelation(x, cutoff = 0.9, verbose = FALSE, names = FALSE, exact = ncol(x) < 100)

where:

• x: A correlation matrix
• cutoff: The absolute pairwise correlation above which a variable is flagged for removal (default is 0.9)
• verbose: Whether to print details of each comparison
• names: Whether to return column names (TRUE) or column indices (FALSE)
• exact: Whether average correlation should be recomputed at each step

The following example shows how to use the findCorrelation() function in practice in R.

## Example: How to Use the findCorrelation() Function in R

Suppose that we create a data frame in R that contains information about various basketball players including their total points, assists, rebounds and steals:

```r
#create data frame
df <- data.frame(points=c(8, 10, 14, 14, 13, 28, 20, 24, 28, 30, 34, 40),
                 assists=c(3, 8, 8, 6, 10, 14, 8, 17, 13, 9, 10, 11),
                 rebounds=c(10, 8, 8, 7, 9, 5, 8, 6, 5, 4, 3, 3),
                 steals=c(2, 4, 4, 5, 3, 6, 7, 5, 7, 7, 9, 12))

#view data frame
df

   points assists rebounds steals
1       8       3       10      2
2      10       8        8      4
3      14       8        8      4
4      14       6        7      5
5      13      10        9      3
6      28      14        5      6
7      20       8        8      7
8      24      17        6      5
9      28      13        5      7
10     30       9        4      7
11     34      10        3      9
12     40      11        3     12
```

Suppose that we would like to create a correlation matrix to view the correlation coefficient between each pairwise combination of variables in the data frame.

We can use the cor() function to do so:

```r
#calculate correlation matrix for data frame
my_cor <- cor(df)

#view correlation matrix
my_cor

             points    assists   rebounds     steals
points    1.0000000  0.5702686 -0.9520776  0.9183250
assists   0.5702686  1.0000000 -0.5306601  0.3448833
rebounds -0.9520776 -0.5306601  1.0000000 -0.8694698
steals    0.9183250  0.3448833 -0.8694698  1.0000000
```

The matrix shows, for example, that points and rebounds are strongly negatively correlated (-0.952), while points and steals are strongly positively correlated (0.918).
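To see at a glance which pairs are above a given threshold, the matrix can be filtered with base R. The following is a minimal sketch that lists every pair whose absolute correlation exceeds 0.9; the names high and high_pairs are illustrative, and the data frame is recreated so that the snippet runs on its own.

```r
#recreate the example data frame so this snippet runs on its own
df <- data.frame(points=c(8, 10, 14, 14, 13, 28, 20, 24, 28, 30, 34, 40),
                 assists=c(3, 8, 8, 6, 10, 14, 8, 17, 13, 9, 10, 11),
                 rebounds=c(10, 8, 8, 7, 9, 5, 8, 6, 5, 4, 3, 3),
                 steals=c(2, 4, 4, 5, 3, 6, 7, 5, 7, 7, 9, 12))
my_cor <- cor(df)

#locate cells above the threshold (upper triangle only, so each pair appears once)
high <- which(abs(my_cor) > 0.9 & upper.tri(my_cor), arr.ind=TRUE)

#build a readable table of the flagged pairs (high_pairs is an illustrative name)
high_pairs <- data.frame(var1=rownames(my_cor)[high[, "row"]],
                         var2=colnames(my_cor)[high[, "col"]],
                         corr=round(my_cor[high], 3))

#view flagged pairs
high_pairs
```

With the example data, both pairs that exceed 0.9 in absolute value involve the points column.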

Suppose that we would like to find which column(s) would need to be removed to reduce pairwise correlations between variables in the data frame.

We can use the findCorrelation() function from the caret package to do so:

```r
library(caret)

#find columns to remove to reduce pairwise correlations
findCorrelation(my_cor, verbose=TRUE, names=TRUE)

Compare row 1  and column  3 with corr  0.952
Means:  0.814 vs 0.669 so flagging column 1
All correlations <= 0.9
[1] "points"
```

This returns the column “points”, which tells us that we could remove this variable from the data frame to reduce overall pairwise correlation between the variables in the data frame.
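If the goal is to actually drop the flagged column, one possible approach is to request column indices (names=FALSE, the default) and use them to subset the data frame. This is a sketch rather than the only way to do it; drop_idx and df_reduced are illustrative names, and the data frame is recreated so that the snippet runs on its own.

```r
library(caret)

#recreate the example data frame so this snippet runs on its own
df <- data.frame(points=c(8, 10, 14, 14, 13, 28, 20, 24, 28, 30, 34, 40),
                 assists=c(3, 8, 8, 6, 10, 14, 8, 17, 13, 9, 10, 11),
                 rebounds=c(10, 8, 8, 7, 9, 5, 8, 6, 5, 4, 3, 3),
                 steals=c(2, 4, 4, 5, 3, 6, 7, 5, 7, 7, 9, 12))
my_cor <- cor(df)

#with names=FALSE (the default), column indices are returned instead of names
drop_idx <- findCorrelation(my_cor, cutoff=0.9)

#guard against an empty result: df[, -integer(0)] would select zero columns
df_reduced <- if (length(drop_idx) > 0) df[, -drop_idx, drop=FALSE] else df

#view remaining columns
names(df_reduced)

#largest remaining absolute pairwise correlation
reduced_cor <- cor(df_reduced)
max(abs(reduced_cor[upper.tri(reduced_cor)]))
```

The length check matters because negative indexing with an empty vector returns a data frame with no columns rather than the original data frame.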

Note that we specified the argument verbose=TRUE so that we could receive some explanation of why the “points” column is returned from this function.

In simple terms, the findCorrelation() function found that “points” and “rebounds” have an absolute correlation of 0.952, which exceeds the default cutoff of 0.9. It flagged “points” because the mean absolute correlation for “points” (0.814) was larger than the mean it was compared against (0.669). Once “points” is removed, no remaining pair has an absolute correlation above 0.9, so no other column is flagged.
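The two means printed in the verbose output can be reproduced with base R. The sketch below computes the mean absolute correlation of each variable with the others, which gives the 0.814 for “points”. Matching the second figure (0.669) to the mean over all rows other than “rebounds” is an assumption based on the numbers lining up, not a statement of how caret computes it internally; mean_abs and comparison_mean are illustrative names.

```r
#recreate the example data frame so this snippet runs on its own
df <- data.frame(points=c(8, 10, 14, 14, 13, 28, 20, 24, 28, 30, 34, 40),
                 assists=c(3, 8, 8, 6, 10, 14, 8, 17, 13, 9, 10, 11),
                 rebounds=c(10, 8, 8, 7, 9, 5, 8, 6, 5, 4, 3, 3),
                 steals=c(2, 4, 4, 5, 3, 6, 7, 5, 7, 7, 9, 12))
my_cor <- cor(df)

#use absolute values and blank out the diagonal (each variable vs itself)
abs_cor <- abs(my_cor)
diag(abs_cor) <- NA

#mean absolute correlation of each variable with the other variables
mean_abs <- round(rowMeans(abs_cor, na.rm=TRUE), 3)
mean_abs

#assumption: mean over every row except rebounds, for comparison with the verbose output
comparison_mean <- round(mean(abs_cor[rownames(abs_cor) != "rebounds", ], na.rm=TRUE), 3)
comparison_mean
```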

In practice, there are many scenarios where having highly correlated variables in a data frame is not desirable, especially when building linear regression models, where highly correlated predictors (multicollinearity) can make coefficient estimates unstable.

Thus, this function is often used to identify the variables that we could remove from a data frame to avoid problems that are associated with high correlation between variables.
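Which variables are flagged depends on the cutoff that is chosen. As a rough sketch, the same correlation matrix can be passed through findCorrelation() with two different cutoff values to compare the results; flagged_090, flagged_085 and max_abs_cor() are illustrative names that are not part of caret, and the data frame is recreated so that the snippet runs on its own.

```r
library(caret)

#recreate the example data frame so this snippet runs on its own
df <- data.frame(points=c(8, 10, 14, 14, 13, 28, 20, 24, 28, 30, 34, 40),
                 assists=c(3, 8, 8, 6, 10, 14, 8, 17, 13, 9, 10, 11),
                 rebounds=c(10, 8, 8, 7, 9, 5, 8, 6, 5, 4, 3, 3),
                 steals=c(2, 4, 4, 5, 3, 6, 7, 5, 7, 7, 9, 12))
my_cor <- cor(df)

#flag columns using the default cutoff and a stricter one
flagged_090 <- findCorrelation(my_cor, cutoff=0.9, names=TRUE)
flagged_085 <- findCorrelation(my_cor, cutoff=0.85, names=TRUE)

#view flagged columns for each cutoff
flagged_090
flagged_085

#illustrative helper: largest absolute correlation among the columns that are kept
max_abs_cor <- function(cor_mat, drop) {
  keep <- setdiff(colnames(cor_mat), drop)
  sub <- abs(cor_mat[keep, keep, drop=FALSE])
  max(sub[upper.tri(sub)])
}

#check what remains after removing the flagged columns
max_abs_cor(my_cor, flagged_090)
max_abs_cor(my_cor, flagged_085)
```

A lower cutoff is stricter, so it is worth checking how many columns remain before settling on a value.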