Grubbs’ Test is a statistical test that can be used to identify the presence of outliers in a dataset.
In order to use this test, a dataset should be approximately normally distributed and have at least 7 observations.
This tutorial explains how to perform Grubbs’ Test in R to detect outliers in a dataset.
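Before running the test, it can help to check the two requirements mentioned above. The following is a minimal sketch in base R. It assumes the Shapiro-Wilk test (shapiro.test() from the stats package) as one possible normality check, which is our choice and not part of Grubbs' Test itself; the variable names x, enough_obs, and sw are illustrative. Keep in mind that a single extreme value can by itself cause a normality test to reject, so the result should be read with some care.

```r
#illustrative data
x <- c(5, 14, 15, 15, 14, 13, 19, 17, 16, 20, 22, 8, 21, 28, 11, 9, 29, 40)

#check the sample size requirement (at least 7 observations)
enough_obs <- length(x) >= 7

#Shapiro-Wilk test as one possible normality check (an assumption, not part of Grubbs' Test)
sw <- shapiro.test(x)

#view the results
enough_obs
sw$p.value
```

A small Shapiro-Wilk p-value suggests the data may not be normally distributed, in which case the results of Grubbs' Test are less reliable.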
Example: Grubbs’ Test in R
To perform Grubbs’ Test in R, we can use the grubbs.test() function from the outliers package (note that the package name is lowercase), which uses the following syntax:
grubbs.test(x, type = 10, opposite = FALSE, two.sided = FALSE)
- x: a numeric vector of data values
- type: 10 = test for one outlier (the value farthest from the mean), 11 = test for two outliers on opposite tails (the min and the max), 20 = test for two outliers on the same tail
- opposite: logical; if TRUE, test the extreme value on the opposite tail instead of the one farthest from the mean (e.g., the lowest value when the highest is the most suspicious)
- two.sided: logical value indicating whether to treat the test as two-sided
This test uses the following two hypotheses:
H0 (null hypothesis): There is no outlier in the data.
HA (alternative hypothesis): There is an outlier in the data.
The following example illustrates how to perform Grubbs’ Test to determine if the max value in a dataset is an outlier:
#load outliers package
library(outliers)

#create data
data <- c(5, 14, 15, 15, 14, 13, 19, 17, 16, 20, 22, 8, 21, 28, 11, 9, 29, 40)

#perform Grubbs' Test to see if '40' is an outlier
grubbs.test(data)

# Grubbs test for one outlier
#
#data: data
#G = 2.65990, U = 0.55935, p-value = 0.02398
#alternative hypothesis: highest value 40 is an outlier
The test statistic is G = 2.65990 and the corresponding p-value is p = 0.02398. Since the p-value is less than 0.05, we reject the null hypothesis and conclude that the max value of 40 is an outlier.
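To see where these numbers come from, the following sketch reproduces G and the p-value in base R. G is the distance between the largest value and the sample mean, measured in sample standard deviations. The p-value formula used here is the common t-distribution approximation for a one-sided Grubbs' Test; it is an assumption that grubbs.test() computes its p-value in the same way, although the result agrees closely with the output above. The variable names are illustrative.

```r
#same data as in the example above
x <- c(5, 14, 15, 15, 14, 13, 19, 17, 16, 20, 22, 8, 21, 28, 11, 9, 29, 40)
n <- length(x)

#G: distance of the largest value from the mean, in standard deviations
G <- (max(x) - mean(x)) / sd(x)

#convert G to a t statistic with n - 2 degrees of freedom
t_stat <- sqrt(n * (n - 2) * G^2 / ((n - 1)^2 - n * G^2))

#one-sided p-value (approximation), capped at 1
p_value <- min(1, n * pt(t_stat, df = n - 2, lower.tail = FALSE))

#view the results
round(G, 4)
p_value
```

The value of G should match the test output, and p_value should be close to the reported p-value.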
If we instead wanted to test whether the lowest value of ‘5’ was an outlier, we could use the opposite=TRUE argument:
#perform Grubbs' Test to see if '5' is an outlier
grubbs.test(data, opposite=TRUE)

# Grubbs test for one outlier
#
#data: data
#G = 1.4879, U = 0.8621, p-value = 1
#alternative hypothesis: lowest value 5 is an outlier
The test statistic is G = 1.4879 and the corresponding p-value is p = 1. Since this value is not less than 0.05, we fail to reject the null hypothesis. We do not have sufficient evidence to say that the minimum value of ‘5’ is an outlier.
Lastly, suppose we had two large values at one end of the dataset: 40 and 42. To test if both of these values are outliers, we could perform Grubbs’ Test and specify that type=20:
#create dataset with two large values at one end: 40 and 42
data <- c(5, 14, 15, 15, 14, 13, 19, 17, 16, 20, 22, 8, 21, 28, 11, 9, 29, 40, 42)

#perform Grubbs' Test to see if both 40 and 42 are outliers
grubbs.test(data, type=20)

# Grubbs test for two outliers
#
#data: data
#U = 0.38111, p-value = 0.01195
#alternative hypothesis: highest values 40 , 42 are outliers
The p-value of the test is 0.01195. Since this is less than 0.05, we can reject the null hypothesis and conclude that we have sufficient evidence to say the values 40 and 42 are both outliers.
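The U statistic in this output can be read as a ratio of sums of squared deviations: the spread of the data with the two suspected outliers removed, divided by the spread of the full dataset. The sketch below computes it in base R under that reading. This definition is our assumption, checked only against the printed value above, and the helper ss() and the variable names are illustrative. A small U means that removing the two values shrinks the variability considerably, which is what we would expect if they are outliers.

```r
#same data as in the example above
x <- c(5, 14, 15, 15, 14, 13, 19, 17, 16, 20, 22, 8, 21, 28, 11, 9, 29, 40, 42)

#helper: sum of squared deviations from the mean
ss <- function(v) sum((v - mean(v))^2)

#drop the two largest values
trimmed <- sort(x)[seq_len(length(x) - 2)]

#ratio of the trimmed spread to the full spread
U <- ss(trimmed) / ss(x)

#view the result
round(U, 5)
```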
How to Handle Outliers
If Grubbs’ Test does identify an outlier in your dataset, you have a few options:
1. Double check to make sure that the value is not a typo or a data entry error. Occasionally, values that show up as outliers in datasets are simply typos made by an individual when entering the data. Go back and verify that the value was entered correctly before you make any further decisions.
2. Assign a new value to the outlier. If the outlier turns out to be a result of a typo or data entry error, you may decide to assign a new value to it, such as the mean or the median of the dataset.
3. Remove the outlier. If the value is a true outlier, you may choose to remove it if it will have a significant impact on your overall analysis.
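Options 2 and 3 can be sketched in base R as follows. This is a minimal illustration that assumes the largest value has already been confirmed as the outlier; the names idx, replaced, and removed are illustrative, and using the median as the replacement value is just one of the choices mentioned above.

```r
#data from the first example, where 40 was flagged as an outlier
data <- c(5, 14, 15, 15, 14, 13, 19, 17, 16, 20, 22, 8, 21, 28, 11, 9, 29, 40)

#position of the largest value
idx <- which.max(data)

#option 2: assign a new value (here, the median of the dataset)
replaced <- data
replaced[idx] <- median(data)

#option 3: remove the outlier
removed <- data[-idx]

#view the results
replaced
removed
```

Both approaches leave the original data vector untouched, so the two versions can be compared before deciding which one to carry into the rest of the analysis.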