How to Winsorize Data in R


To winsorize data means to replace extreme values with the value at a specified percentile of the data.

For example, a 90% winsorization sets all observations greater than the 95th percentile equal to the value at the 95th percentile and all observations less than the 5th percentile equal to the value at the 5th percentile.
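The idea can be sketched in base R without any extra packages. The following is only an illustration: the vector x is made up for this sketch, and the cutoffs come from R's default quantile algorithm.

```r
# Hypothetical sample vector (made up for this sketch)
x <- c(1, 5, 6, 7, 8, 9, 10, 11, 12, 50)

# 5th and 95th percentile cutoffs (default quantile algorithm, type 7)
cutoffs <- quantile(x, probs = c(0.05, 0.95), names = FALSE)
lower <- cutoffs[1]
upper <- cutoffs[2]

# Raise values below the lower cutoff and lower values above the upper cutoff
x_wins <- pmin(pmax(x, lower), upper)

x_wins
# 2.8  5.0  6.0  7.0  8.0  9.0 10.0 11.0 12.0 32.9
```

Only the smallest and largest values change here; every value already between the two cutoffs is left as it is.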

The easiest way to winsorize data in R is by using the Winsorize() function from the DescTools package, which is designed to perform this exact task.

This function uses the following basic syntax:

Winsorize(x, minval = NULL, maxval = NULL, probs = c(0.05, 0.95), na.rm = FALSE, type = 7)

where:

  • x: Numeric vector to winsorize
  • minval: All values lower than this value will be replaced by this value (default is the 5% quantile of x)
  • maxval: All values larger than this value will be replaced by this value (default is the 95% quantile of x)
  • probs: Numeric vector of probabilities, as used in the quantile() function
  • na.rm: Whether to omit NA values when calculating quantiles
  • type: An integer between 1 and 9 that selects which of the nine quantile algorithms described in the quantile() documentation to use
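To see how probs, na.rm and type interact, here is a minimal base R sketch. The helper winsorize_sketch() is hypothetical and is not the DescTools implementation; it only passes the three arguments to quantile() and clamps the vector at the resulting cutoffs. Argument names in Winsorize() itself may differ between DescTools releases, so it is worth checking ?Winsorize for the installed version.

```r
# Hypothetical helper (not the DescTools source) mirroring probs, na.rm and type
winsorize_sketch <- function(x, probs = c(0.05, 0.95), na.rm = FALSE, type = 7) {
  # quantile() stops with an error if x contains NA and na.rm is FALSE
  cutoffs <- quantile(x, probs = probs, na.rm = na.rm, type = type, names = FALSE)
  # pmax() and pmin() propagate NA, so missing values stay missing in the result
  pmin(pmax(x, cutoffs[1]), cutoffs[2])
}

# Made-up vector with one missing value
x <- c(2, 4, NA, 6, 8, 10, 12, 14, 16, 100)

# 80% winsorization: clamp at the 10th and 90th percentiles, ignoring the NA
res <- winsorize_sketch(x, probs = c(0.10, 0.90), na.rm = TRUE)
res
# 3.6  4.0   NA  6.0  8.0 10.0 12.0 14.0 16.0 32.8
```

With na.rm = FALSE the same call fails inside quantile(), because the cutoffs cannot be computed while the vector contains NA.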

The following example shows how to use the Winsorize() function in practice.

Note: Before using the Winsorize() function, you need to first make sure that the DescTools package is installed.

You can use the following syntax to install this package:

install.packages('DescTools')

Once the DescTools package is installed, load it with library(DescTools) to make the Winsorize() function available.

Example: How to Winsorize Data in R

Suppose that we create the following data frame that contains information about total sales made by various employees at some company:

#create data frame
df <- data.frame(emp=LETTERS[1:18],
                 sales=c(3, 14, 16, 16, 17, 29, 34, 36, 39, 47, 59,
                         64, 65, 66, 68, 79, 91, 98))

#view data frame
df

   emp sales
1    A     3
2    B    14
3    C    16
4    D    16
5    E    17
6    F    29
7    G    34
8    H    36
9    I    39
10   J    47
11   K    59
12   L    64
13   M    65
14   N    66
15   O    68
16   P    79
17   Q    91
18   R    98

Suppose that we would like to winsorize the values in the sales column such that any sales value greater than the 95th percentile is set to the 95th percentile and any value less than the 5th percentile is set to the 5th percentile.

We can use the Winsorize() function to do so:

library(DescTools)

#winsorize values in sales column of data frame
df$sales <- Winsorize(df$sales)

#view updated data frame
df

   emp sales
1    A 12.35
2    B 14.00
3    C 16.00
4    D 16.00
5    E 17.00
6    F 29.00
7    G 34.00
8    H 36.00
9    I 39.00
10   J 47.00
11   K 59.00
12   L 64.00
13   M 65.00
14   N 66.00
15   O 68.00
16   P 79.00
17   Q 91.00
18   R 92.05

Notice that every value in the sales column is unchanged except the first and last, which have been winsorized to the 5th and 95th percentiles, respectively.

Specifically, the minimum value of 3 has been replaced with 12.35, which is the 5th percentile of the values in the sales column.

Likewise, the maximum value of 98 has been replaced with 92.05, which is the 95th percentile of the values in the sales column.
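The two cutoffs can be checked with base R's quantile() function, which uses the type 7 algorithm by default. The sketch below repeats the original sales values in a standalone vector and then reproduces the interpolation by hand; the variable names are chosen only for this illustration.

```r
# Original (already sorted) sales values from the example
sales <- c(3, 14, 16, 16, 17, 29, 34, 36, 39, 47, 59,
           64, 65, 66, 68, 79, 91, 98)

# Percentile cutoffs with R's default algorithm (type 7)
quantile(sales, probs = c(0.05, 0.95))
#    5%   95%
# 12.35 92.05

# Type 7 places the p-th quantile at position h = (n - 1) * p + 1
n <- length(sales)
h_low  <- (n - 1) * 0.05 + 1   # 1.85: 85% of the way from the 1st to the 2nd value
h_high <- (n - 1) * 0.95 + 1   # 17.15: 15% of the way from the 17th to the 18th value

# Linear interpolation between the neighbouring sorted values
low  <- sales[1]  + 0.85 * (sales[2]  - sales[1])    # 3 + 0.85 * 11 = 12.35
high <- sales[17] + 0.15 * (sales[18] - sales[17])   # 91 + 0.15 * 7 = 92.05
```

Because the cutoffs are interpolated, they do not have to be values that actually occur in the data.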

Additional Resources

The following tutorials explain how to perform other common tasks in R:

How to Scale Values Between 0 and 1 in R
How to Normalize Data in R
How to Standardize Data in R
How to Average Across Columns in R
