How to Use the discretize() Function in R


Often you may want to convert a continuous variable to a categorical variable in R.

One of the easiest ways to do so is to use the discretize() function from the arules package.

The discretize() function uses the following syntax:

discretize(x, method='frequency', breaks=3, labels=NULL, include.lowest=TRUE, right=FALSE, ...)

where:

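  • x: Numeric vector (the continuous variable) to discretize
  • method: Method to use for discretization ('frequency', 'interval', 'cluster' or 'fixed')
  • breaks: Number of categories or a vector with boundaries
  • labels: Labels for resulting categories
  • include.lowest: Whether the outermost interval should be closed so that the most extreme value is included (with right=FALSE, this closes the last interval on the right)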
  • right: Whether the intervals should be closed on the right
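As a minimal sketch of how the breaks and labels arguments might be combined, the snippet below first creates three equal-frequency categories with custom names, then uses method='fixed' with hand-picked boundaries. The vector name scores, the label names and the cut points 10 and 20 are assumptions chosen purely for illustration.

```r
library(arules)

# hypothetical vector used only for illustration
scores <- c(3, 3, 4, 4, 7, 8, 10, 11, 13, 14, 15, 19, 22, 22, 28)

# three equal-frequency bins with custom labels
by_freq <- discretize(scores, breaks = 3, labels = c("low", "medium", "high"))
table(by_freq)

# user-defined boundaries (assumed cut points at 10 and 20)
by_fixed <- discretize(scores, method = "fixed",
                       breaks = c(-Inf, 10, 20, Inf),
                       labels = c("under10", "10to19", "20plus"))
table(by_fixed)
```

Wrapping the result in table() is a quick way to check how many values landed in each category.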

The following example shows how to use the discretize() function in practice.

Note: Before using the discretize() function, you may need to first install the arules package. You can use the following syntax to do so:

install.packages('arules')

Once the arules package has been installed, you can proceed to use the discretize() function.
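If you run the same script repeatedly, one possible pattern is to install the package only when it is missing. This is a sketch using base R's requireNamespace(), not something the arules package itself requires:

```r
# install arules only if it is not already available
if (!requireNamespace("arules", quietly = TRUE)) {
  install.packages("arules")
}

# load the package for the current session
library(arules)
```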

Example: How to Use the discretize() Function in R

Suppose that we create a vector named my_values that contains 15 numeric values.

Suppose that we would like to discretize the vector so that each value in the vector is placed into one of three bins in which each bin has the same frequency.

We can use the discretize() function with the following syntax to do so:

library(arules)

#create vector of values
my_values <- c(3, 3, 4, 4, 7, 8, 10, 11, 13, 14, 15, 19, 22, 22, 28)

#discretize values in vector
discretize(my_values)

 [1] [3,7.67)    [3,7.67)    [3,7.67)    [3,7.67)    [3,7.67)    [7.67,14.3)
 [7] [7.67,14.3) [7.67,14.3) [7.67,14.3) [7.67,14.3) [14.3,28]   [14.3,28]  
[13] [14.3,28]   [14.3,28]   [14.3,28]  
attr(,"discretized:breaks")
[1]  3.000000  7.666667 14.333333 28.000000
attr(,"discretized:method")
[1] frequency
Levels: [3,7.67) [7.67,14.3) [14.3,28]

We can see that each of the values in the original vector has been placed into one of the following categories:

  • [3, 7.67)
  • [7.67, 14.3)
  • [14.3, 28]

Notice that there are five values in each of these categories. This is because the default method for the discretize() function is 'frequency', which places (approximately) the same number of values in each category.

By using this method, there is no guarantee that each category actually has the same width. We can see from the category boundaries that indeed the first bin has the smallest width while the last category has the largest width.
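To check the counts and widths directly rather than reading them off the printed output, one option is the sketch below. It relies on the "discretized:breaks" attribute that appears in the output above; the names bins, bin_counts and bin_widths are illustrative.

```r
library(arules)

my_values <- c(3, 3, 4, 4, 7, 8, 10, 11, 13, 14, 15, 19, 22, 22, 28)

# equal-frequency discretization (the default method)
bins <- discretize(my_values)

# number of values in each category
bin_counts <- table(bins)
bin_counts

# width of each category, computed from the stored break points
bin_widths <- diff(attr(bins, "discretized:breaks"))
bin_widths
```

The counts should be identical across categories while the widths grow from the first category to the last.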

Suppose instead that we specified method='interval' to ensure that each category has the same width:

library(arules)

#create vector of values
my_values <- c(3, 3, 4, 4, 7, 8, 10, 11, 13, 14, 15, 19, 22, 22, 28)

#discretize values in vector
discretize(my_values, method='interval')

 [1] [3,11.3)    [3,11.3)    [3,11.3)    [3,11.3)    [3,11.3)    [3,11.3)   
 [7] [3,11.3)    [3,11.3)    [11.3,19.7) [11.3,19.7) [11.3,19.7) [11.3,19.7)
[13] [19.7,28]   [19.7,28]   [19.7,28]  
attr(,"discretized:breaks")
[1]  3.00000 11.33333 19.66667 28.00000
attr(,"discretized:method")
[1] interval
Levels: [3,11.3) [11.3,19.7) [19.7,28]

This method places each of the values in the original vector into one of the following categories:

  • [3, 11.3)
  • [11.3, 19.7)
  • [19.7, 28]

Notice that each of these categories has exactly the same width (about 8.33), but the categories do not contain an equal number of values.

The first category contains eight values, the second category contains four values and the third category contains three values.
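The same kind of check can be run on the equal-width result. This is a small sketch; the names equal_width, width_counts and widths are illustrative.

```r
library(arules)

my_values <- c(3, 3, 4, 4, 7, 8, 10, 11, 13, 14, 15, 19, 22, 22, 28)

# equal-width discretization
equal_width <- discretize(my_values, method = "interval")

# number of values in each category
width_counts <- table(equal_width)
width_counts

# width of each category, computed from the stored break points
widths <- diff(attr(equal_width, "discretized:breaks"))
widths
```

Here the widths should all match while the counts differ, which is the reverse of what the 'frequency' method produces.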

Feel free to use whichever method you prefer depending on how you would like your individual values to be placed into categories.

Additional Resources

The following tutorials explain how to perform other common tasks in R:

How to Create a Frequency Table by Group in R
How to Create Relative Frequency Tables in R
How to Use the describe() Function in R
How to Use the describeBy() Function in R
