A Simple Explanation of the Jaccard Similarity Index


The Jaccard Similarity Index is a measure of the similarity between two sets of data.

Developed by Paul Jaccard, the index ranges from 0 to 1. The closer the index is to 1, the more similar the two sets of data are.

The Jaccard similarity index is calculated as:

Jaccard Similarity = (number of observations in both sets) / (number of observations in either set)

Or, written in notation form:

J(A, B) = |A∩B| / |A∪B|

If two datasets share the exact same members, their Jaccard Similarity Index will be 1. Conversely, if they have no members in common then their similarity will be 0.
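The formula above can be sketched in Python with the built-in set type. This is a minimal illustration, not a reference implementation: the function name jaccard_similarity is made up for this example, and raising an error when both inputs are empty is an assumption, since the formula would otherwise divide 0 by 0.

```python
def jaccard_similarity(a, b):
    """Return |A ∩ B| / |A ∪ B| for two collections of hashable items."""
    set_a, set_b = set(a), set(b)   # duplicates are dropped here
    union = set_a | set_b           # members found in either set
    if not union:
        # Assumption: two empty sets give 0 / 0, so we refuse to pick a value.
        raise ValueError("Jaccard similarity is undefined for two empty sets")
    intersection = set_a & set_b    # members found in both sets
    return len(intersection) / len(union)


print(jaccard_similarity([1, 2, 3], [3, 2, 1]))  # identical members -> 1.0
print(jaccard_similarity([1, 2, 3], [4, 5, 6]))  # no shared members -> 0.0
```

Converting the inputs to sets first means that order and repeated values do not affect the result.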

The following examples show how to calculate the Jaccard Similarity Index for a few different datasets.

Example 1: Jaccard Similarity 

Suppose we have the following two sets of data:

A = [0, 1, 2, 5, 6, 8, 9]
B = [0, 2, 3, 4, 5, 7, 9]

To calculate the Jaccard Similarity between them, we first count the observations that appear in both sets (the intersection), then divide by the number of distinct observations that appear in either set (the union):

  • Number of observations in both: {0, 2, 5, 9} = 4
  • Number of observations in either: {0, 1, 2, 3, 4, 5, 6, 7, 8, 9} = 10
  • Jaccard Similarity: 4 / 10 = 0.4

The Jaccard Similarity Index turns out to be 0.4.
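As a quick check, the same steps can be reproduced with Python's set operators. The variable names in_both and in_either are only illustrative labels for the two counts listed above.

```python
A = [0, 1, 2, 5, 6, 8, 9]
B = [0, 2, 3, 4, 5, 7, 9]

in_both = set(A) & set(B)      # intersection: values present in A and in B
in_either = set(A) | set(B)    # union: values present in A or in B

similarity = len(in_both) / len(in_either)

print(sorted(in_both))   # [0, 2, 5, 9]
print(len(in_either))    # 10
print(similarity)        # 0.4
```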

Example 2: Jaccard Similarity Continued

Suppose we have the following two sets of data:

C = [0, 1, 2, 3, 4, 5]
D = [6, 7, 8, 9, 10]

To calculate the Jaccard Similarity between them, we first count the observations that appear in both sets (the intersection), then divide by the number of distinct observations that appear in either set (the union):

  • Number of observations in both: {} = 0
  • Number of observations in either: {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10} = 11
  • Jaccard Similarity: 0 / 11 = 0

The Jaccard Similarity Index turns out to be 0. This indicates that the two datasets share no common members.
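When two sets have nothing in common, Python can confirm this directly with the set method isdisjoint, which is one way to see why the index is 0. The sketch below uses the sets C and D from this example; the variable names are illustrative.

```python
C = [0, 1, 2, 3, 4, 5]
D = [6, 7, 8, 9, 10]

# isdisjoint returns True when the two collections share no members
no_overlap = set(C).isdisjoint(D)

similarity = len(set(C) & set(D)) / len(set(C) | set(D))

print(no_overlap)   # True
print(similarity)   # 0.0
```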

Example 3: Jaccard Similarity for Characters

Note that we can also use the Jaccard Similarity Index for datasets that contain text values (strings) rather than numbers.

For example, suppose we have the following two sets of data:

E = ['cat', 'dog', 'hippo', 'monkey']
F = ['monkey', 'rhino', 'ostrich', 'salmon']

To calculate the Jaccard Similarity between them, we first count the observations that appear in both sets (the intersection), then divide by the number of distinct observations that appear in either set (the union):

  • Number of observations in both: {'monkey'} = 1
  • Number of observations in either: {'cat', 'dog', 'hippo', 'monkey', 'rhino', 'ostrich', 'salmon'} = 7
  • Jaccard Similarity: 1 / 7 ≈ 0.142857

The Jaccard Similarity Index turns out to be 0.142857. Since this number is fairly low, it indicates that the two sets are quite dissimilar. 
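The same calculation works unchanged for strings in Python, because sets accept any hashable value. As a sketch, the snippet below also uses the standard-library Fraction type (a choice made for this example) to show the exact ratio alongside the rounded decimal.

```python
from fractions import Fraction

E = ['cat', 'dog', 'hippo', 'monkey']
F = ['monkey', 'rhino', 'ostrich', 'salmon']

shared = set(E) & set(F)       # animals present in both lists
combined = set(E) | set(F)     # animals present in either list

exact = Fraction(len(shared), len(combined))

print(shared)                   # {'monkey'}
print(exact)                    # 1/7
print(round(float(exact), 6))   # 0.142857
```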

The Jaccard Distance

The Jaccard distance measures the dissimilarity between two datasets and is calculated as:

Jaccard distance = 1 – Jaccard Similarity

This measure tells us how different two datasets are: the larger the distance, the fewer members they share.

For example, if two datasets have a Jaccard Similarity of 0.8 (80%), then their Jaccard distance is 1 – 0.8 = 0.2, or 20%.
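The distance can be sketched in Python as one minus the similarity. The function name jaccard_distance and the sets X and Y are hypothetical, chosen so that the similarity is 4 / 5 = 0.8; the result is rounded for display because floating-point subtraction may not give exactly 0.2.

```python
def jaccard_distance(a, b):
    """Return 1 - |A ∩ B| / |A ∪ B| (assumes at least one input is non-empty)."""
    set_a, set_b = set(a), set(b)
    similarity = len(set_a & set_b) / len(set_a | set_b)
    return 1 - similarity


# Hypothetical sets: 4 shared members out of 5 distinct members overall
X = [1, 2, 3, 4]
Y = [1, 2, 3, 4, 5]

distance = jaccard_distance(X, Y)
print(round(distance, 2))   # 0.2
```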

Additional Resources

How to Calculate Jaccard Similarity in R
How to Calculate Jaccard Similarity in Python
