How to Calculate Jaccard Similarity in Python


The Jaccard similarity index measures the similarity between two sets of data. It ranges from 0 to 1: a value of 0 means the sets share no elements, a value of 1 means they are identical, and higher values indicate more similar sets.

The Jaccard similarity index is calculated as:

Jaccard Similarity = (number of observations in both sets) / (number of observations in either set)

Or, written in notation form:

J(A, B) = |A∩B| / |A∪B|
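As a quick illustration of the notation, here is a minimal sketch using Python's built-in set operators. The sets A and B below hold made-up values chosen only for this example.

```python
# Two small illustrative sets (hypothetical values, not the tutorial's data)
A = {1, 2, 3}
B = {2, 3, 4}

intersection = A & B   # elements in both sets: {2, 3}
union = A | B          # elements in either set: {1, 2, 3, 4}

# |A∩B| / |A∪B| = 2 / 4
jaccard_similarity = len(intersection) / len(union)
print(jaccard_similarity)  # 0.5
```

Two of the four distinct elements are shared, so the similarity is 0.5.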

This tutorial explains how to calculate Jaccard Similarity for two sets of data in Python.

Example: Jaccard Similarity in Python

Suppose we have the following two sets of data:

a = [0, 1, 2, 5, 6, 8, 9]
b = [0, 2, 3, 4, 5, 7, 9]

We can define the following function to calculate the Jaccard Similarity between the two sets:

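#define Jaccard Similarity function
def jaccard(list1, list2):
    #convert to sets so that duplicate values are counted only once
    set1, set2 = set(list1), set(list2)
    intersection = len(set1 & set2)
    union = len(set1 | set2)
    #two empty inputs would divide by zero; returning 1.0 is a convention chosen here
    return intersection / union if union else 1.0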

#find Jaccard Similarity between the two sets 
jaccard(a, b)

0.4

The Jaccard Similarity between the two lists is 0.4.
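To see where 0.4 comes from, the two parts of the formula can be inspected directly. This is only an illustrative sketch; the names shared and either are introduced here and are not part of the tutorial's function.

```python
a = [0, 1, 2, 5, 6, 8, 9]
b = [0, 2, 3, 4, 5, 7, 9]

shared = sorted(set(a) & set(b))   # values that appear in both lists
either = sorted(set(a) | set(b))   # values that appear in at least one list

print(shared)                      # [0, 2, 5, 9]
print(either)                      # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
print(len(shared) / len(either))   # 0.4
```

Four values are shared out of ten distinct values overall, which gives 4 / 10 = 0.4.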

Note that the function will return 0 if the two sets don’t share any values:

c = [0, 1, 2, 3, 4, 5]
d = [6, 7, 8, 9, 10]

jaccard(c, d)

0.0

And the function will return 1 if the two sets are identical:

e = [0, 1, 2, 3, 4, 5]
f = [0, 1, 2, 3, 4, 5]

jaccard(e, f)

1.0

The function also works for sets that contain strings:

g = ['cat', 'dog', 'hippo', 'monkey']
h = ['monkey', 'rhino', 'ostrich', 'salmon']

jaccard(g, h)

0.14285714285714285
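One thing to watch for when the inputs are lists rather than true sets is repeated values, since the formula counts each distinct element once. The sketch below compares two hypothetical helpers, jaccard_from_lengths and jaccard_from_sets (names invented for this illustration), on made-up lists x and y that hold the same distinct values.

```python
def jaccard_from_lengths(list1, list2):
    # Hypothetical helper: derives the union size from the raw list lengths,
    # so repeated values inflate the denominator
    intersection = len(set(list1) & set(list2))
    union = len(list1) + len(list2) - intersection
    return intersection / union

def jaccard_from_sets(list1, list2):
    # Hypothetical helper: deduplicates both inputs before counting
    set1, set2 = set(list1), set(list2)
    return len(set1 & set2) / len(set1 | set2)

x = [1, 1, 2]   # contains a repeated value
y = [1, 2]

print(jaccard_from_lengths(x, y))  # 2 / (3 + 2 - 2), which is below 1
print(jaccard_from_sets(x, y))     # 1.0
```

Because x and y contain exactly the same distinct values, a similarity of 1 is the expected answer, and only the set-based version returns it.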

You can also use this function to find the Jaccard distance between two sets, which is the dissimilarity between two sets and is calculated as 1 – Jaccard Similarity.

a = [0, 1, 2, 5, 6, 8, 9]
b = [0, 2, 3, 4, 5, 7, 9]

#find Jaccard distance between sets a and b
1 - jaccard(a, b)

0.6
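The Jaccard distance can also be computed directly, because 1 - |A∩B| / |A∪B| equals the number of elements found in exactly one set divided by |A∪B|. The following is a minimal sketch under that identity; jaccard_distance is a name introduced here, and returning 0.0 for two empty inputs is an assumed convention.

```python
def jaccard_distance(list1, list2):
    set1, set2 = set(list1), set(list2)
    union = set1 | set2
    if not union:
        # Assumed convention: two empty inputs are treated as identical
        return 0.0
    # set1 ^ set2 is the symmetric difference: elements in exactly one set
    return len(set1 ^ set2) / len(union)

a = [0, 1, 2, 5, 6, 8, 9]
b = [0, 2, 3, 4, 5, 7, 9]

print(jaccard_distance(a, b))  # 0.6
```

Six of the ten distinct values appear in only one of the two lists, which matches the 0.6 obtained from 1 - jaccard(a, b).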

Refer to the Wikipedia article on the Jaccard index to learn more details about the Jaccard Similarity Index.
