The Jaccard similarity index measures the similarity between two sets of data. It ranges from 0 to 1: a value of 0 means the sets share no elements, a value of 1 means the sets are identical, and higher values indicate greater similarity.
The Jaccard similarity index is calculated as:
Jaccard Similarity = (number of observations in both sets) / (number in either set)
Or, written in notation form:
J(A, B) = |A∩B| / |A∪B|
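To make the notation concrete, here is a minimal sketch that evaluates the formula directly with Python's built-in set operators. The sets A and B hold made-up values chosen only for illustration.

```python
# Two small illustrative sets (hypothetical values, not the tutorial's data)
A = {1, 2, 3, 4}
B = {3, 4, 5, 6}

intersection = A & B   # elements in both sets: {3, 4}
union = A | B          # elements in either set: {1, 2, 3, 4, 5, 6}

# J(A, B) = |A∩B| / |A∪B| = 2 / 6
similarity = len(intersection) / len(union)
print(similarity)
```

Two of the six distinct values appear in both sets, so the similarity is 2/6, or about 0.33.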
This tutorial explains how to calculate Jaccard Similarity for two sets of data in Python.
Example: Jaccard Similarity in Python
Suppose we have the following two sets of data:
a = [0, 1, 2, 5, 6, 8, 9]
b = [0, 2, 3, 4, 5, 7, 9]
We can define the following function to calculate the Jaccard Similarity between the two sets:
#define Jaccard Similarity function
def jaccard(list1, list2):
    #convert to sets so that duplicate values are counted only once
    set1, set2 = set(list1), set(list2)
    intersection = len(set1 & set2)
    union = len(set1 | set2)
    return float(intersection) / union

#find Jaccard Similarity between the two sets
jaccard(a, b)

0.4
The Jaccard Similarity between the two lists is 0.4.
Note that the function will return 0 if the two sets don’t share any values:
c = [0, 1, 2, 3, 4, 5]
d = [6, 7, 8, 9, 10]

jaccard(c, d)

0.0
And the function will return 1 if the two sets are identical:
e = [0, 1, 2, 3, 4, 5]
f = [0, 1, 2, 3, 4, 5]

jaccard(e, f)

1.0
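One further edge case is worth guarding against: if both inputs are empty, the union has size 0 and the jaccard function defined earlier raises a ZeroDivisionError. The sketch below shows one possible variant. The name jaccard_safe and the choice to return 1.0 for two empty inputs are assumptions made here for illustration; other conventions (such as returning 0.0) are equally defensible.

```python
def jaccard_safe(list1, list2):
    """Jaccard similarity that tolerates empty inputs (hypothetical helper)."""
    set1, set2 = set(list1), set(list2)
    union = set1 | set2
    # Assumption: two empty sets are treated as identical, so return 1.0
    if not union:
        return 1.0
    return len(set1 & set2) / len(union)

print(jaccard_safe([], []))
print(jaccard_safe([0, 1, 2, 5, 6, 8, 9], [0, 2, 3, 4, 5, 7, 9]))
```

For non-empty inputs this variant applies the same formula as before, so the second call reproduces the similarity of the lists a and b.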
The function also works for sets that contain strings:
g = ['cat', 'dog', 'hippo', 'monkey']
h = ['monkey', 'rhino', 'ostrich', 'salmon']

jaccard(g, h)

0.14285714285714285
You can also use this function to find the Jaccard distance between two sets, which is the dissimilarity between two sets and is calculated as 1 – Jaccard Similarity.
a = [0, 1, 2, 5, 6, 8, 9]
b = [0, 2, 3, 4, 5, 7, 9]

#find Jaccard distance between sets a and b
1 - jaccard(a, b)

0.6
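If the distance is needed often, it can be convenient to wrap the calculation in its own function. The following is a minimal sketch; jaccard_distance is a hypothetical name, and the similarity is recomputed inline so that the block runs on its own.

```python
def jaccard_distance(list1, list2):
    """Return 1 - Jaccard similarity (hypothetical helper for illustration)."""
    set1, set2 = set(list1), set(list2)
    similarity = len(set1 & set2) / len(set1 | set2)
    return 1 - similarity

a = [0, 1, 2, 5, 6, 8, 9]
b = [0, 2, 3, 4, 5, 7, 9]
print(jaccard_distance(a, b))
```

Like the similarity function it builds on, this sketch assumes at least one of the inputs is non-empty.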
Refer to the Wikipedia article on the Jaccard index to learn more about the Jaccard similarity index.