The Jaccard similarity index measures the similarity between two sets of data. It can range from 0 to 1. The higher the number, the more similar the two sets of data.

The Jaccard similarity index is calculated as:

**Jaccard Similarity** = (number of observations in both sets) / (number in either set)

Or, written in notation form:

**J(A, B)** = |A∩B| / |A∪B|
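As a quick sanity check of the formula, the ratio can be computed directly with Python's built-in set operators. This is only a minimal sketch: the sets `A` and `B` below are made-up values chosen for illustration and do not come from the tutorial's data.

```python
# Hypothetical example sets, chosen only to illustrate the formula
A = {1, 2, 3, 4}
B = {3, 4, 5, 6}

intersection = A & B   # elements in both sets: {3, 4}
union = A | B          # elements in either set: {1, 2, 3, 4, 5, 6}

# |A∩B| / |A∪B| = 2 / 6
similarity = len(intersection) / len(union)
print(similarity)
```

Two of the six distinct values are shared, so the similarity is one third.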

This tutorial explains how to calculate Jaccard Similarity for two sets of data in Python.

**Example: Jaccard Similarity in Python**

Suppose we have the following two sets of data:

```python
a = [0, 1, 2, 5, 6, 8, 9]
b = [0, 2, 3, 4, 5, 7, 9]
```

We can define the following function to calculate the Jaccard Similarity between the two sets:

```python
# define Jaccard Similarity function
def jaccard(list1, list2):
    set1, set2 = set(list1), set(list2)
    intersection = len(set1.intersection(set2))
    # size the union from the sets, not the lists, so duplicates are not counted twice
    union = len(set1) + len(set2) - intersection
    return intersection / union

# find Jaccard Similarity between the two sets
jaccard(a, b)

0.4
```

The Jaccard Similarity between the two lists is **0.4**.
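To see where the 0.4 comes from, it can help to look at the intermediate sets. The sketch below repeats the two lists so that it runs on its own; the variable names `shared` and `combined` are illustrative and not part of the original function.

```python
a = [0, 1, 2, 5, 6, 8, 9]
b = [0, 2, 3, 4, 5, 7, 9]

shared = set(a) & set(b)     # values present in both lists
combined = set(a) | set(b)   # values present in at least one list

print(sorted(shared))                # [0, 2, 5, 9]
print(len(combined))                 # 10
print(len(shared) / len(combined))   # 0.4
```

Four values are shared out of ten distinct values overall, which gives 4 / 10 = 0.4.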

Note that the function will return **0** if the two sets don’t share any values:

```python
c = [0, 1, 2, 3, 4, 5]
d = [6, 7, 8, 9, 10]

jaccard(c, d)

0.0
```

And the function will return **1** if the two sets are identical:

```python
e = [0, 1, 2, 3, 4, 5]
f = [0, 1, 2, 3, 4, 5]

jaccard(e, f)

1.0
```

The function also works for sets that contain strings:

```python
g = ['cat', 'dog', 'hippo', 'monkey']
h = ['monkey', 'rhino', 'ostrich', 'salmon']

jaccard(g, h)

0.14285714285714285
```

You can also use this function to find the **Jaccard distance** between two sets, which measures their *dissimilarity* and is calculated as 1 - Jaccard Similarity.

```python
a = [0, 1, 2, 5, 6, 8, 9]
b = [0, 2, 3, 4, 5, 7, 9]

# find Jaccard distance between sets a and b
1 - jaccard(a, b)

0.6
```
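If the distance is needed often, it may be convenient to wrap it in its own function. The following is a minimal sketch, not part of the original tutorial: the helper name `jaccard_distance` is hypothetical, and treating two empty inputs as identical (similarity 1.0) is an assumed convention added here to avoid a division by zero.

```python
def jaccard(list1, list2):
    set1, set2 = set(list1), set(list2)
    union = set1 | set2
    if not union:
        # Assumption: two empty inputs are treated as identical
        return 1.0
    return len(set1 & set2) / len(union)

def jaccard_distance(list1, list2):
    # Hypothetical helper: distance is 1 minus the similarity
    return 1 - jaccard(list1, list2)

print(jaccard_distance([0, 1, 2, 5, 6, 8, 9], [0, 2, 3, 4, 5, 7, 9]))  # 0.6
print(jaccard_distance([], []))                                        # 0.0
```

Other conventions for the empty case are possible, such as raising an error, so the choice should match how the result is used downstream.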

**Related:** How to Calculate Jaccard Similarity in R

*Refer to this Wikipedia page to learn more details about the Jaccard Similarity Index.*

In the function jaccard, the variable union must be computed from the lengths of the two sets, not the lengths of the two lists. If either list contains duplicate values, a line such as the following overcounts the union:

```python
union = (len(list1) + len(list2)) - intersection
```

It should be:

```python
union = (len(set(list1)) + len(set(list2))) - intersection
```
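The effect of duplicates can be seen by running both ways of sizing the union side by side. This is a small sketch for illustration: the function names `jaccard_from_lists` and `jaccard_from_sets` and the inputs `x` and `y` are hypothetical.

```python
def jaccard_from_lists(list1, list2):
    # Union sized from list lengths: overcounts when a list repeats a value
    intersection = len(set(list1).intersection(list2))
    union = (len(list1) + len(list2)) - intersection
    return intersection / union

def jaccard_from_sets(list1, list2):
    # Union sized from set lengths: duplicates are ignored
    set1, set2 = set(list1), set(list2)
    intersection = len(set1 & set2)
    union = len(set1) + len(set2) - intersection
    return intersection / union

x = [1, 1, 2, 3]   # hypothetical input with a duplicated value
y = [1, 2, 3]

print(jaccard_from_lists(x, y))  # 0.75
print(jaccard_from_sets(x, y))   # 1.0
```

Both lists contain exactly the same distinct values, so the similarity should be 1.0; only the set-based version reports that.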