In statistics, the **Jaro-Winkler similarity** is a way to measure the similarity between two strings.

The **Jaro similarity** (sim_{j}) between two strings is defined as:

sim_{j} = 1/3 * ( m /|s_{1}| + m/|s_{2}| + (m-t)/m )

where:

- **m**: Number of matching characters
  - Two characters from s_{1} and s_{2} are considered matching if they are the same and not farther than [max(|s_{1}|, |s_{2}|) / 2] – 1 characters apart, where the brackets denote rounding down to the nearest integer.
- **|s_{1}|**, **|s_{2}|**: The length of the first and second strings, respectively
- **t**: Number of transpositions
  - Calculated as the number of matching (but different sequence order) characters divided by 2.
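The definition above can be sketched in base R. This is a minimal illustration rather than a reference implementation: the function name `jaro_similarity`, the greedy left-to-right matching, and the treatment of empty strings are assumptions that go beyond the text, and MARTHA/MARHTA is only an illustrative pair.

```r
# Sketch of the Jaro similarity (hypothetical helper, base R only)
jaro_similarity <- function(s1, s2) {
  n1 <- nchar(s1)
  n2 <- nchar(s2)
  # Assumption: two empty strings count as identical, one empty string as no match
  if (n1 == 0 || n2 == 0) return(as.numeric(n1 == n2))

  a <- strsplit(s1, "")[[1]]
  b <- strsplit(s2, "")[[1]]

  # Matching window: floor(max(|s1|, |s2|) / 2) - 1, never below 0
  window <- max(floor(max(n1, n2) / 2) - 1, 0)

  matched_a <- rep(FALSE, n1)
  matched_b <- rep(FALSE, n2)

  for (i in seq_len(n1)) {
    lo <- max(1, i - window)
    hi <- min(n2, i + window)
    if (lo > hi) next  # window falls entirely outside s2
    for (j in lo:hi) {
      # Take the first unused, identical character inside the window
      if (!matched_b[j] && a[i] == b[j]) {
        matched_a[i] <- TRUE
        matched_b[j] <- TRUE
        break
      }
    }
  }

  m <- sum(matched_a)
  if (m == 0) return(0)

  # Matched characters that disagree when read in order, divided by 2
  t_count <- sum(a[matched_a] != b[matched_b]) / 2

  (m / n1 + m / n2 + (m - t_count) / m) / 3
}

jaro_similarity("MARTHA", "MARHTA")  # about 0.944 (m = 6, t = 1)
```

Here all six letters match and the swapped T and H count as one transposition, so the result is (6/6 + 6/6 + 5/6) / 3.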

The **Jaro-Winkler similarity** (sim_{w}) is defined as:

sim_{w} = sim_{j} + lp(1 – sim_{j})

where:

- **sim_{j}**: The Jaro similarity between two strings, s_{1} and s_{2}
- **l**: Length of the common prefix at the start of the string (max of 4 characters)
- **p**: Scaling factor for how much the score is adjusted upwards for having common prefixes. Typically this is defined as p = 0.1 and should not exceed p = 0.25, since with a prefix of up to 4 characters a larger value could push the similarity above 1.
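The prefix adjustment on its own can be sketched as follows. The helper name `winkler_adjust` and its arguments are hypothetical; it assumes the Jaro similarity has already been computed and is passed in as `sim_j`, and the MARTHA/MARHTA values are only illustrative.

```r
# Sketch: apply the Winkler prefix adjustment to a known Jaro similarity
winkler_adjust <- function(sim_j, s1, s2, p = 0.1, max_prefix = 4) {
  stopifnot(p >= 0, p <= 0.25)  # keeps the result at or below 1

  a <- strsplit(s1, "")[[1]]
  b <- strsplit(s2, "")[[1]]
  n <- min(length(a), length(b), max_prefix)

  # l: number of leading characters that agree, capped at max_prefix
  l <- 0
  for (k in seq_len(n)) {
    if (a[k] != b[k]) break
    l <- l + 1
  }

  sim_j + l * p * (1 - sim_j)
}

# Jaro similarity of 17/18 and shared prefix "MAR" (l = 3)
winkler_adjust(17/18, "MARTHA", "MARHTA")  # about 0.961
```

Because the adjustment only adds l * p times the remaining gap to 1, strings with no common prefix keep their Jaro similarity unchanged.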

The Jaro-Winkler similarity between two strings is always between 0 and 1 where:

- **0** indicates no similarity between the strings
- **1** indicates that the strings are an exact match

**Note**: The Jaro-Winkler *distance* would be defined as 1 – sim_{w}.

The following example shows how to calculate the Jaro-Winkler similarity between two strings in practice.

**Example: Calculating Jaro-Winkler Similarity Between Two Strings**

Suppose we have the following two strings:

- String 1 (s_{1}): **mouse**
- String 2 (s_{2}): **mute**

First, let’s calculate the Jaro Similarity between these two strings:

sim_{j} = 1/3 * ( m /|s_{1}| + m/|s_{2}| + (m-t)/m )

where:

- **m**: Number of matching characters
  - Two characters from s_{1} and s_{2} are considered matching if they are the same and not farther than [max(|s_{1}|, |s_{2}|) / 2] – 1 characters apart.

In this case, the longer string has 5 characters and 5/2 rounds down to 2, so [max(|s_{1}|, |s_{2}|) / 2] – 1 is calculated as 2 – 1 = 1. Three letters match within one position of each other: m (position 1 in both strings), u (position 3 in mouse, position 2 in mute), and e (position 5 in mouse, position 4 in mute). Thus, **m = 3**.

- **|s_{1}|**, **|s_{2}|**: The length of the first and second strings, respectively

In this case, **|s_{1}| = 5** and **|s_{2}| = 4**.

- **t**: Number of transpositions
  - Calculated as the number of matching (but different sequence order) characters divided by 2.

In this case, there are three matching characters but they’re already in the same sequence order, so **t = 0**.
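The three quantities worked out by hand can be traced in base R. This is a rough sketch for these two strings only; the variable names (`window`, `match_pos`, `used`, `t_count`) are made up for the illustration, and the window is rounded down before subtracting 1.

```r
s1 <- strsplit("mouse", "")[[1]]
s2 <- strsplit("mute", "")[[1]]

# Matching window: floor(5 / 2) - 1
window <- floor(max(length(s1), length(s2)) / 2) - 1

# For each character of s1, the position of its match in s2 (NA if none)
match_pos <- rep(NA_integer_, length(s1))
used <- rep(FALSE, length(s2))
for (i in seq_along(s1)) {
  # Unused, identical characters of s2 that lie within the window of position i
  candidates <- which(!used & s2 == s1[i] & abs(seq_along(s2) - i) <= window)
  if (length(candidates) > 0) {
    match_pos[i] <- candidates[1]
    used[candidates[1]] <- TRUE
  }
}

m <- sum(!is.na(match_pos))
# Matched characters that disagree when both strings are read in order, halved
t_count <- sum(s1[!is.na(match_pos)] != s2[used]) / 2

print(data.frame(char = s1, pos_in_s1 = seq_along(s1), pos_in_s2 = match_pos))
print(c(window = window, m = m, t = t_count))
```

The printed table shows m, u, and e paired with positions 1, 2, and 4 of the second string, while o and s have no match.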

Thus, we would calculate the Jaro Similarity as:

sim_{j} = 1/3 * ( 3/5 + 3/4 + (3-0)/3 ) = 0.78333.

Next, let’s calculate the Jaro-Winkler similarity (sim_{w}) as:

sim_{w} = sim_{j} + lp(1 – sim_{j})

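In this case, the two strings share only the one-letter prefix "m" (their second letters, o and u, differ), so l = 1. Using the typical scaling factor p = 0.1, we would calculate: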

sim_{w} = 0.78333 + (1)*(0.1)(1 – 0.78333) = 0.805.

The Jaro-Winkler similarity between the two strings is **0.805**.

Since this value is much closer to 1 than to 0, it tells us that the two strings are fairly similar.
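The arithmetic from the two steps above can be re-run in base R by plugging in the values found by hand. The variable names here are just for illustration.

```r
# Values found by hand for "mouse" and "mute"
m <- 3        # matching characters
t_count <- 0  # transpositions
len1 <- 5     # |s1|
len2 <- 4     # |s2|
l <- 1        # common prefix length ("m")
p <- 0.1      # scaling factor

sim_j <- (m / len1 + m / len2 + (m - t_count) / m) / 3
sim_w <- sim_j + l * p * (1 - sim_j)

round(sim_j, 5)  # 0.78333
round(sim_w, 3)  # 0.805
```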

We can confirm this is correct by calculating the Jaro-Winkler similarity between the two strings in R:

    library(stringdist)

    #calculate Jaro-Winkler similarity between 'mouse' and 'mute'
    1 - stringdist("mouse", "mute", method = "jw", p=0.1)

    [1] 0.805

This matches the value that we calculated by hand.

**Additional Resources**

The following tutorials explain how to calculate other similarity metrics:

An Introduction to Bray-Curtis Dissimilarity

An Introduction to the Jaccard Similarity Index