• Examples showing Levenshtein Distance calculation. Source: Devopedia 2019.
• Illustrating the Wagner-Fischer algorithm. Source: Adapted from Levenshtein.net 2019.
• Normalizing edit distances. Source: Marzal and Vidal 1993, fig. 2.
• Levenshtein Distance obeys the triangle inequality. Source: Minerich 2012.

# Levenshtein Distance

arvindpdmn
Last updated by arvindpdmn on 2019-09-04 12:14:32
Created by arvindpdmn on 2019-09-03 18:24:07

## Summary

Given two words, we can ask how similar the two words are. We can ask the same question of two sentences or string sequences. To quantify the similarity, we need a measure. Levenshtein Distance is one such measure.

By means of simple operations (such as insertion, deletion and substitution), we can determine how to transform one word or sequence into the other. There could be many ways to achieve this. Levenshtein Distance is defined as the minimum number of operations required to make the two inputs equal. The lower the number, the more similar the two inputs being compared.

There are a few algorithms to solve this distance problem. Levenshtein Distance is useful in many application domains including signal processing, natural language processing and computational biology.

Levenshtein Distance is also called edit distance.

## Milestones

1964

Fred Damerau at IBM looks at the problem of matching an erroneous word to words in a dictionary. The error could be a missing character, an extra character, or a wrong character. By accounting for these error types, he is able to find a match in 95% of all error cases.

1965

In communication systems, information can get corrupted. Typically, 0s may become 1s, and vice versa. V.I. Levenshtein also considers the problem of extra bits getting introduced into the information stream or bits that go missing. He states that "the function r(x, y) defined on pairs of binary words as equal to the smallest number of deletions, insertions, and reversals that will transform x into y is a metric, and that a code K can correct s deletions, insertions, and reversals if and only if r(x, y) > 2s for any two different words x and y in K."

1974

The Wagner-Fischer Algorithm is invented to compute the edit distance between two strings. The algorithm has a time complexity of $$O(m\,n)$$ when comparing two strings of lengths m and n. The term edit distance is coined by Wagner and Fischer. P.H. Sellers coins evolutionary distance as an alternative term.

1975

Bahl and Jelinek provide a stochastic interpretation of edit distance. Called similarity learning, the approach learns the pairwise similarity of words by analysing a corpus. Each edit operation has a cost, and the idea is to learn these costs, such as by maximizing the likelihood of the data.

1980

Masek and Paterson design an algorithm that brings down the worst-case time complexity to $$O(m\,n/\log_{\sigma}^{2}n)$$ but requires $$O(n)$$ more space. In practice, it can't beat the classical algorithm for texts below 40GB.

1986

Oommen considers constrained edit distance, where constraints are placed on the edit operations. Example constraints could be "no more than k insertions" or "exactly k substitutions". He also gives an algorithm that has $$O(m\,n\,\min(m,n))$$ time and space complexity.

1993

Marzal and Vidal define the Normalized Edit Distance (NED) and an algorithm for computing it. In 2000, two researchers improve the time complexity of finding NED from $$O(m\,n^{2})$$ to $$O(m\,n\,\log n)$$.

2007

Cormode and Muthukrishnan consider substring moves and approximation of edit distance. With these modifications, they show that time complexity can be sub-quadratic.

2015

Bačkurs and Indyk prove that edit distance can't be computed in strongly sub-quadratic time, assuming the Strong Exponential Time Hypothesis holds. Thus, we can't expect to do significantly better than the Wagner-Fischer algorithm in terms of worst-case time complexity.

## Discussion

• Could you explain Levenshtein Distance with an example?

Consider the two words "rain" and "shine". To transform "rain" into "shine", we can replace 'r' with 's', replace 'a' with 'h' and insert 'e'. Thus, the edit distance between these two words is 3. This is assuming that all operations have the same cost of 1. If we assign a higher cost to substitutions, say 2, then the edit distance becomes 2*2 + 1 = 5. To transform "shine" to "rain", the operations are reversed (insertions become deletions) but the edit distance is the same when costs are symmetric.

Consider the words "train" and "shine". If each letter is replaced, we need 5 operations, but this is not the minimum. We can do better by aligning "in" and using one insertion, two substitutions and one deletion. The edit distance is therefore 4.
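The examples above can be checked with a minimal dynamic-programming sketch in Python. The function name and structure here are ours, not from any particular library:

```python
def levenshtein(x, y):
    """Unit-cost edit distance between strings x and y."""
    m, n = len(x), len(y)
    # dp[i][j] = distance between the prefixes x[:i] and y[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i          # i deletions
    for j in range(n + 1):
        dp[0][j] = j          # j insertions
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = 0 if x[i - 1] == y[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + sub)  # substitution or match
    return dp[m][n]

print(levenshtein("rain", "shine"))   # 3
print(levenshtein("train", "shine"))  # 4
```

With all costs equal to 1, this reproduces the distances 3 and 4 discussed above.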

A lower distance implies greater similarity between the two words. In NLP, we generally wish to minimise the distance. In computational biology, we wish to maximize similarity. In error correcting codes, we wish to maximize the distance so that one codeword is not easily confused with another.

• What are "weights" in the context of edit distance?

What we call "weights" in NLP are called "scores" in computational biology. In the simplest case, we can give the same weight to insertions, deletions and substitutions, regardless of the symbol.

In reality, some errors are more likely than others. Consider how keys are laid out on a standard computer keyboard: it's easy to mistype an adjacent key, so some spelling errors are more likely than others. Likewise, in optical character recognition, it's easy to confuse 'o' with 'e', or 'm' with 'n'. Similarly, in a DNA sequence, some deletions or insertions are more likely than others. Weights help us account for these varying probabilities.

Mathematically, given words x and y of lengths N and M, the edit distance is D(N,M). Characters are 1-indexed, while the matrix D is indexed from 0. Accounting for the weights, the edit distance can be computed this way:

Initialization

$$D(0,0) = 0\\D(i,0) = D(i-1,0) + del[x(i)];\; 1 \leq i \leq N\\D(0,j) = D(0,j-1) + ins[y(j)];\; 1 \leq j \leq M$$

Recurrence Relation

$$D(i,j) = \min\begin{cases}D(i-1,j) + del[x(i)]\\D(i,j-1) + ins[y(j)]\\D(i-1,j-1) + sub[x(i),y(j)]\end{cases}$$
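The initialization and recurrence translate almost line by line into Python. The cost functions below (unit insertions/deletions, substitutions costing 2) are illustrative choices matching the earlier "rain"/"shine" example; the names are ours:

```python
def weighted_edit_distance(x, y, ins, dele, sub):
    """Edit distance with per-symbol operation weights.

    ins(c), dele(c): cost of inserting/deleting symbol c.
    sub(a, b): cost of substituting a with b (0 when a == b).
    """
    N, M = len(x), len(y)
    D = [[0] * (M + 1) for _ in range(N + 1)]
    for i in range(1, N + 1):                 # D(i,0) = D(i-1,0) + del[x(i)]
        D[i][0] = D[i - 1][0] + dele(x[i - 1])
    for j in range(1, M + 1):                 # D(0,j) = D(0,j-1) + ins[y(j)]
        D[0][j] = D[0][j - 1] + ins(y[j - 1])
    for i in range(1, N + 1):
        for j in range(1, M + 1):
            D[i][j] = min(D[i - 1][j] + dele(x[i - 1]),
                          D[i][j - 1] + ins(y[j - 1]),
                          D[i - 1][j - 1] + sub(x[i - 1], y[j - 1]))
    return D[N][M]

unit = lambda c: 1                            # every insertion/deletion costs 1
sub2 = lambda a, b: 0 if a == b else 2        # substitutions cost 2
print(weighted_edit_distance("rain", "shine", unit, unit, sub2))  # 5
```

With substitution cost 2, the distance between "rain" and "shine" becomes 5, as computed earlier.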

• Which are the algorithms that can compute the Levenshtein Distance?

There are many algorithms to compute the edit distance. Many of them are classified as dynamic programming algorithms. One of them is the Wagner-Fischer algorithm that we describe here.

The idea is to build a matrix of edit distances between all prefixes of one string and all prefixes of the other string. The matrix is filled starting from the upper-left corner; the value in the lower-right corner is the Levenshtein Distance. Moving horizontally implies insertion, vertically implies deletion, and diagonally implies substitution (or a match). Each cell minimizes the cost locally.

Consider the example of transforming "levenshtein" to "meilenstein" with equal weights of 1. There are actually two solutions, both having an edit distance of 4. In one solution, we insert 'i' and replace 'v' with 'l': horizontal, then diagonal. In the other solution, we replace 'v' with 'i' and then insert 'l': diagonal, then horizontal.

If we wish to know the sequence of operations, we need to keep track of the path through the matrix, which requires more memory.
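One way to recover the operations is to keep the full matrix and walk back from the lower-right cell, recording which move explains each cell's value. This is a sketch under that approach; the helper names are ours:

```python
def levenshtein_with_ops(x, y):
    """Return (distance, operations) via Wagner-Fischer plus backtracking."""
    m, n = len(x), len(y)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = 0 if x[i - 1] == y[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1, dp[i][j - 1] + 1,
                           dp[i - 1][j - 1] + sub)
    # Walk back from (m, n), preferring the diagonal when it explains the cell.
    ops, i, j = [], m, n
    while i > 0 or j > 0:
        sub = 0 if i > 0 and j > 0 and x[i - 1] == y[j - 1] else 1
        if i > 0 and j > 0 and dp[i][j] == dp[i - 1][j - 1] + sub:
            if sub:                                  # cost-0 matches are skipped
                ops.append(('sub', x[i - 1], y[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and dp[i][j] == dp[i - 1][j] + 1:
            ops.append(('del', x[i - 1]))
            i -= 1
        else:
            ops.append(('ins', y[j - 1]))
            j -= 1
    return dp[m][n], list(reversed(ops))

d, ops = levenshtein_with_ops("levenshtein", "meilenstein")
print(d, ops)  # distance 4, and a list of four operations
```

Every cost-incurring move appends one operation, so the number of recorded operations equals the edit distance.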

• What do you mean by Normalized Levenshtein Distance?

Consider two strings of length 3 with an edit distance of 2. Consider another two strings of length 9 with an edit distance of 3. We may say that the latter pair is more similar. To quantify this, we normalize the edit distance.

One approach is to calculate the edit distance as usual and then divide it by the number of operations, usually called the length of the edit path. This is called edit distance with post-normalization. The problem with this approach is that it may not produce the minimum normalized edit distance. Suppose weights are unequal for insertions/deletions and substitutions: a longer edit path can then yield a lower normalized distance than the path found by the Wagner-Fischer algorithm. The Normalized Edit Distance (NED) algorithm finds this minimum.

For example, assume insertions/deletions have a weight of 2 and substitutions a weight of 3. Consider the pair of words (abbb, aaab). When we normalize after the Wagner-Fischer algorithm, we get 1.5, but with NED we get 1.33. It's also been observed that NED obeys the triangle inequality in most cases.
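This example can be checked by brute force: compute the cheapest editing path of each possible length k and minimize cost/k. This follows the definition of NED directly rather than Marzal and Vidal's more efficient algorithm; the names and structure are ours:

```python
def normalized_edit_distance(x, y, w_indel=2, w_sub=3):
    """Minimum over all edit paths of (total weight / path length).

    A path step is an insertion, deletion, or (possibly cost-0) substitution;
    matches count toward the path length with zero cost.
    C[k][i][j] = cheapest path of exactly k steps from x[:i] to y[:j].
    """
    INF = float('inf')
    m, n = len(x), len(y)
    kmax = m + n                      # longest useful path: all deletions + insertions
    C = [[[INF] * (n + 1) for _ in range(m + 1)] for _ in range(kmax + 1)]
    C[0][0][0] = 0
    for k in range(1, kmax + 1):
        for i in range(m + 1):
            for j in range(n + 1):
                best = INF
                if i > 0:
                    best = min(best, C[k-1][i-1][j] + w_indel)   # deletion
                if j > 0:
                    best = min(best, C[k-1][i][j-1] + w_indel)   # insertion
                if i > 0 and j > 0:
                    s = 0 if x[i-1] == y[j-1] else w_sub
                    best = min(best, C[k-1][i-1][j-1] + s)       # (mis)match
                C[k][i][j] = best
    return min(C[k][m][n] / k for k in range(1, kmax + 1) if C[k][m][n] < INF)

print(normalized_edit_distance("abbb", "aaab"))  # 1.3333...
```

For (abbb, aaab), the cheapest Wagner-Fischer path uses 2 substitutions over a path of length 4, normalizing to 6/4 = 1.5, but a longer path (2 insertions, 2 deletions, 2 matches) gives 8/6 = 1.33, which is what the search above finds.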

• How is Levenshtein Distance related to the triangle inequality?

In mathematics, edit distance can be seen as a metric in a metric space. In other words, the problem can be interpreted geometrically. The similarity between two words can be seen as the geometric distance between two points in the metric space. Such a metric obeys the triangle inequality. Given distance d, $$d(x,y) + d(y,z) \geq d(x,z)$$. A composition of two edits can't be smaller than a direct edit.

Let's consider the pair (rcik, rick), whose edit distance is 2. Another pair, (rick, irkc), has an edit distance of 3. Now consider the direct edit distance between (rcik, irkc), which is 4. This direct distance is less than the sum of the other two (2 + 3 = 5). Levenshtein Distance is indeed a metric and it obeys the triangle inequality.
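These three distances can be verified with a short script. This variant keeps only two rows of the matrix at a time, a common memory optimization; the function name is ours:

```python
def lev(x, y):
    """Unit-cost Levenshtein distance, computed row by row."""
    prev = list(range(len(y) + 1))
    for i, a in enumerate(x, 1):
        cur = [i]
        for j, b in enumerate(y, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (a != b)))  # substitution or match
        prev = cur
    return prev[-1]

d1 = lev("rcik", "rick")   # 2
d2 = lev("rick", "irkc")   # 3
d3 = lev("rcik", "irkc")   # 4
print(d1 + d2 >= d3)       # True: the triangle inequality holds
```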

Wagner and Fischer formally proved this. In fact, their algorithm to compute the edit distance relies on this. The triangle inequality is satisfied due to the constraint that no position in a string is transformed more than once. This implies that all operations can be applied in parallel.

• Are there any variants of Levenshtein Distance?

Based on the work of Fred Damerau (1964) and V.I. Levenshtein (1965), Damerau–Levenshtein Distance is sometimes used instead of the classical edit distance. With Damerau–Levenshtein Distance, transpositions are also allowed, where two adjacent symbols can be swapped. Consider the pair (rcik, irkc). It has an edit distance of 4, due to 4 substitutions. But if transpositions are allowed, the Damerau–Levenshtein Distance is 3: rcik -> rick -> irck -> irkc.
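Since these strings are short, we can confirm both distances by brute force: a breadth-first search over all single operations, with and without adjacent transposition. This is a check of the definition, not a practical algorithm, and all names here are ours:

```python
from collections import deque

def min_ops(src, dst, transpositions=False):
    """Minimum number of edit operations from src to dst, via BFS."""
    alphabet = set(src + dst)
    limit = len(src) + len(dst)       # prune unboundedly long intermediates
    seen = {src}
    queue = deque([(src, 0)])
    while queue:
        s, d = queue.popleft()
        if s == dst:
            return d
        neighbours = set()
        for i in range(len(s)):
            neighbours.add(s[:i] + s[i+1:])                      # deletion
            for c in alphabet:
                neighbours.add(s[:i] + c + s[i+1:])              # substitution
        for i in range(len(s) + 1):
            for c in alphabet:
                neighbours.add(s[:i] + c + s[i:])                # insertion
        if transpositions:
            for i in range(len(s) - 1):
                neighbours.add(s[:i] + s[i+1] + s[i] + s[i+2:])  # adjacent swap
        for t in neighbours:
            if len(t) <= limit and t not in seen:
                seen.add(t)
                queue.append((t, d + 1))

print(min_ops("rcik", "irkc"))                       # 4 (Levenshtein)
print(min_ops("rcik", "irkc", transpositions=True))  # 3 (Damerau-Levenshtein)
```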

For automatic speech recognition, Levenshtein Distance is calculated on words rather than characters. It's also been extended to three or more dimensions.

Levenshtein Distance has been adapted to Chinese language to extract person-affiliation relations. Words "爱(love)" and "喜欢(like)" are synonyms and should be seen as similar. Likewise, "the apples" and "the sweet apples" are not too different. We can use weights to account for these.

• What are some alternatives to Levenshtein Distance?

The choice of a suitable similarity metric depends on the application. If we're comparing two sets where order is irrelevant, then Jaccard Distance can be used. A similar measure to Jaccard Distance is Sørensen–Dice coefficient.

Longest Common Subsequence (LCS) distance allows only insertion and deletion. Jaro Distance allows only transposition. Hamming Distance allows only substitution and thus can be used to compare only strings of the same length. Episode distance is a measure that allows only insertions.

If strings are treated as vectors, Cosine Similarity can be used. This is based on the angle between the two vectors.

• What are some applications of Levenshtein Distance?

Computing the Levenshtein Distance has also been called the string-to-string correction problem. It's relevant to string matching problems occurring in various domains such as information retrieval, pattern recognition, error correction, and molecular genetics.

In NLP, Levenshtein Distance is useful for suggesting spelling corrections, detecting plagiarism, and aiding translators in translation memory systems. For example, if a word is misspelt as "ligting", Levenshtein Distance can suggest that "lighting" is most similar, which helps the writer correct the spelling. It's also been used to cluster words that share a root word.

Levenshtein Distance in fact originated in signal processing, in particular, in correcting errors in communication systems. Another application is speech recognition: a perfect match of an audio signal is impossible, and edit distance can find the most suitable match.

DNA is a sequence of the letters A, C, G and T. Searching for specific sequences is often difficult due to measurement errors, mutations or evolutionary alterations. Thus, measuring the similarity of two sequences using Levenshtein Distance is more useful than exact matching.

• What software tools are available for calculating Levenshtein Distance?

In Python's NLTK package, we can compute the Levenshtein Distance between two strings using nltk.edit_distance(). We can optionally set a higher cost for substitutions. Another optional argument, if set to true, permits transpositions and thus lets us calculate the Damerau–Levenshtein Distance.

In TensorFlow, tf.edit_distance() gives us the Levenshtein Distance and can optionally normalize it. In the R language, the package stringdist can calculate many string edit distances.

In some applications, when comparing sentences or paragraphs, we may want to exclude some words, or invoke some pre-processing tasks such as lemmatization or stemming. In such cases, we could first tokenize the input using nltk.word_tokenize(s1) (in Python) before calculating the edit distance.
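The same recurrence works on any sequences, not just strings, so after tokenization we can compute a word-level distance. This sketch uses a plain whitespace split as a stand-in for nltk.word_tokenize; the function name is ours:

```python
def edit_distance(seq1, seq2):
    """Levenshtein distance over any two sequences (characters, words, ...)."""
    m, n = len(seq1), len(seq2)
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        cur = [i]
        for j in range(1, n + 1):
            cost = 0 if seq1[i - 1] == seq2[j - 1] else 1
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + cost))
        prev = cur
    return prev[-1]

s1 = "the sweet apples".split()   # stand-in for nltk.word_tokenize(s1)
s2 = "the apples".split()
print(edit_distance(s1, s2))      # 1: one word deleted
```

This is also the word-level distance used in speech recognition, where whole words rather than characters are the edit units.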

If we wish to compute Levenshtein Distance by implementing known algorithms, Rosetta Code has a page sharing implementations in many languages. A similar list of implementations is at Wikibooks. Some implementations are more memory efficient than others, using heap memory sparingly. In general, a naive recursive implementation will be inefficient compared to a dynamic programming approach.

## Sample Code

```java
// Source: https://www.baeldung.com/java-levenshtein-distance
// Accessed: 2019-09-03

import java.util.Arrays;

public class EditDistance {

    static int calculate(String x, String y) {
        int[][] dp = new int[x.length() + 1][y.length() + 1];

        for (int i = 0; i <= x.length(); i++) {
            for (int j = 0; j <= y.length(); j++) {
                if (i == 0) {
                    dp[i][j] = j;
                }
                else if (j == 0) {
                    dp[i][j] = i;
                }
                else {
                    dp[i][j] = min(dp[i - 1][j - 1]
                        + costOfSubstitution(x.charAt(i - 1), y.charAt(j - 1)),
                        dp[i - 1][j] + 1,
                        dp[i][j - 1] + 1);
                }
            }
        }

        return dp[x.length()][y.length()];
    }

    public static int costOfSubstitution(char a, char b) {
        return a == b ? 0 : 1;
    }

    public static int min(int... numbers) {
        return Arrays.stream(numbers)
            .min().orElse(Integer.MAX_VALUE);
    }
}
```



## Cite As

Devopedia. 2019. "Levenshtein Distance." Version 5, September 4. Accessed 2020-04-02. https://devopedia.org/levenshtein-distance