Jump to content

Wikipedia:Reference desk/Archives/Computing/2024 October 22

From Wikipedia, the free encyclopedia
Computing desk
< October 21 << Sep | October | Nov >> Current desk >
Welcome to the Wikipedia Computing Reference Desk Archives
The page you are currently viewing is a transcluded archive page. While you can leave answers for any questions shown below, please ask new questions on one of the current reference desk pages.


October 22

[edit]

Levenshtein Edit Distance and String Length

[edit]

I've noticed that Levenshtein edit distance tends to favor similarly length strings to best matches from a human point of view. For example, a human would say that Don and Donaldx are very similar. The edit distance is 4. But, A better match is Sam because the edit distance is 3. Instead of favoring Donaldx that has the original string Don with 4 extra letters, it favors Sam because it is the same length. Is there a popular alternative to Levenshtein edit distance that handles this issue? I have been researching and I feel it may be better to use Smith-Waterman because you get a local alignment with the edit distance of that alignment. It would align Don with Donaldx for a local alignment distance of 0 and align Don to Sam with a local alignment distance of 3. So, it would prefer Don to Donaldx. But, that algorithm is specifically for DNA sequences, not for words, so I feel that I would be misusing it in this case. 12.116.29.106 (talk) 13:24, 22 October 2024 (UTC)[reply]

Donovan, surely (Donald has only 3 extra letters). I've seen it pointed out that the distance from "Unit" to "Turkmenistan" is 8 edits, while the distance to "United States" is 9. You may prefer Jaro–Winkler distance, which respects prefixes.  Card Zero  (talk) 16:58, 22 October 2024 (UTC)[reply]
Thanks for pointing out the miscalculation. I added an 'x' to Donald to make it an edit distance of 4 and I am now looking at Jaro-Winkler distance. 12.116.29.106 (talk) 17:43, 22 October 2024 (UTC)[reply]
It is easy to define other edit distances. Select a (parametrized) set of "elementary" edit operations, assigning a positive weight to each operation. If it is possible to transform string A into string B by some sequence of elementary edits, define the path length from A to B as the lowest sum of the weights of the operations of such a sequence, taken over all sequences that perform the transformation. If impossible, define the path length as +∞. (If this set includes the operations of deleting or inserting a single character, then any string can be reached from any other string, so path lengths are then always finite.) The path length, thus defined, satisfies the directed triangle inequality. In general, the path length from B to A is not necessarily the same as the path length from A to B. If each elementary edit operation has an inverse operation with the same weight, we obtain symmetry.
The set of operations of deleting or inserting a single character, or replacing one character by another, each with weight 1, results in the Levenshtein distance. If you want an edit distance that corresponds better to our human intuition of similarity, include common types of spelling errors, such as tranpsosition of two adjacent characters, which gives you the Damerau–Levenshtein distance. To make this even better reflect typical human typos, the set should be extended to such typos as erronneously doubling a character, or conversely omiting one of two equal characters. Going further, deletion of say a word tail of N characters could be an elementary operation whose weight is only N, instead of a sequence of N deletion operations of a single character whose path length is N times the weight 1. I am not aware of an actual effort in this vein to define an edit distance that corresponds better to human intuition – which will be both language-dependent and culture-dependent.  --Lambiam 19:40, 22 October 2024 (UTC)[reply]
Metaphone (for Approximate string matching) is somewhat like that, trying to predict people's pronunciation-based guesses at spelling.  Card Zero  (talk) 05:51, 23 October 2024 (UTC)[reply]