Levenshtein Approximate String Matching

Collapse
X
 
  • Time
  • Show
Clear All
new posts
  • Rabbit
    Recognized Expert MVP
    • Jan 2007
    • 12517

    Levenshtein Approximate String Matching

    INTRODUCTION
    The Levenshtein Distance algorithm is an algorithm used to calculate the minimum number of edits required to transform one string into another string using addition, deletion, and substitution of characters.

    USES
    The most common use of the function is for approximate string matching. Since the function returns the minimum number of edits required to transform one string into another, the user can set the threshhold at which one string is considered a match to another.

    CODE
    The function below accepts two strings as its input and returns an integer representing the minimum number of required edits to transform string 1 into string 2.

    Because the computation time is O(n*m), you should be wary about using it on very long strings.

    A common modification of this algorithm is to allow transposition of characters. Allowing transposition of adjacent characters is the Damerau-Levenshtein Distance Algorithm.

    Code:
    Function Levenshtein(str1 As String, str2 As String) As Integer
    	Dim arrLev, intLen1 As Integer, intLen2 As Integer, i As Integer
    	Dim j As Integer, arrStr1, arrStr2, intMini, As Integer
    	
    	intLen1 = Len(str1)
    	ReDim arrStr1(intLen1 + 1)
    	intLen2 = Len(str2)
    	ReDim arrStr2(intLen2 + 1)
    	ReDim arrLev(intLen1 + 1, intLen2 + 1)
    	
    	arrLev(0, 0) = 0
    	For i = 1 To intLen1
    		arrLev(i, 0) = i
    		arrStr1(i) = Mid(str1, i, 1)
    	Next
    	
    	For j = 1 To intLen2
    		arrLev(0, j) = j
    		arrStr2(j) = Mid(str2, j, 1)
    	Next
    	
    	For j = 1 To intLen2
    		For i = 1 To intLen1
    			If arrStr1(i) = arrStr2(j) Then
    				arrLev(i, j) = arrLev(i-1, j-1)
    			Else
    				intMini = arrLev(i-1, j) 'deletion
    				If intMini > arrLev(i, j-1) Then intMini = arrLev(i, j-1) 'insertion
    				If intMini > arrLev(i-1, j-1) Then intMini = arrLev(i-1, j-1) 'deletion
    				
    				arrLev(i, j) = intMini + 1
    			End If
    		Next
    	Next
    	
    	Levenshtein = arrLev(intLen1, intLen2)
    End Function
  • beacon
    Contributor
    • Aug 2007
    • 579

    #2
    Hi Rabbit,

    This is interesting. Do you have any practical examples of when this function may come in handy?

    I've had a number of occasions when I've need to compare strings to one another, but I can't think of a time when I might need to know how many character modifications it would take to make two strings match.

    Thanks,
    beacon

    Comment

    • Rabbit
      Recognized Expert MVP
      • Jan 2007
      • 12517

      #3
      It would mostly be used to find mispelled duplicates.

      For example, if I had data of the following form
      Code:
      LASTNAME
      smith
      smit
      smoth
      snith
      snoth
      If I use the query
      Code:
      SELECT LASTNAME,
         Levenshtein("smith", LASTNAME) AS Distance
      FROM Table1
      Then the results would be
      Code:
      LASTNAME Distance
      smith    0
      smit     1
      smoth    1
      snith    1
      snoth    2
      Then I can, say, pull everything where the distance is 1 or less and examine those more closely to see if they're duplicates. I don't have to stop at just one field though, to get a better idea of whether or not a name is a match, I could calculate the distances for first name, last name, and address and if they're all within a certain threshhold, I could make a reasonably safe assumption that it's a duplicate record.

      Comment

      • Rabbit
        Recognized Expert MVP
        • Jan 2007
        • 12517

        #4
        Technically you could use it to calculate the distance between any two binary files. You could compare two jpegs and see how different they are.

        Anyways, here's the Damerau-Levenshtein algorithm. Since I code mostly in VBScript now, I didn't add in the data types like I did for the first one.

        This algorithm allows for transpositions of adjacent characters. Therefore, neighborhood and nieghborhood is distance 1 in this function as opposed to distance 2 in the prior function.
        Code:
        Function DamerauLevenshtein(str1, str2, Optional intSize = 256)
             Dim intTotalLen, arrDistance, intLen1, intLen2, i, j, arrStr1, arrStr2, arrDA, intMini
             Dim intDB, intI1, intJ1, intD
         
             str1 = UCase(str1)
             str2 = UCase(str2)
             intLen1 = Len(str1)
             intLen2 = Len(str2)
             intTotalLen = intLen1 + intLen2
             ReDim arrStr1(intLen1)
             ReDim arrStr2(intLen2)
             ReDim arrDA(intSize)
             ReDim arrDistance(intLen1 + 2, intLen2 + 2)
             arrDistance(0, 0) = intTotalLen
         
             For i = 0 To intSize - 1
                 arrDA(i) = 0
             Next
         
             For i = 0 To intLen1
                 arrDistance(i + 1, 1) = i
                 arrDistance(i + 1, 0) = intTotalLen
             Next
         
             For i = 1 To intLen1
                 arrStr1(i - 1) = Asc(Mid(str1, i, 1))
             Next
         
             For j = 0 To intLen2
                 arrDistance(1, j + 1) = j
                 arrDistance(0, j + 1) = intTotalLen
             Next
         
             For j = 1 To intLen2
                 arrStr2(j - 1) = Asc(Mid(str2, j, 1))
             Next
         
             For i = 1 To intLen1
                 intDB = 0
         
                 For j = 1 To intLen2
                     intI1 = arrDA(arrStr2(j - 1))
                     intJ1 = intDB
         
                     If arrStr1(i - 1) = arrStr2(j - 1) Then
                         intD = 0
                     Else
                         intD = 1
                     End If
         
                     If intD = 0 Then intDB = j
         
                     intMini = arrDistance(i, j) + intD
                     If intMini > arrDistance(i + 1, j) + 1 Then intMini = arrDistance(i + 1, j) + 1
                     If intMini > arrDistance(i, j + 1) + 1 Then intMini = arrDistance(i, j + 1) + 1
                     If intMini > arrDistance(intI1, intJ1) + i - intI1 + j - intJ1 - 1 Then intMini = arrDistance(intI1, intJ1) + i - intI1 + j - intJ1 - 1
         
                     arrDistance(i + 1, j + 1) = intMini
                 Next
         
                 arrDA(arrStr1(i - 1)) = i
             Next
         
             DamerauLevenshtein = arrDistance(intLen1 + 1, intLen2 + 1)
         End Function
        Last edited by Rabbit; Jan 6 '16, 06:26 PM. Reason: Fixed bugs

        Comment

        • webbeacon
          New Member
          • Dec 2015
          • 30

          #5
          Thank you so much for sharing this, Rabbit. Really helped me out!

          For anyone else planning to use this for matching lists of names, I would also recommend incorporating (a function to) strip all spaces and punctuation from the strings, further improving the accuracy of your matches. (Well technically I guess it makes them less accurate, but for my purposes it's more effective to disregard punctuation).
          Last edited by zmbd; Jan 6 '16, 10:44 PM. Reason: [op{added some add'l info for anyone else working on name matching}][z{removed link to other forum}]

          Comment

          • jforbes
            Recognized Expert Top Contributor
            • Aug 2014
            • 1107

            #6
            Thanks for the code Rabbit. It is greatly appreciated and I learned a great deal from it.

            We had a need where we wanted to compare two Bill of Materials (BOM) side by side to see their differences. I created a Form that had two SubForms showing the two Bill of Materials to be compared. So, as the user would click on a line on one Bill of Material, anything similar to that line would be highlighted on the other Bill of Material. This was accomplished in the OnPaint event of the SubFoms by passing the current line item and a reference line item (from the other BOM) to the Levenshtein method and comparing it to the Value of a slider. If the value is less than the value of the slider, the row is highlighted. It works quite well.

            In developing this, I got pretty familiar with the Algorithm and decided that I wanted to try a slightly different approach. I noticed that the time to run the algorithm grew exponentially with the length of the strings, basically some initial time to call the function, then the time to perform a nested loop:
            Time=Tstart + (Tcompare * S1length * S2length)

            This wasn’t a shock, and after I realized this, I notice that Rabbit had already pointed this out as O(n*m) in his original post.

            Knowing this, I tried a similar approach by counting the occurrence of each character for both strings and then subtracting the count from the total length, which would be the count of changes needed to make the strings identical. This changed the amount to time it would take to perform the comparison to grow linearly in comparison to the length of the strings to compare:
            Time=Tstart + (Tcompare * S1length) + (Tcompare * S2length)

            The two algorithms produced very similar results. I did some comparison testing on the times to run for both approaches and for me, if the string lengths were under fifteen characters the Levenshtein version was faster, but when the string lengths increased to twenty characters or longer, the getAltStringDis tance version consistently took less time. Since the application I was working on had on average fifty characters per string, the getAltStringDis tance version often worked faster.

            One last thing to note on the getAltStringDis tance version is that it is only performing counts on characters in the mMatchChars constant, currently "abcdefghijklmn opqrstuvwxyz012 3456789 -()/". So, it probably wouldn’t work well for the comparison of .JPEG as this list would have to grow to include every character.

            Lastly, I created a wrapper to call the most appropriate method based on string length. Best of both worlds, right? What I ended up with:
            Code:
            Public Const mMatchChars = "abcdefghijklmnopqrstuvwxyz0123456789 -()/"
            
                'getStringDistance
            Public Function getStringDistance(ByRef s1 As String, ByRef s2 As String) As Integer
                If Len(s1) > 17 And Len(s2) > 17 Then
                    getStringDistance = getAltStringDistance(s1, s2)
                Else
                    getStringDistance = Levenshtein(s1, s2)
                End If
            End Function
            
                'getAltStringDistance
            Public Function getAltStringDistance(ByVal s1 As String, ByVal s2 As String) As Integer
                
                Dim sTemp1 As String
                Dim sTemp2 As String
                Dim lLargest As Long
                Dim lMatches As Long
                Dim lRemainder As Long
                Dim lCount1 As Long
                Dim lCount2 As Long
                Dim lPos As Long
                    
                sTemp1 = s1
                sTemp2 = s2
                If Len(s1) > Len(s2) Then
                    lLargest = Len(s1)
                Else
                    lLargest = Len(s2)
                End If
                
                ' Loop through possible characters
                ' Perform the count and then remove
                ' the found characters from the strings
                ' A count of the Remaining characters is
                ' the Distance between the strings, sorta
                For lPos = 1 To Len(mMatchChars)
                    sTemp1 = Replace(s1, Mid(mMatchChars, lPos, 1), "")
                    sTemp2 = Replace(s2, Mid(mMatchChars, lPos, 1), "")
                    lCount1 = (Len(s1) - Len(sTemp1))
                    lCount2 = (Len(s2) - Len(sTemp2))
                    If lCount1 > lCount2 Then
                        lMatches = lMatches + lCount2
                    Else
                        lMatches = lMatches + lCount1
                    End If
                    s1 = sTemp1
                    s2 = sTemp2
                Next lPos
                
                If Len(s1) > Len(s2) Then
                    lRemainder = Len(s1)
                Else
                    lRemainder = Len(s2)
                End If
                getAltStringDistance = ((lLargest - lRemainder) - lMatches)
                
            End Function
            Thanks again for the code Rabbit!

            Comment

            • Rabbit
              Recognized Expert MVP
              • Jan 2007
              • 12517

              #7
              Hi jforbes, glad you were able to get some use out of the code. Your new algorithm is reminiscent of the ngram algorithm. The main difference is that the ngram algorithm takes into account the order of the characters. https://bytes.com/topic/access/insig...tring-matching

              Another one you might be interested in is the double metaphone algorithm which calculates the "sound" of a word.


              Both algorithms allow you to precalculate the values to make future comparisons quicker. Something that you can't do with Levenshtein.

              Comment

              Working...