NGram Approximate String Matching

Collapse
X
 
  • Time
  • Show
Clear All
new posts
  • Rabbit
    Recognized Expert MVP
    • Jan 2007
    • 12517

    NGram Approximate String Matching

    INTRODUCTION
    An ngram is the subsequence of n units from a given sequence. These units can be words, characters, etc. For example, the 2-gram, or bigram, of the phrase "Hello World" is "He", "el", "ll", "lo", "o ", " W", "Wo", "or", "rl", and "ld".

    USES
    NGram modeling is often used to model natural languages. It is also used to predict the next item in a sequence. I.E., given an NGram model with frequencies of occurrence, you can predict what the next item in the sequence is going to be.

    It is also used for approximate string matching. It is based on the observation that any two similar strings are likely to share many of the same NGrams. Because of this, it is also used to detect plagiarism. One of the benefits of using an ngram technique is the ability to index an ngram because the ngram of a string can be precalculated. This is in contrast to edit distance algorithms such as Levenshtein in which the string to compare and the string it is being compared to is part of the input. The ability to index an ngram leads to much faster searches but results in very large indexes.

    CODE
    The code below is written for VBScript but should be directly portable to VBA.

    The CreateNGram function takes a string and an integer as inputs. The integer defines the size of the n-gram you wish to use to create subsequences. It outputs a 2-dimensional array. The first item is the ngram and the second item is the frequency with which it occurs.

    The CompareNGram function takes two ngram arrays and outputs a double representing the similarity of the two arrays. The number returned is Dice's Coefficient of the two arrays. The higher the coefficient, the more similar the two strings.

    Code:
    Function CreateNGram(strInput, intN)
    	Dim arrNGram, intBound, i, j, strGram, didInc, arrTemp
    	
    	If Len(strInput) = 0 Then Exit Function
    	
    	ReDim arrNGram(Len(strInput) + 1, 1)
    	strInput = Chr(0) & UCase(Trim(strInput)) & Chr(0)
    	intBound = -1
    	
    	For i = 1 To Len(strInput)-intN+1
    		strGram = Mid(strInput, i, intN)
    		didInc = False
    		
    		For j = 0 To intBound
    			If strGram = arrNGram(j, 0) Then
    				arrNGram(j, 1) = arrNGram(j, 1) + 1
    				didInc = True
    				Exit For
    			End If
    		Next
    		
    		If Not didInc Then
    			intBound = intBound + 1
    			arrNGram(intBound, 0) = strGram
    			arrNGram(intBound, 1) = 1
    		End If
    	Next
    	
    	ReDim arrTemp(intBound, 1)
    	For i = 0 To intBound
    		arrTemp(i, 0) = arrNGram(i, 0)
    		arrTemp(i, 1) = arrNGram(i, 1)
    	Next
    	
    	CreateNGram = arrTemp
    End Function
    
    Function CompareNGram(arr1, arr2)
    	Dim i, j, intMatches, intCount1, intCount2
    	
    	intMatches = 0
    	intCount1 = 0
    	
    	For i = 0 To UBound(arr1)
    		intCount1 = intCount1 + arr1(i, 1)
    		intCount2 = 0
    	
    		For j = 0 To UBound(arr2)
    			intCount2 = intCount2 + arr2(j, 1)
    			
    			If arr1(i, 0) = arr2(j, 0) Then
    				If arr1(i, 1) >= arr2(j, 1) Then
    					intMatches = intMatches + arr2(j, 1)
    				Else
    					intMatches = intMatches + arr1(i, 1)
    				End If
    			End If
    		Next
    	Next
    	
    	CompareNGram = 2 * intMatches / (intCount1 + intCount2)
    End Function
    Last edited by Rabbit; Apr 27 '18, 05:09 PM.
  • GORKAL
    New Member
    • Jul 2015
    • 5

    #2
    Can u give a example queary

    Comment

    • Rabbit
      Recognized Expert MVP
      • Jan 2007
      • 12517

      #3
      You mean a sample call of the function? You call it the same way you would any other function.
      Code:
      CreateNGram("sample string", 2)

      Comment

      • soundsofthedeep
        New Member
        • Apr 2018
        • 1

        #4
        Hi Rabbit,

        Thanks for sharing this information! I'm trying to use your functions (in VBA) to compare some strings, and I'm getting a Type Mismatch error in the CreateNGram function. It seems like the arrNGram array and the arrTemp array are empty at all times during the run, resulting in an error during the function call. Any ideas?
        Last edited by soundsofthedeep; Apr 26 '18, 08:19 PM.

        Comment

        • Rabbit
          Recognized Expert MVP
          • Jan 2007
          • 12517

          #5
          The CreateNGram function returns a 2 dimensional array. It looks like that's an issue for the CompareNGram function because it's defaulting to a 1 dimensional empty array. I modified the function prototype above to fix the CompareNGram function so it just defaults to a Variant.

          If you're calling just CreateNGram by itself, then you need to treat the returned results as a 2 dimensional array.

          Comment

          Working...