INTRODUCTION
An ngram is the subsequence of n units from a given sequence. These units can be words, characters, etc. For example, the 2-gram, or bigram, of the phrase "Hello World" is "He", "el", "ll", "lo", "o ", " W", "Wo", "or", "rl", and "ld".
USES
NGram modeling is often used to model natural languages. It is also used to predict the next item in a sequence. I.E., given an NGram model with frequencies of occurrence, you can predict what the next item in the sequence is going to be.
It is also used for approximate string matching. It is based on the observation that any two similar strings are likely to share many of the same NGrams. Because of this, it is also used to detect plagiarism. One of the benefits of using an ngram technique is the ability to index an ngram because the ngram of a string can be precalculated. This is in contrast to edit distance algorithms such as Levenshtein in which the string to compare and the string it is being compared to is part of the input. The ability to index an ngram leads to much faster searches but results in very large indexes.
CODE
The code below is written for VBScript but should be directly portable to VBA.
The CreateNGram function takes a string and an integer as inputs. The integer defines the size of the n-gram you wish to use to create subsequences. It outputs a 2-dimensional array. The first item is the ngram and the second item is the frequency with which it occurs.
The CompareNGram function takes two ngram arrays and outputs a double representing the similarity of the two arrays. The number returned is Dice's Coefficient of the two arrays. The higher the coefficient, the more similar the two strings.
An ngram is the subsequence of n units from a given sequence. These units can be words, characters, etc. For example, the 2-gram, or bigram, of the phrase "Hello World" is "He", "el", "ll", "lo", "o ", " W", "Wo", "or", "rl", and "ld".
USES
NGram modeling is often used to model natural languages. It is also used to predict the next item in a sequence. I.E., given an NGram model with frequencies of occurrence, you can predict what the next item in the sequence is going to be.
It is also used for approximate string matching. It is based on the observation that any two similar strings are likely to share many of the same NGrams. Because of this, it is also used to detect plagiarism. One of the benefits of using an ngram technique is the ability to index an ngram because the ngram of a string can be precalculated. This is in contrast to edit distance algorithms such as Levenshtein in which the string to compare and the string it is being compared to is part of the input. The ability to index an ngram leads to much faster searches but results in very large indexes.
CODE
The code below is written for VBScript but should be directly portable to VBA.
The CreateNGram function takes a string and an integer as inputs. The integer defines the size of the n-gram you wish to use to create subsequences. It outputs a 2-dimensional array. The first item is the ngram and the second item is the frequency with which it occurs.
The CompareNGram function takes two ngram arrays and outputs a double representing the similarity of the two arrays. The number returned is Dice's Coefficient of the two arrays. The higher the coefficient, the more similar the two strings.
Code:
Function CreateNGram(strInput, intN) Dim arrNGram, intBound, i, j, strGram, didInc, arrTemp If Len(strInput) = 0 Then Exit Function ReDim arrNGram(Len(strInput) + 1, 1) strInput = Chr(0) & UCase(Trim(strInput)) & Chr(0) intBound = -1 For i = 1 To Len(strInput)-intN+1 strGram = Mid(strInput, i, intN) didInc = False For j = 0 To intBound If strGram = arrNGram(j, 0) Then arrNGram(j, 1) = arrNGram(j, 1) + 1 didInc = True Exit For End If Next If Not didInc Then intBound = intBound + 1 arrNGram(intBound, 0) = strGram arrNGram(intBound, 1) = 1 End If Next ReDim arrTemp(intBound, 1) For i = 0 To intBound arrTemp(i, 0) = arrNGram(i, 0) arrTemp(i, 1) = arrNGram(i, 1) Next CreateNGram = arrTemp End Function Function CompareNGram(arr1, arr2) Dim i, j, intMatches, intCount1, intCount2 intMatches = 0 intCount1 = 0 For i = 0 To UBound(arr1) intCount1 = intCount1 + arr1(i, 1) intCount2 = 0 For j = 0 To UBound(arr2) intCount2 = intCount2 + arr2(j, 1) If arr1(i, 0) = arr2(j, 0) Then If arr1(i, 1) >= arr2(j, 1) Then intMatches = intMatches + arr2(j, 1) Else intMatches = intMatches + arr1(i, 1) End If End If Next Next CompareNGram = 2 * intMatches / (intCount1 + intCount2) End Function
Comment