Fuzzy String Matching: Double Metaphone Algorithm

Collapse
X
 
  • Time
  • Show
Clear All
new posts
  • Rabbit
    Recognized Expert MVP
    • Jan 2007
    • 12517

    Fuzzy String Matching: Double Metaphone Algorithm

    Introduction
    One of the big issues with name matching is how prone it is to error. There are many different ways people spell the same name, typos, mishearing what the other person said. There are a myriad of ways that free form language data can be corrupted. And it causes many headaches when you need to search/match on the bad data.

    There are many different approaches to get around it. Like the Levenshtein algorithm which calculates how many edits it would take to make one string match another string. Or the NGram algorithm that exams the smaller sequences that a string is composed of and compares them to the sequences of a nother string. Then there are phonetic algorithms that encode a string based on how it would "sound". Like the SoundEx or Double Metaphone algorithm.

    What is Double Metaphone?
    Double Metaphone is a phonetic algorithm that takes a string and produces 2 encodings on how it could be pronounced in spoken language. Two encodings are produced because a word can sometimes be pronounced multiple ways.

    It is used for English and contains many rules on the "sounds" of a string. It also tries to account for differences in the English pronunciation of Romanized words in Slavic, Germanic, Celtic, Greek, French, Italian, Spanish, Chinese, and other origins.

    How Does it Work?
    The algorithm works by looking at hundreds of phonetic rules to create a phonetic string. Some examples of the rules it uses are:
    • P and B are encoded to the same sound. Unless the P is followed by an H which is then encoded to the F sound.
    • SCI is encoded to the S sound while SCO is encoded to the SK sound.
    • T and D are encoded to the same sound. Unless the T is followed by an H which is then encoded to the theta sound which is represented by the number 0.


    The algorithm produces 2, potentially, different pronunciations of the input word. The threshold parameter determines the max length of each pronunciation. So the LEFT(output, threshold) is the main pronunciation and the RIGHT(output, threshold) is a potentially different pronunciation. Hence the spaces when the the encoding produces a pronunciation with less characters than the threshold.

    The algorithm is coded to encode a single word at a time and doesn't handle non-alpha characters in the string. So those should probably be extracted out of the string before running the code. That or build the extraction into the function before any other processing.

    This is by design. When using a low threshold, longer words get truncated and may mistakenly match with shorter words. For example, with a threshold of 4, "foxtrot" and "fixed" have the same encoding.

    So instead of hardcoding the threshold, I included it as a parameter so the user has the option of using a larger threshold if they wish.

    Sample Implementation in VBScript
    This implementation takes a string s input along with a length i input and outputs 2 possible different phonetic encodings of the string, each of length I.

    For example, calling it with DoubleMetaphone ("Smith", 4) would return SM0 XMT , spaces included.

    You can use this to precalculate the pronunciation of names or words in your tables and compare the 2 different phonetic codes against all the other phonetic codes in the table to find names or words that sound similar.

    Code:
    Option Explicit
    
    MsgBox(DoubleMetaphone(InputBox("Enter String"), 6))
    
    Function DoubleMetaphone(strOriginal, intThreshhold)
    	Dim isSlavoGermanic, strPrimary, strSecondary, i, intJump, iB
    	Dim intLength, cP, cS, arr, x, intPad
    	
    	isSlavoGermanic = False
    	iB = 4
    	intPad = 6
    	x = iB
    	intLength = Len(strOriginal) + iB - 1
    	strOriginal = UCase(strOriginal)
    	
    	If (InStr(strOriginal, "W") + InStr(strOriginal, "K") + InStr(strOriginal, "CZ") + InStr(strOriginal, "WITZ")) <> 0 Then
    		isSlavoGermanic = True
    	End If
    	
    	ReDim arr(intLength + intPad + 1)
    	
    	For i = 0 To iB-1
    		arr(i) = vbTab
    	Next
    
    	For i = iB To intLength
    		arr(i) = Mid(strOriginal, i-iB+1, 1)
    	Next
    	
    	For i = intLength+1 To UBound(arr)
    		arr(i) = vbTab
    	Next
    	
    	Select Case (arr(x) & arr(x+1))
    		Case "AC"
    			strPrimary = "AKS"
    			strSecondary = "AKS"
    			x = x + 4
    		Case "GN", "KN", "PN", "PS"
    			x = x + 1
    		Case "HA", "HE", "HI", "HO", "HU", "HY"
    			strPrimary = "H"
    			strSecondary = "H"
    			x = x + 2
    		Case "WA", "WE", "WI", "WO", "WU", "WY"
    			strPrimary = "A"
    			strSecondary = "F"
    			x = x + 2
    		Case "WH"
    			strPrimary = "A"
    			strSecondary = "A"
    			x = x + 1
    		Case "SM", "SN", "SL", "SW"
    			strPrimary = "S"
    			strSecondary = "X"
    			x = x + 1
    		Case "GY"
    			strPrimary = "K"
    			strSecondary = "J"
    			x = x + 2
    	End Select
    	
    	If x = iB Then
    		If arr(x) & arr(x+1) & arr(x+2) & arr(x+3) = "JOSE" Then
    			If (x = iB And arr(x+4) = " ") Then
    				strPrimary = "HS"
    				strSecondary = "HS"
    				x = x + 4
    			End If
    		ElseIf arr(x) & arr(x+1) & arr(x+2) & arr(x+3) & arr(x+4) = "SUGAR" Then
    			strPrimary = "XK"
    			strSecondary = "SK"
    			x = x + 5
    		ElseIf arr(x) & arr(x+1) & arr(x+2) & arr(x+3) & arr(x+4) & arr(x+5) = "CAESAR" Then
    			strPrimary = "SSR"
    			strSecondary = "SSR"
    			x = x + 6
    		ElseIf (arr(x) & arr(x+1) & arr(x+2) & arr(x+3) & arr(x+4) & arr(x+5) = "CHARAC" Or _
    		arr(x) & arr(x+1) & arr(x+2) & arr(x+3) & arr(x+4) & arr(x+5) = "CHARIS" Or _
    		arr(x) & arr(x+1) & arr(x+2) & arr(x+3) = "CHOR" Or _
    		arr(x) & arr(x+1) & arr(x+2) & arr(x+3) = "CHYM" Or _
    		arr(x) & arr(x+1) & arr(x+2) & arr(x+3) = "CHEM") And _
    		arr(x) & arr(x+1) & arr(x+2) & arr(x+3) & arr(x+4) <> "CHORE" Then
    			strPrimary = "K"
    			strSecondary = "K"
    			x = x + 2
    		End If
    	End If
    	
    	If x = iB Then
    		Select Case arr(x) & arr(x+1) & arr(x+2)
    			Case "GES", "GEP", "GEB", "GEL", "GEY", "GIB", "GIL", "GIN", "GIE", "GEI", "GER"
    				strPrimary = "K"
    				strSecondary = "J"
    				x = x + 2
    			Case "GHI"
    				strPrimary = "J"
    				strSecondary = "J"
    				x = x + 3
    			Case "AGN", "EGN", "IGN", "OGN", "UGN", "UGY"
    				If Not isSlavoGermanic Then
    					strPrimary = "AKN"
    					strSecondary = "AN"
    					x = x + 3
    				End If
    		End Select
    	End If
    
    	If x = iB Then
    		Select Case arr(x)
    			Case "X"
    				strPrimary = "S"
    				strSecondary = "S"
    				x = x + 1
    			Case "A", "E", "I", "O", "U", "Y"
    				strPrimary = "A"
    				strSecondary = "A"
    				x = x + 1
    			Case "J"
    				strPrimary = "J"
    				strSecondary = "A"
    				x = x + 1
    		End Select
    	End If
    	
    	Do While x <= intLength
    		If Len(strPrimary) >= intThreshhold Then
    			Exit Do
    		End If
    		
    		intJump = 1
    		cP = arr(x)
    		cS = arr(x)
    		
    		Select Case arr(x)
    			Case "A", "E", "I", "O", "U", "Y"
    				cP = ""
    				cS = ""
    				
    			Case "B"
    				cP = "P"
    				cS = "P"
    				
    			Case "Ç"
    				cP = "S"
    				cS = "S"
    				
    			Case "C"
    				If x > iB+1 And arr(x-2) <> "A" And arr(x-2) <> "E" And arr(x-2) <> "I" And arr(x-2) <> "O" And arr(x-2) <> "U" And _
    				arr(x-2) <> "Y" And arr(x-1) & arr(x+1) = "AH" And ((arr(x+2) <> "I" And arr(x+2) <> "E") Or _
    				arr(x-2) & arr(x+2) & arr(x+3) = "BER" Or arr(x-2) & arr(x+2) & arr(x+3) = "MER") Then
    					cP = "K"
    					cS = "K"
    					intJump = 2
    				ElseIf arr(x+1) & arr(x+2) & arr(x+3) = "HIA" Then
    					cP = "K"
    					cS = "K"
    					intJump = 4
    				ElseIf arr(x+1) = "H" Then
    					If x > iB And arr(x+2) & arr(x+3) = "AE" Then
    						cP = "K"
    						cS = "X"
    						intJump = 2
    					ElseIf arr(iB) & arr(iB+1) & arr(iB+2) & arr(iB+3) = "VAN " Or _
    					arr(iB) & arr(iB+1) & arr(iB+2) & arr(iB+3) = "VON " Or _
    					arr(iB) & arr(iB+1) & arr(iB+2) = "SCH" Or arr(x+2) = "T" Or arr(x+2) = "S" Or _
    					arr(x-2) & arr(x-1) & arr(x+1) & arr(x+2) & arr(x+3) = "ORHES" Or _
    					arr(x-2) & arr(x-1) & arr(x+1) & arr(x+2) & arr(x+3) = "ARHIT" Or _
    					arr(x-2) & arr(x-1) & arr(x+1) & arr(x+2) & arr(x+3) = "ORHID" Or _ 
    					((arr(x-2) = "A" Or arr(x-2) = "E" Or arr(x-2) = "O" Or arr(x-2) = "U" Or x = iB) And _
    					(arr(x+2) = "L" Or arr(x+2) = "R" Or arr(x+2) = "N" Or arr(x+2) = "M" Or arr(x+2) = "B" Or _
    					arr(x+2) = "H" Or arr(x+2) = "F" Or arr(x+2) = "V" Or arr(x+2) = "W" Or arr(x+2) = " "))Then
    						cP = "K"
    						cS = "K"
    						intJump = 2
    					Else
    						intJump = 2
    						
    						If x > iB Then
    							If arr(iB) & arr(iB+1) = "MC" Then
    								cP = "K"
    								cS = "K"
    							Else
    								cP = "X"
    								cS = "K"
    							End If
    						Else
    							cP = "X"
    							cS = "X"
    						End If
    					End If
    				ElseIf arr(x+1) = "Z" And arr(x-2) & arr(x-1) <> "WI" Then
    					cP = "S"
    					cS = "X"
    					intJump = 2
    				ElseIf arr(x+1) & arr(x+2) & arr(x+2) = "CIA" Then
    					cP = "X"
    					cS = "X"
    					intJump = 3
    				ElseIf arr(x+1) = "C" And Not (x = iB+1 And arr(iB) = "M") Then
    					If (arr(x+2) = "I" Or arr(x+2) = "E" Or arr(x+2) = "H") And arr(x+2) & arr(x+3) <> "HU" Then
    						If arr(x-1) & arr(x+1) & arr(x+2) & arr(x+3) = "UCEE" Or _
    						arr(x-1) & arr(x+1) & arr(x+2) & arr(x+3) = "UCES" Then
    							cP = "KS"
    							cS = "KS"
    							intJump = 3
    						Else
    							cP = "X"
    							cS = "X"
    							intJump = 3
    						End If
    					Else
    						cP = "K"
    						cS = "K"
    						intJump = 2
    					End If
    				ElseIf arr(x+1) = "K" Or arr(x+1) = "G" Or arr(x+1) = "Q" Then
    					cP = "K"
    					cS = "K"
    					intJump = 2
    				ElseIf arr(x+1) = "I" Or arr(x+1) = "E" Or arr(x+1) = "Y" Then
    					If arr(x+1) & arr(x+2) = "IO" Or arr(x+1) & arr(x+2) = "IE" Or arr(x+1) & arr(x+2) = "IA" Then
    						cP = "S"
    						cS = "X"
    						intJump = 2
    					Else
    						cP = "S"
    						cS = "S"
    						intJump = 2
    					End If
    				Else
    					cP = "K"
    					cS = "K"
    						
    					If arr(x+1) & arr(x+2) = " C" Or arr(x+1) & arr(x+2) = " Q" Or arr(x+1) & arr(x+2) = " G" Then
    						intJump = 3
    					Else
    						If (arr(x+1) = "C" Or arr(x+1) = "K" Or arr(x+1) = "Q") And _
    						arr(x+1) & arr(x+2) <> "CE" And arr(x+1) & arr(x+2) <> "CI" Then 
    							intJump = 2
    						End If
    					End If
    				End If
    					
    			Case "D"
    				If arr(x+1) = "G" Then
    					If arr(x+2) = "I" Or _
    					arr(x+2) = "E" Or _
    					arr(x+2) = "Y" Then
    						cP = "J"
    						cS = "J"
    						intJump = 3
    					Else
    						cP = "TK"
    						cS = "TK"
    						intJump = 2
    					End If
    				ElseIf arr(x+1) = "T" Then
    					cP = "T"
    					cS = "T"
    					intJump = 2
    				Else
    					cP = "T"
    					cS = "T"
    				End If
    				
    			Case "G"
    				If arr(x+1) = "H" Then
    					If x <> iB And arr(x-1) <> "A" And arr(x-1) <> "E" And arr(x-1) <> "I" _
    					And arr(x-1) <> "O" And arr(x-1) <> "U" And arr(x-1) <> "Y" Then
    						cP = "K"
    						cS = "K"
    						intJump = 2
    					ElseIf (x > iB+1 And (arr(x-2) = "B" Or arr(x-2) = "H" Or arr(x-2) = "D")) Or _
    					(x > iB+2 And (arr(x-3) = "B" Or arr(x-3) = "H" Or arr(x-3) = "D")) Or _
    					(x > iB+3 And (arr(x-4) = "B" Or arr(x-4) = "H")) Then
    						cP = ""
    						cS = ""
    						intJump = 2
    					Else
    						If x > iB+2 And arr(x-1) = "U" And _
    						(arr(x-3) = "C" Or arr(x-3) = "G" Or arr(x-3) = "L" Or arr(x-3) = "R" Or arr(x-3) = "T") Then
    							cP = "F"
    							cS = "F"
    							intJump = 2
    						ElseIf x > iB And arr(x-1) <> "I" Then
    							cP = "K"
    							cS = "K"
    							intJump = 2
    						Else
    							cP = ""
    							cS = ""
    						End If
    					End If
    				ElseIf arr(x+1) = "N" Then
    					cS = "KN"
    					intJump = 2
    					
    					If arr(x+2) & arr(x+3) <> "EY" And Not isSlavoGermanic Then
    						cP = "N"
    					Else
    						cP = "KN"
    					End If
    				ElseIf arr(x+1) & arr(x+2) = "LI" And Not isSlavoGermanic Then
    					cP = "KL"
    					cS = "L"
    					intJump = 2
    				ElseIf (arr(x+1) & arr(x+2) = "ER" Or arr(x+1) = "Y") And _
    				arr(x-1) <> "E" And arr(x-1) <> "I" And _
    				arr(x-1) & arr(x+1) <> "RY" And _
    				arr(x-1) & arr(x+1) <> "OY" And _
    				arr(iB) & arr(iB+1) & arr(iB+2) & arr(iB+3) & arr(iB+4) & arr(iB+5) <> "DANGER" And _
    				arr(iB) & arr(iB+1) & arr(iB+2) & arr(iB+3) & arr(iB+4) & arr(iB+5) <> "RANGER" And _
    				arr(iB) & arr(iB+1) & arr(iB+2) & arr(iB+3) & arr(iB+4) & arr(iB+5) <> "MANGER" Then
    					cP = "K"
    					cS = "J"
    					intJump = 2
    				ElseIf arr(x+1) = "E" Or arr(x+1) = "I" Or arr(x+1) = "Y" Or _
    				arr(x-1) & arr(x) & arr(x+1) & arr(x+2) = "AGGI" Or _
    				arr(x-1) & arr(x) & arr(x+1) & arr(x+2) = "OGGI" Then
    					If arr(iB) & arr(iB+1) & arr(iB+2) & arr(iB+3) = "VON " Or _
    					arr(iB) & arr(iB+1) & arr(iB+2) & arr(iB+3) = "VAN " Or _
    					arr(iB) & arr(iB+1) & arr(iB+2) = "SCH" Or _
    					arr(x+1) & arr(x+2) = "ET" Then
    						cP = "K"
    						cS = "K"
    						intJump = 2
    					Else
    						cP = "J"
    						If arr(x+1) & arr(x+2) & arr(x+3) & arr(x+4) = "IER " Then
    							cS = "J"
    							intJump = 3
    						Else
    							cS = "K"
    							intJump = 2
    						End If
    					End If
    				Else
    					cP = "K"
    					cS = "K"
    				End If
    				
    			Case "H"
    				If (arr(x-1) = "A" Or _
    				arr(x-1) = "E" Or _
    				arr(x-1) = "I" Or _
    				arr(x-1) = "O" Or _
    				arr(x-1) = "U" Or _
    				arr(x-1) = "Y") And _
    				(arr(x+1) = "A" Or _
    				arr(x+1) = "E" Or _
    				arr(x+1) = "I" Or _
    				arr(x+1) = "O" Or _
    				arr(x+1) = "U" Or _
    				arr(x+1) = "Y") Then
    					intJump = 2
    				Else
    					cP = ""
    					cS = ""
    				End If
    				
    			Case "J"
    				If arr(iB) & arr(iB+1) & arr(iB+2) & arr(iB+3) = "SAN " Then
    					cP = "H"
    					cS = "H"
    				Else
    					If Not isSlavoGermanic And ( _
    					arr(x-1) = "A" Or _
    					arr(x-1) = "E" Or _
    					arr(x-1) = "I" Or _
    					arr(x-1) = "O" Or _
    					arr(x-1) = "U" Or _
    					arr(x-1) = "Y") And ( _
    					arr(x+1) = "A" Or _
    					arr(x+1) = "O") Then
    						cS = "H"
    					Else
    						If x = intLength Then
    							cS = ""
    						Else
    							If arr(x+1) = "L" Or arr(x+1) = "T" Or arr(x+1) = "K" Or _
    							arr(x+1) = "S" Or arr(x+1) = "N" Or arr(x+1) = "M" Or _
    							arr(x+1) = "B" Or arr(x+1) = "Z" Or _
    							arr(x-1) = "S" Or arr(x-1) = "K" Or arr(x-1) = "L" Then
    								cP = ""
    								cS = ""
    							End If
    						End If
    					End If
    				End If
    				
    			Case "L"
    				If arr(x+1) = "L" Then
    					intJump = 2
    					
    					If ((x = intLength-2 And ( _
    					arr(x-1) & arr(x) & arr(x+1) & arr(x+2) = "ILLO" Or _
    					arr(x-1) & arr(x) & arr(x+1) & arr(x+2) = "ILLA" Or _
    					arr(x-1) & arr(x) & arr(x+1) & arr(x+2) = "ALLE" _
    					)) Or (( _
    					arr(intLength-1) & arr(intLength) = "AS" Or _
    					arr(intLength-1) & arr(intLength) = "OS" Or _
    					arr(intLength) = "A" Or arr(intLength) = "O") And _
    					arr(x-1) & arr(x) & arr(x+1) & arr(x+2) = "ALLE")) Then
    						cS = ""
    					End If
    				End If
    				
    			Case "M"
    				If arr(x-1) & arr(x) & arr(x+1) = "UMB" And _
    				(x = intLength-1 Or arr(x+2) & arr(x+3) = "ER") Then
    					intJump = 2
    				End If
    				
    			Case "P"
    				Select Case arr(x+1)
    					Case "H"
    						cP = "F"
    						cS = "F"
    						intJump = 2
    					Case "B"
    						intJump = 2
    				End Select
    				
    			Case "Q"
    				cP = "K"
    				cS = "K"
    				
    			Case "R"
    				If x = intLength And Not isSlavoGermanic And _
    				arr(x-2) & arr(x-1) = "IE" And _
    				arr(x-4) & arr(x-3) <> "ME" And _
    				arr(x-4) & arr(x-3) <> "MA" Then
    					cP = ""
    				End If
    				
    			Case "S"
    				If arr(x+1) = "L" And (arr(x-1) = "I" Or arr(x-1) = "Y") Then
    					cP = ""
    					cS = ""
    				ElseIf arr(x+1) = "H" And _
    				arr(x+2) & arr(x+3) & arr(x+4) <> "EIM" And _
    				arr(x+2) & arr(x+3) & arr(x+4) <> "OEK" And _
    				arr(x+2) & arr(x+3) & arr(x+4) <> "OLM" And _
    				arr(x+2) & arr(x+3) & arr(x+4) <> "OLZ" Then
    					intJump = 2
    					cP = "X"
    					cS = "X"
    				ElseIf Not isSlavoGermanic And ( _
    				arr(x+1) & arr(x+2) = "IA" Or _
    				arr(x+1) & arr(x+2) = "IO") Then
    					intJump = 3
    					cS = "X"
    				ElseIf arr(x+1) = "Z" Then
    					cS = "X"
    					intJump = 2
    				ElseIf arr(x+1) = "C" Then
    					intJump = 3
    					
    					If arr(x+2) = "H" Then
    						If arr(x+3) & arr(x+4) = "OO" Or _
    						arr(x+3) & arr(x+4) = "ER" Or _
    						arr(x+3) & arr(x+4) = "EN" Or _
    						arr(x+3) & arr(x+4) = "UY" Or _
    						arr(x+3) & arr(x+4) = "ED" Or _
    						arr(x+3) & arr(x+4) = "EM" Then
    							cS = "SK"
    								
    							If arr(x+3) & arr(x+4) = "ER" Or _
    							arr(x+3) & arr(x+4) = "EN" Then
    								cP = "X"
    							Else
    								cP = "SK"
    							End If
    						Else
    							cP = "X"
    							
    							If x <> iB Or arr(iB+3) = "W" Or arr(iB+3) = "A" Or _
    							arr(iB+3) = "E" Or arr(iB+3) = "I" Or arr(iB+3) = "O" Or _
    							arr(iB+3) = "U" Or arr(iB+3) = "Y" Then
    								cS = "X"
    							End If
    						End If
    					ElseIf arr(x+2) = "I" Or arr(x+2) = "E" Or arr(x+2) = "Y" Then
    					Else
    						cP = "SK"
    						cS = "SK"
    					End If
    				ElseIf x = intLength And arr(x-1) = "I" And ( _
    				arr(x-2) = "A" Or arr(x-2) = "O") Then
    					cP = ""
    				End If
    				
    			Case "T"
    				If arr(x+1) & arr(x+2) & arr(x+3) = "ION" _
    				Or arr(x+1) & arr(x+2) = "IA" _
    				Or arr(x+1) & arr(x+2) = "CH" Then
    					cP = "X"
    					cS = "X"
    					intJump = 3
    				ElseIf (arr(x+1) = "H" Or arr(x+1) & arr(x+2) = "TH") And _
    				(arr(x+2) & arr(x+3) <> "OM" And _
    				arr(x+2) & arr(x+3) <> "AM" And _
    				arr(iB) & arr(iB+1) & arr(iB+2) <> "SCH" And _
    				arr(iB) & arr(iB+1) & arr(iB+2) & arr(iB+3) <> "VAN " And _
    				arr(iB) & arr(iB+1) & arr(iB+2) & arr(iB+3) <> "VON ") Then
    					cP = "0"
    					intJump = 2
    				ElseIf arr(x+1) = "D" Then
    					intJump = 2
    				End If
    				
    			Case "V"
    				cP = "F"
    				cS = "F"
    				
    			Case "W"
    				If arr(x+1) = "R" Then
    					cP = "R"
    					cS = "R"
    					intJump = 2
    				ElseIf arr(iB) & arr(iB+1) & arr(iB+2) = "SCH" _
    				Or (x = intLength And ( _
    				arr(x-1) = "A" Or _
    				arr(x-1) = "E" Or _
    				arr(x-1) = "I" Or _
    				arr(x-1) = "O" Or _
    				arr(x-1) = "U" Or _
    				arr(x-1) = "Y")) _
    				Or ((arr(x-1) = "E" Or arr(x-1) = "O") And _
    				(arr(x+1) & arr(x+2) & arr(x+3) = "SKI" Or _
    				arr(x+1) & arr(x+2) & arr(x+3) = "SKY")) Then
    					cP = ""
    					cS = "F"
    				ElseIf arr(x+1) & arr(x+2) & arr(x+3) = "ICZ" _
    				Or arr(x+1) & arr(x+2) & arr(x+3) = "ITZ" Then
    					cP = "TS"
    					cS = "FX"
    					intJump = 4
    				Else
    					cP = ""
    					cS = ""
    				End If
    				
    			Case "X"
    				If x = intLength And _
    				(arr(x-3) & arr(x-2) & arr(x-1) = "IAU" Or _ 
    				arr(x-3) & arr(x-2) & arr(x-1) = "EAU" Or _
    				arr(x-2) & arr(x-1) = "AU" Or _
    				arr(x-2) & arr(x-1) = "OU") Then
    					cP = ""
    					cS = ""
    				Else
    					cP = "KS"
    					cS = "KS"
    				End If
    				
    				If arr(x+1) = "C" Then
    					intJump = 2
    				End If
    				
    			Case "Z"
    				If arr(x+1) = "H" Then
    					cP = "J"
    					cS = "J"
    				ElseIf (arr(x+1) & arr(x+2) = "ZO" Or _
    				arr(x+1) & arr(x+2) = "ZI" Or _
    				arr(x+1) & arr(x+2) = "ZA") _
    				Or (isSlavoGermanic And x <> iB And arr(x-1) = "T") Then
    					cP = "S"
    					cS = "TS"
    				Else
    					cP = "S"
    					cS = "S"
    				End If
    		End Select
    
    		strPrimary = strPrimary & cP
    		strSecondary = strSecondary & cS
    		
    		If arr(x) = arr(x+1) And arr(x) <> "C" Then
    			intJump = intJump + 1
    		End If
    		x = x + intJump
    	Loop
    	
    	For i = 1 To intThreshhold
    		strPrimary = strPrimary & " "
    		strSecondary = strSecondary & " "
    	Next
    	
    	DoubleMetaphone = Left(strPrimary, intThreshhold) & Left(strSecondary, intThreshhold)
    End Function
    Last edited by Rabbit; Dec 22 '15, 11:39 PM.
  • jforbes
    Recognized Expert Top Contributor
    • Aug 2014
    • 1107

    #2
    This looks like a lot of fun. I can't wait to have time to experiment with it. Thanks.

    Comment

    • zmbd
      Recognized Expert Moderator Expert
      • Mar 2012
      • 5501

      #3
      J,
      Just played with it some in Rabbit's roughdraft,
      Code ports almost directly into Access-VBA with only a need to wrap line 3 into a sub-procedure.

      Comment

      • Rabbit
        Recognized Expert MVP
        • Jan 2007
        • 12517

        #4
        Thanks Z. And thanks for testing it against another implementation online. That helps ease my mind on whether or not it was implemented properly.

        Comment

        • zmbd
          Recognized Expert Moderator Expert
          • Mar 2012
          • 5501

          #5
          Rabbit,
          Still playing with this, I had initially just hand typed twenty or so random words in to an on-line implementation.
          Now I'm playing in a database and I noticed that there is a space between the two groups, one of the online versions ran the two groups together, one did not.. and I closed the window and I have my history set to flush:

          so for "agencies" you get "AJNSS AKNXS"

          So in my people template:
          Code:
          SELECT people_FirstName
             , DoubleMetaphone([people_firstName],6)
               AS DMP
          FROM tbl_people;
          yields
          Note the name Foxtrot, here there is no space....
          Code:
          people_FirstName       DMP:
          Alpha                  ALF   ALF   
          Beta                   PT    PT    
          Charlie                XRL   XRL   
          Delta                  TLT   TLT   
          Echo                   AX    AK    
          [iCODE]Foxtrot                FKSTRTFKSTRT[/iCODE]
          Golf                   KLF   KLF   
          Hotel                  HTL   HTL   
          India                  ANT   ANT   
          Juliet                 JLT   ALT   
          Kilo                   KL    KL    
          Lima                   LM    LM    
          Mike                   MK    MK    
          November               NFMPR NFMPR 
          October                AKTPR AKTPR 
          Papa                   PP    PP    
          Quebec                 KPK   KPK   
          Romeo                  RM    RM    
          Sierra                 SR    SR    
          Tango                  TNK   TNK   
          Uniform                ANFRM ANFRM 
          Victor                 FKTR  FKTR  
          Whisky                 ASK   ASK   
          Xray                   SR    SR    
          Yankee                 ANK   ANK   
          Zulu                   SL    SL
          I've not stepped thru the code yet...
          Last edited by zmbd; Dec 22 '15, 06:19 PM.

          Comment

          • Rabbit
            Recognized Expert MVP
            • Jan 2007
            • 12517

            #6
            The algorithm produces 2, potentially, different pronunciations of the input word. The threshold parameter determines the max length of each pronunciation. So the LEFT(output, threshold) is the main pronunciation and the RIGHT(output, threshold) is a potentially different pronunciation. Hence the spaces when there the encoding produces a pronunciation with less characters than the threshold. Threshold of course is a misnomer. It's more of a limit or max length. Not sure why I used threshold at the time.

            As for Foxtrot, the algorithm is coded to encode a single word at a time and doesn't handle non-alpha characters in the string. Such as the space before Foxtrot. So those should probably be extracted out of the string before running the code. That or build the extraction into the function before any other processing.

            Comment

            • zmbd
              Recognized Expert Moderator Expert
              • Mar 2012
              • 5501

              #7
              + Thank you for explaining that, not being familiar with this algorithm. Between the two different implementations and some of the articles I was skimming also smashing the two parts together in to one string made it unclear for me as to how the proper output should be.

              + The apparent space in "Foxtrot" is an artifact of the formatting in the posting, sorry, didn't intend to mislead you... :(
              SO I should be seeing "FKSTRT FKSTRT" for "foxtrot"?
              Directly in the immedate pane:
              Code:
              ?DoubleMetaphone("Foxtrot",6)
              FKSTRTFKSTRT
              
              ?DoubleMetaphone("November",6)
              NFMPR NFMPR
              
              ?DoubleMetaphone("October",6)
              AKTPR AKTPR
              Using foxtrot, STOP at line 586:
              Code:
              ?"'" & strPrimary & "'" & len(strPrimary)
              'FKSTRT      ' 12
              ?"'" & strSecondary & "'" & len(strPrimary)
              'FKSTRT      ' 12
              ?"'" & Left(strPrimary, intThreshhold) & "'"
              'FKSTRT'
              ?"'" & Left(strSecondary, intThreshhold) & "'"
              'FKSTRT'
              Should we have a check for length of strPrimary and strSecondary against the intThreshhold and add one or does the fact that the threshold matches the length of the encoding correctly merge the two?
              Last edited by zmbd; Dec 22 '15, 08:05 PM.

              Comment

              • Rabbit
                Recognized Expert MVP
                • Jan 2007
                • 12517

                #8
                Lines 587-590 will pad out the different pronunciations to the threshold if necessary.

                Line 592 will truncate the different pronunciations if it exceeds the threshold.

                When the encoding matches or exceeds the threshold, the 2 pronunciations run together without any spaces in between.

                Comment

                • zmbd
                  Recognized Expert Moderator Expert
                  • Mar 2012
                  • 5501

                  #9
                  So, if understand correctly, this is by design?

                  One of the implementations (http://swoodbridge.com/DoubleMetaPhone/mptest.php3) appears to be using 4 as the threshold and the result returns with "foxtrot yields 'FKST' and 'FKST'"

                  so I am still confused about the correct output.

                  Comment

                  • Rabbit
                    Recognized Expert MVP
                    • Jan 2007
                    • 12517

                    #10
                    This is by design. When using a low threshold, longer words get truncated and may mistakenly match with shorter words. For example, with a threshold of 4, foxtrot and fixed have the same encoding.

                    So instead of hardcoding the threshold, I included it as a parameter so the user has the option of using a larger threshold if they wish.

                    Comment

                    • Rabbit
                      Recognized Expert MVP
                      • Jan 2007
                      • 12517

                      #11
                      I've added some of the clarification that came up as a result of our discussion into the main article.
                      Last edited by zmbd; Dec 22 '15, 11:44 PM. Reason: [z{you're awesome, thnk you!}]

                      Comment

                      Working...