String Matching with Mistakes

jkmyoung

Recognized Expert Top Contributor

Join Date: Mar 2006

Posts: 2057
#1

String Matching with Mistakes

Apr 14 '10, 02:50 PM

Hey, I'm trying to find an algorithm to approximately match different substrings within a master list, with some forgiveness on spelling.

Say you are allowed 3 errors in the string match:
Defining error: having to remove or add a character.
If you have to replace a character, that counts as 2 errors.
Also the search is case-insensitive for now.

So abcd and abc have 1 error
bcd and abcde have 2 errors
abcd and bcde have 2 errors
ace and abde have 3 errors

eg: (*contrived example) say the master list contains

Code:

Alpha Beta Gamma Delta Epsilon Zeta Eta Theta Iota Kappa Lambda Mu Nu Xi Omicron Pi Rho Sigma Tau Upsilon Phi Chi Psi Omega

If someone types in "elta" they get:
eta with 1 error
delta with 1 error
beta with 2 errors
zeta with 2 errors
theta with 3 errors

The problem I have is, how do I determine when I should skip characters, add characters etc? This is particularly a problem when matching words with many of the same character, eg maybe
banana - banna
construction - constution
construction - constitution (shouldn't match)

Is there an easy way to determine on a mismatch whether I should:
1. Ignore a character from source?
2. Ignore a character from search substring?
3. Backtrack in some other way?

Or is there some way to 'hash' the string such that there are particular properties that we can use on the hash to determine if they are in fact similar?
Tags: errors, forgiveness, matching, searching, substring
RedSon

Recognized Expert Expert

Join Date: Jan 2007

Posts: 4980
#2

Apr 14 '10, 06:26 PM

It sounds like you are looking for some kind of predictive text and or partial/full matching library.

Entire companies are created to tackle this issue. This is a common problem with phones; hence T9. The T9 problem is not trivial.

I would suggest looking at open source libraries to see how they match text to a dictionary based on input: http://sourceforge.net/projects/lib378/

This project has been dormant for almost 2 years but you might be able to take it over and improve on it.
Comment

String Matching with Mistakes

String Matching with Mistakes

Comment