String Matching with Mistakes

Collapse
X
 
  • Time
  • Show
Clear All
new posts
  • jkmyoung
    Recognized Expert Top Contributor
    • Mar 2006
    • 2057

    String Matching with Mistakes

    Hey, I'm trying to find an algorithm to approximately match different substrings within a master list, with some forgiveness on spelling.

    Say you are allowed 3 errors in the string match:
    Defining error: having to remove or add a character.
    If you have to replace a character, that counts as 2 errors.
    Also the search is case-insensitive for now.

    So abcd and abc have 1 error
    bcd and abcde have 2 errors
    abcd and bcde have 2 errors
    ace and abde have 3 errors

    eg: (*contrived example) say the master list contains
    Code:
    Alpha
    Beta
    Gamma
    Delta
    Epsilon
    Zeta
    Eta
    Theta
    Iota
    Kappa
    Lambda
    Mu
    Nu 
    Xi 
    Omicron 
    Pi 
    Rho 
    Sigma 
    Tau 
    Upsilon 
    Phi 
    Chi 
    Psi 
    Omega
    If someone types in "elta" they get:
    eta with 1 error
    delta with 1 error
    beta with 2 errors
    zeta with 2 errors
    theta with 3 errors

    The problem I have is, how do I determine when I should skip characters, add characters etc? This is particularly a problem when matching words with many of the same character, eg maybe
    banana - banna
    construction - constution
    construction - constitution (shouldn't match)

    Is there an easy way to determine on a mismatch whether I should:
    1. Ignore a character from source?
    2. Ignore a character from search substring?
    3. Backtrack in some other way?

    Or is there some way to 'hash' the string such that there are particular properties that we can use on the hash to determine if they are in fact similar?
  • RedSon
    Recognized Expert Expert
    • Jan 2007
    • 4980

    #2
    It sounds like you are looking for some kind of predictive text and or partial/full matching library.

    Entire companies are created to tackle this issue. This is a common problem with phones; hence T9. The T9 problem is not trivial.

    I would suggest looking at open source libraries to see how they match text to a dictionary based on input: http://sourceforge.net/projects/lib378/

    This project has been dormant for almost 2 years but you might be able to take it over and improve on it.

    Comment

    Working...