Using Soundex (OT?)

  • Ricky Romaya

    Using Soundex (OT?)

    Hi,

    I'm curious about soundex. All I know is that it's a way of doing
    spelling-error-tolerant word matching. What I want to know is whether the
    soundex algorithm was made exclusively for the English language, or whether
    it can be used for any arbitrary language with satisfactory performance (by
    'satisfactory performance' I mean that it can detect at least 80% of
    spelling errors). What about PHP soundex support?

    TIA
  • Andy Hassall

    #2
    Re: Using Soundex (OT?)

    On 05 Feb 2005 19:09:04 GMT, Ricky Romaya <something@somewhere.com> wrote:

    >I'm curious about soundex. All I know is that it's a way of doing
    >spelling-error-tolerant word matching. What I want to know is whether the
    >soundex algorithm was made exclusively for the English language, or whether
    >it can be used for any arbitrary language with satisfactory performance (by
    >'satisfactory performance' I mean that it can detect at least 80% of
    >spelling errors). What about PHP soundex support?

    Soundex is for English words, based on English pronunciation rules. See:
    http://en.wikipedia.org/wiki/Soundex

    There's also a reference there to Metaphone, which is supposedly better, but
    also English-based.

    --
    Andy Hassall / <andy@andyh.co.uk> / <http://www.andyh.co.uk>
    <http://www.andyhsoftware.co.uk/space> Space: disk usage analysis tool


    • Markku Uttula

      #3
      Re: Using Soundex (OT?)

      Andy Hassall wrote:
      > Soundex is for English words, based on English pronunciation rules.
      > See: http://en.wikipedia.org/wiki/Soundex

      You *can* of course cook up your own Soundex functions with code
      values based on other languages; the algorithm is very simple. For
      some languages this might be rather easy, but possibly not worth the
      effort: though the original algorithm is for English, it will work
      "quite well" for many other languages too.

      It's worth noting that soundex (and similar functions) only work on
      individual words, and that they aren't meant for detecting spelling
      errors. The best use for soundex is when you're searching for names,
      addresses or the like and don't know how they are actually written,
      but know what they sound like. You can store the soundex values in
      the database alongside the other data, and when you do a search, you
      first look for the exact string the user entered. If this doesn't
      return enough results, you compute the soundex value of the user
      input and search on that. This way you get results that "sound the
      same", so they're probably close to what you were really looking
      for. I think a similar approach is used by the search engine at
      www.php.net (I can't be certain, but it seems that way; see
      http://fi.php.net/manual-lookup.php?pattern=sundeks for example :)
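
For anyone who wants to try the lookup described above, here is a minimal sketch in Python (in PHP the built-in soundex() would replace the hand-written function). It assumes the common American Soundex rules. The table name `entries`, the `enough` threshold and the sample words are made up for illustration, and an in-memory sqlite3 database just stands in for whatever database you actually use.

```python
import sqlite3

# Letter groups of American Soundex; vowels and h, w, y carry no code.
CODES = {}
for digit, group in enumerate(("bfpv", "cgjkqsxz", "dt", "l", "mn", "r"), 1):
    for letter in group:
        CODES[letter] = str(digit)


def soundex(word):
    """Return the 4-character Soundex code of `word` ("" if it has no a-z letters)."""
    letters = [c for c in word.lower() if "a" <= c <= "z"]
    if not letters:
        return ""
    code = letters[0].upper()
    last = CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w do not separate letters that share a code
        digit = CODES.get(ch, "")
        if digit and digit != last:
            code += digit
        last = digit  # a vowel resets `last`, so a repeat after it is coded again
    return (code + "000")[:4]


def make_db(names):
    """Store every name together with its precomputed Soundex code."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE entries (name TEXT, code TEXT)")
    conn.executemany(
        "INSERT INTO entries VALUES (?, ?)", [(n, soundex(n)) for n in names]
    )
    return conn


def search(conn, query, enough=1):
    """Try an exact match first; fall back to the Soundex code if too few rows."""
    rows = conn.execute(
        "SELECT name FROM entries WHERE name = ? ORDER BY name", (query,)
    ).fetchall()
    if len(rows) < enough:
        rows = conn.execute(
            "SELECT name FROM entries WHERE code = ? ORDER BY name",
            (soundex(query),),
        ).fetchall()
    return [row[0] for row in rows]


# Hypothetical sample data, loosely modelled on a manual lookup.
DB = make_db(["soundex", "metaphone", "levenshtein", "similar_text"])
print(search(DB, "metaphone"))  # exact match is enough
print(search(DB, "sundeks"))    # no exact match, so the code is used
```

With this sample data, "sundeks" has no exact match, so the search falls back to the stored code, which is the same one that "soundex" gets.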

      --
      Markku Uttula


      • Markku Uttula

        #4
        Re: Using Soundex (OT?)

        Markku Uttula wrote:
        > http://fi.php.net/manual-lookup.php?pattern=sundeks for example:)

        I hate to comment on my own postings, but I need to add that the
        php.net manual page for Soundex is quite good to read. It also has
        links to some other functions (Metaphone and Levenshtein) that might
        prove useful.

        --
        Markku Uttula


        • Chung Leong

          #5
          Re: Using Soundex (OT?)

          "Ricky Romaya" <something@somewhere.com> wrote in message
          news:Xns95F519E2650C4rickyralexandriacc@66.250.146.159...
          > Hi,
          >
          > I'm curious about soundex. All I know is that it's a way of doing
          > spelling-error-tolerant word matching. What I want to know is whether the
          > soundex algorithm was made exclusively for the English language, or whether
          > it can be used for any arbitrary language with satisfactory performance (by
          > 'satisfactory performance' I mean that it can detect at least 80% of
          > spelling errors). What about PHP soundex support?
          >
          > TIA

          Soundex is really only good for surnames. You can't use it for general text
          search since it'd yield too many irrelevant results. It was designed for
          grouping similar surnames and not for handling typos. Names that are
          spelled very differently could end up with the same value. For example,
          Sznyder, Schneider, and Snyder are all given S536, while Smith, Smit, and
          Schmidt get S530.

          Soundex can handle surnames of foreign origin. For example, the variants of
          my own--Leong, Leung, Liang, Long--all have the same soundex value.



          • Philip Ronan

            #6
            Re: Using Soundex (OT?)

            Chung Leong wrote:

            > Soundex is really only good for surnames. You can't use it for general text
            > search since it'd yield too many irrelevant results. It was designed for
            > grouping similar surnames and not for handling typos. Names that are
            > spelled very differently could end up with the same value. For example,
            > Sznyder, Schneider, and Snyder are all given S536, while Smith, Smit, and
            > Schmidt get S530.
            >
            > Soundex can handle surnames of foreign origin. For example, the variants of
            > my own--Leong, Leung, Liang, Long--all have the same soundex value.

            I found that a combination of the metaphone and Levenshtein function works
            better for first names -- I'm using it to suggest alternatives in a
            dictionary here:

            <http://www.japanesetranslator.co.uk/your-name-in-japanese/>

            It's supposed to be a dictionary of English names, but a lot of them are
            actually of foreign origin (like most "English" names, I guess).

            If I remember correctly, the Soundex function was a bit too clumsy and threw
            out hundreds of alternatives for some unrecognized spellings, and none for
            others.

            Instead I use the metaphone function to search for possible alternatives,
            and then sort them based on their Levenshtein distance from the search term.
            It works pretty well.
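
The two steps described above (pick candidates by phonetic key, then rank them by edit distance) could be sketched like this in Python. PHP already ships metaphone() and levenshtein(); Python's standard library has neither, so the Levenshtein distance is written out by hand, and `crude_key` is only a hypothetical stand-in for metaphone that is much rougher than the real thing. The name list is invented, and this is not the code behind the site mentioned above.

```python
def levenshtein(a, b):
    """Edit distance: minimum number of insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute (free if equal)
        prev = cur
    return prev[-1]


def crude_key(word):
    """Hypothetical stand-in for metaphone(): keep the first letter, drop later
    vowels and h/w/y, and collapse repeated consonants that end up adjacent."""
    w = word.lower()
    key = w[:1]
    for ch in w[1:]:
        if ch in "aeiouyhw":
            continue
        if key[-1] != ch:
            key += ch
    return key


def suggest(query, names, key=crude_key):
    """Keep names sharing the query's phonetic key, closest spelling first."""
    wanted = key(query)
    candidates = [n for n in names if key(n) == wanted]
    q = query.lower()
    return sorted(candidates, key=lambda n: (levenshtein(q, n.lower()), n))


# Invented sample dictionary of first names.
NAMES = ["Stephanie", "Stephen", "Steven", "Stefan", "Stuart"]
print(suggest("Stephan", NAMES))
```

Note that the stand-in key misses "Stefan" and "Steven" because it does not know that "ph", "f" and "v" sound alike; that kind of rule is the reason to use a real phonetic function for the first step.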

            --
            phil [dot] ronan @ virgin [dot] net




            • Ricky Romaya

              #7
              Re: Using Soundex (OT?)

              "Markku Uttula" <markku.uttula@disconova.com> wrote in
              news:rVfNd.1805$UV3.1572@reader1.news.jippii.net:

              > Markku Uttula wrote:
              >> http://fi.php.net/manual-lookup.php?pattern=sundeks for example:)
              >
              > I hate to comment on my own postings, but I need to add that the
              > php.net manual page for Soundex is quite good to read. It also has
              > links to some other functions (Metaphone and Levenshtein) that might
              > prove useful.
              >
              Well, could someone suggest a way to mimic Google's 'suggested
              keyword' functionality that works across different languages? I've
              done some reading about soundex, metaphone, and levenshtein, which
              IMHO are designed exclusively for English.

              Also, I've read about aspell & pspell in the PHP manual. Sadly, they
              don't work on the win32 platform (not to mention that they need an
              additional module, which I don't have the authority to install). Is
              there any way to simulate them in pure PHP?

              TIA
