Looking for library to estimate likeness of two strings

Collapse
This topic is closed.
X
X
 
  • Time
  • Show
Clear All
new posts
  • agenkin@gmail.com

    Looking for library to estimate likeness of two strings

    Are there any Python libraries implementing measurement of similarity
    of two strings of Latin characters?

    I'm writing a script to guess-merge two tables based on people's
    names, which are not necessarily spelled the same way in both tables
    (especially the given names). I would like some function that would
    help me make the best guess.

    Many thanks in advance!
  • Tim Chase

    #2
    Re: Looking for library to estimate likeness of two strings

    Are there any Python libraries implementing measurement of similarity
    of two strings of Latin characters?
    It sounds like you're interested in calculating the Levenshtein
    distance:



    which gives you a measure of how different they are. A measure
    of "0" is that the inputs are the same. The more different the
    two strings are, the greater the resulting output of the function.

    Unfortunately, it's an O(MN) algorithm (where M=len(word1) and
    N=len(word2)) from my understanding of the code I've seen.
    However it really is the best approximation I've seen of a "how
    similar are these two strings" function. Googling for

    python levenshtein distance

    brings up oodles of hits.

    -tkc



    Comment

    • Steven D'Aprano

      #3
      Re: Looking for library to estimate likeness of two strings

      On Wed, 06 Feb 2008 17:32:53 -0600, Robert Kern wrote:
      Jeff Schwab wrote:
      ....
      >If the strings happen to be the same length, the Levenshtein distance
      >is equivalent to the Hamming distance.
      ....
      I'm afraid that it isn't. Using Magnus Lie Hetland's implementation:
      ....
      In [4]: hamdist('abcdef ', 'fabcde')
      Out[4]: 6
      >
      In [5]: levenshtein('ab cdef', 'fabcde')
      Out[5]: 2

      I can confirm Robert's results, using a different implementation of the
      Levenshtein edit distance. It isn't true that Levenshtein simplifies to
      Hamming when the strings are equal length.


      For what it's worth, here's my implementation, which I wrote some years
      ago. I doubt it's optimal.





      def levenshtein(s1, s2):
      """Return the Levenshtein edit distance between two strings.

      s1 and s2 should be two arbitrary length strings. Returns an
      integer count of the edit distance between the strings.
      """
      m, n = len(s1), len(s2)
      if m == 0: return n
      elif n == 0: return m
      # Create an array with rows 0..m and columns 0..n.
      array = [None]*(m+1)
      for i in range(m+1):
      array[i] = [None]*(n+1)
      # Initialise the first row to 0..n.
      for i in range(n+1):
      array[0][i] = i
      # Initialize the first column to 0..m.
      for i in range(m+1):
      array[i][0] = i
      # Measure the differences.
      for i in range(1, m+1):
      c1 = s1[i-1]
      for j in range(1, n+1):
      c2 = s2[j-1]
      cost = int(c1 != c2)
      x = array[i][j-1] + 1 # Cell immediately above plus one.
      y = array[i-1][j] + 1 # Cell immediately left plus one.
      z = array[i-1][j-1] + cost # Cell diagonally above and
      # to the left, plus the cost.
      array[i][j] = min(x, y, z)
      # When done, the bottom-right cell contains the Levenshtein distance.
      return array[m][n]




      --
      Steven

      Comment

      • Jeff Schwab

        #4
        Re: Looking for library to estimate likeness of two strings

        Steven D'Aprano wrote:
        On Wed, 06 Feb 2008 17:32:53 -0600, Robert Kern wrote:
        >
        >Jeff Schwab wrote:
        ...
        >>If the strings happen to be the same length, the Levenshtein distance
        >>is equivalent to the Hamming distance.
        ...
        >I'm afraid that it isn't. Using Magnus Lie Hetland's implementation:
        ...
        I can confirm Robert's results, using a different implementation of the
        Levenshtein edit distance. It isn't true that Levenshtein simplifies to
        Hamming when the strings are equal length.
        Whoops! My mistake.

        Comment

        • Daniel Fetchinson

          #5
          Re: Looking for library to estimate likeness of two strings

          Are there any Python libraries implementing measurement of similarity
          of two strings of Latin characters?
          >
          I'm writing a script to guess-merge two tables based on people's
          names, which are not necessarily spelled the same way in both tables
          (especially the given names). I would like some function that would
          help me make the best guess.
          >
          Many thanks in advance!

          Hi folks, just went through this thread and a related one from 2006
          and I was wondering what the best solution is for using these string
          metrics in a database search. If I want to query the database for a
          string or something that is close to it (close being defined by one of
          the string metrics discussed above) it seems I have to select each and
          every word from the database and compare it with the query word which
          is very ineffective.

          Very concretely if I have an sqlite db with SQLObject as the ORM what
          would be the most effective way of doing such a query?

          Cheers,
          Daniel

          Comment

          • Lee Capps

            #6
            Re: Looking for library to estimate likeness of two strings

            At 14:01 Wed 06 Feb 2008, agenkin@gmail.c om wrote:
            >Are there any Python libraries implementing measurement of similarity
            >of two strings of Latin characters?
            >
            >I'm writing a script to guess-merge two tables based on people's
            >names, which are not necessarily spelled the same way in both tables
            >(especially the given names). I would like some function that would
            >help me make the best guess.
            >
            >Many thanks in advance!
            I used difflib.get_clo se_matches for something similar:

            Source code: Lib/difflib.py This module provides classes and functions for comparing sequences. It can be used for example, for comparing files, and can produce information about file differences i...


            HTH.

            --
            Lee Capps
            Technology Specialist
            CTE Resource Center


            Comment

            • JKPeck

              #7
              Re: Looking for library to estimate likeness of two strings

              On Feb 7, 6:11 am, Lee Capps <lca...@ctereso urce.orgwrote:
              At 14:01 Wed 06 Feb 2008, agen...@gmail.c om wrote:
              >
              Are there any Python libraries implementing measurement of similarity
              of two strings of Latin characters?
              >
              I'm writing a script to guess-merge two tables based on people's
              names, which are not necessarily spelled the same way in both tables
              (especially the given names). I would like some function that would
              help me make the best guess.
              >
              Many thanks in advance!
              >
              I used difflib.get_clo se_matches for something similar:
              >
              Source code: Lib/difflib.py This module provides classes and functions for comparing sequences. It can be used for example, for comparing files, and can produce information about file differences i...

              >
              HTH.
              >
              --
              Lee Capps
              Technology Specialist
              CTE Resource Center
              Algorithms typically used for name comparisons include soundex,
              nysiis, and levenshtein distance. The last is more general and
              variations are used in spell checkers. You can probably Google for
              Python versions. You can find implementations of these comparison
              functions at
              www.spss.com/devcentral in the extendedTransfo rms.py module.
              (Requires a login but free).

              HTH,
              Jon Peck

              Comment

              • agenkin@gmail.com

                #8
                Re: Looking for library to estimate likeness of two strings

                Hi, All:

                Many thanks for the excellent leads. I've also found several
                functions to find phonetic similarity between English names: the
                mentioned above soundex, then, also, one called metaphone. I'm now
                thinking of the best way to use some combination of these functions.

                Comment

                • agenkin@gmail.com

                  #9
                  Re: Looking for library to estimate likeness of two strings

                  On Feb 7, 2:37 am, "Daniel Fetchinson" <fetchin...@goo glemail.com>
                  wrote:
                  Hi folks, just went through this thread and a related one from 2006
                  and I was wondering what the best solution is for using these string
                  metrics in a database search. If I want to query the database for a
                  string or something that is close to it (close being defined by one of
                  the string metrics discussed above) it seems I have to select each and
                  every word from the database and compare it with the query word which
                  is very ineffective.
                  I have never used sqlite database, but Postgres has a module that
                  implements levenshtein(), soundex() and metaphone() functions, so you
                  can do something like this:

                  SELECT * FROM s WHERE soundex(name) = soundex('john') ;
                  SELECT * FROM s WHERE difference(name , 'john') 2;


                  Comment

                  • Guilherme Polo

                    #10
                    Re: Looking for library to estimate likeness of two strings

                    2008/2/7, agenkin@gmail.c om <agenkin@gmail. com>:
                    On Feb 7, 2:37 am, "Daniel Fetchinson" <fetchin...@goo glemail.com>
                    wrote:
                    >
                    Hi folks, just went through this thread and a related one from 2006
                    and I was wondering what the best solution is for using these string
                    metrics in a database search. If I want to query the database for a
                    string or something that is close to it (close being defined by one of
                    the string metrics discussed above) it seems I have to select each and
                    every word from the database and compare it with the query word which
                    is very ineffective.
                    >
                    >
                    I have never used sqlite database, but Postgres has a module that
                    implements levenshtein(), soundex() and metaphone() functions, so you
                    can do something like this:
                    >
                    SELECT * FROM s WHERE soundex(name) = soundex('john') ;
                    SELECT * FROM s WHERE difference(name , 'john') 2;
                    >

                    >
                    SQLite supports soundex, but it is disabled by default, you need to
                    compile it with -DSQLITE_SOUNDEX =1

                    --
                    -- Guilherme H. Polo Goncalves

                    Comment

                    Working...