Match beginning of two strings

Collapse
This topic is closed.
X
X
 
  • Time
  • Show
Clear All
new posts
  • Bengt Richter

    #16
    Re: Match beginning of two strings

    On Mon, 04 Aug 2003 11:56:04 GMT, Alex Martelli <aleax@aleax.it > wrote:
    [color=blue]
    >Ravi wrote:
    >[color=green]
    >> Hi,
    >>
    >> I have about 200GB of data that I need to go through and extract the
    >> common first part of a line. Something like this.
    >>[color=darkred]
    >> >>>a = "abcdefghijklmn opqrstuvwxyz"
    >> >>>b = "abcdefghijklmn opBHLHT"
    >> >>>c = extract(a,b)
    >> >>>print c[/color]
    >> "abcdefghijklmn op"
    >>
    >> Here I want to extract the common string "abcdefghijklmn op". Basically I
    >> need a fast way to do that for any two given strings. For my situation,
    >> the common string will always be at the beginning of both strings. I can[/color]
    >
    >Here's my latest study on this:
    >
    >*** pexa.py:
    >[/color]
    [...]

    JFTHOI, if you have the inclination, I'm curious how this slightly
    different 2.3-dependent version would fare in your harness on your
    system with the rest:

    def commonprefix(s1 , s2): # very little tested!
    try:
    for i, c in enumerate(s1):
    if c != s2[i]: return s1[:i]
    except IndexError:
    return s1[:i]
    return s1

    [...]
    [color=blue]
    >
    >and my measurements give me:
    >
    >[alex@lancelot exi]$ python -O timeit.py -s 'import pexa' \[color=green]
    >> 'pexa.extract(" abcdefghijklmon pKOU", "abcdefghijklmo npZE")'[/color]
    >100000 loops, best of 3: 2.39 usec per loop
    >[alex@lancelot exi]$ python -O timeit.py -s 'import pexa'
    >'pexa.extract( "abcdefghijklmo npKOU", "abcdefghijklmo npZE")'
    >100000 loops, best of 3: 2.14 usec per loop
    >[alex@lancelot exi]$ python -O timeit.py -s 'import pexa'
    >'pexa.extract2 ("abcdefghijklm onpKOU", "abcdefghijklmo npZE")'
    >10000 loops, best of 3: 30.2 usec per loop
    >[alex@lancelot exi]$ python -O timeit.py -s 'import pexa'
    >'pexa.extract3 ("abcdefghijklm onpKOU", "abcdefghijklmo npZE")'
    >100000 loops, best of 3: 9.59 usec per loop
    >[alex@lancelot exi]$ python -O timeit.py -s 'import pexa'
    >'pexa.extract_ pyrex("abcdefgh ijklmonpKOU", "abcdefghijklmo npZE")'
    >10000 loops, best of 3: 21.8 usec per loop
    >[alex@lancelot exi]$ python -O timeit.py -s 'import pexa'
    >'pexa.extract_ c("abcdefghijkl monpKOU", "abcdefghijklmo npZE")'
    >100000 loops, best of 3: 1.88 usec per loop
    >[alex@lancelot exi]$
    >[/color]
    Interesting, but I think I will have to write a filter so I can
    see a little more easily what your timeit.py outputs say ;-)

    Regards,
    Bengt Richter

    Comment

    • Ravi

      #17
      Re: Match beginning of two strings

      Ravi wrote:[color=blue]
      > Hi,
      >
      > I have about 200GB of data that I need to go through and extract the
      > common first part of a line. Something like this.
      >[color=green][color=darkred]
      > >>>a = "abcdefghijklmn opqrstuvwxyz"
      > >>>b = "abcdefghijklmn opBHLHT"
      > >>>c = extract(a,b)
      > >>>print c[/color][/color]
      > "abcdefghijklmn op"
      >
      > Here I want to extract the common string "abcdefghijklmn op". Basically I
      > need a fast way to do that for any two given strings. For my situation,
      > the common string will always be at the beginning of both strings. I can
      > use regular expressions to do this, but from what I understand there is
      > a lot of overhead. New data is being generated at the rate of about 1GB
      > per hour, so this needs to be reasonably fast while leaving CPU time for
      > other processes.
      >
      > Thanks
      > Ravi
      >[/color]

      I really appreciate all your help, Alex, Jim, Jeff, Andrew, John, Richie
      and Bengt. However I have this problem taken care of now. Took around 6
      hours to run on a P4 2.8Ghz 1.0GB DDR (I suspect I/O limitations). As
      for the data, if you want to know about it just for the sake of an
      optimized algorithm, there are no Null (\0) characters in the strings
      (actually they're Base64), and I've included a typical pair of strings.
      The version I used was Andrew's.

      Someone suggested that this would be better done in larger sets than
      just pairs. That's not suitable because of the structure of the data,
      two strings might be highly correlated, but are probably quite different
      from another pair of strings. Perhaps more significantly, correlation in
      sets of greater than two has no physical significance to the experiment.

      I grabbed this from a typical data file. So I would want to be
      extracting 'A832nv81a'
      "
      A832nv81a81nW10 3v9c24jgpy92T
      A832nv81aTyqiep 4v9c324jgpy92T
      "

      Thanks for your help everyone, coming from a Perl (It's a four letter
      word to me :) world, I'm very impressed by how helpful all of you are.

      Ravi

      Comment

      • John Machin

        #18
        Re: Match beginning of two strings

        Richie Hindle <richie@entrian .com> wrote in message news:<mailman.1 060011610.28906 .python-list@python.org >...[color=blue]
        >
        > for C strings. There's another similar optimisation that the C
        > output leads you to: you can use strlen rather than Python's len:
        >[/color]

        You can, if you don't care about the possibility that the input may contain NULs.

        Comment

        • Bengt Richter

          #19
          Re: Match beginning of two strings

          On Mon, 04 Aug 2003 18:53:27 -0400, Ravi <rxs141@cwru.ed u> wrote:
          [color=blue]
          >Ravi wrote:[color=green]
          >> Hi,
          >>
          >> I have about 200GB of data that I need to go through and extract the
          >> common first part of a line. Something like this.
          >>[color=darkred]
          >> >>>a = "abcdefghijklmn opqrstuvwxyz"
          >> >>>b = "abcdefghijklmn opBHLHT"
          >> >>>c = extract(a,b)
          >> >>>print c[/color]
          >> "abcdefghijklmn op"
          >>
          >> Here I want to extract the common string "abcdefghijklmn op". Basically I
          >> need a fast way to do that for any two given strings. For my situation,
          >> the common string will always be at the beginning of both strings. I can
          >> use regular expressions to do this, but from what I understand there is
          >> a lot of overhead. New data is being generated at the rate of about 1GB
          >> per hour, so this needs to be reasonably fast while leaving CPU time for
          >> other processes.
          >>
          >> Thanks
          >> Ravi
          >>[/color]
          >
          >I really appreciate all your help, Alex, Jim, Jeff, Andrew, John, Richie
          >and Bengt. However I have this problem taken care of now. Took around 6
          >hours to run on a P4 2.8Ghz 1.0GB DDR (I suspect I/O limitations). As
          >for the data, if you want to know about it just for the sake of an
          >optimized algorithm, there are no Null (\0) characters in the strings
          >(actually they're Base64), and I've included a typical pair of strings.
          >The version I used was Andrew's.
          >
          >Someone suggested that this would be better done in larger sets than
          >just pairs. That's not suitable because of the structure of the data,
          >two strings might be highly correlated, but are probably quite different
          >from another pair of strings. Perhaps more significantly, correlation in
          >sets of greater than two has no physical significance to the experiment.
          >
          >I grabbed this from a typical data file. So I would want to be
          >extracting 'A832nv81a'
          >"
          >A832nv81a81nW1 03v9c24jgpy92T
          >A832nv81aTyqie p4v9c324jgpy92T
          >"[/color]
          I still don't understand your use case ;-)

          1) Are you periodically batch processing to look for near-pairs in a 200GB
          unsorted list of strings? Are they just in a huge unsorted ascii file with
          line feeds delimiting?

          or

          1a) Does someone send you a single string in a message every now and then (how often?)
          and ask you to find the closest match in the data base? I.e., 200GB at 30 bytes/string
          would be 6.67 billion strings. Which you say you now crunch in ~6hrs? You don't do that
          amount of crunching just for one match request, do you?

          1b) If you get a lot of "match requests," wouldn't you gain a lot by at least partially
          ordering the incoming streams into separate buckets? E.g., the simplest thing would be
          to have 64 files corresponding to the first character, or even 64*64 files for the first two.
          Then have 64*64 file buffers (not open files) of say 1000 strings each, which would be
          64*64*1000*30 or call it 32k/file buffer -> 64*64*32*1024/(1024*1024) ->128 megabytes of buffers
          and on the average you'd expect 1GB/hr to fill a buffer of 32k about
          (1e9/3600)/(32*1024) -> 8.477 times/second. It should be reasonable to open a file for append
          and write 32k and close it that often, IWT. Whateve box writes the 200GB data now could either
          do the partitioning itself or send it by gigabit ethernet to a dedicated box for that, IWT. Or
          you could distribute the load by patitioning the ethernet destinations per the leading n bits
          and let 2**n boxes maybe do even more ordering on the fly.

          Even without distributing the load, this could give you 200GB/4096 (if evenly distributed!!)
          or only '%e'%(200e9/4096) ->4.882813e+00 7 or less than 50MB to read off disk and search
          to make a probe with a single string. If you sorted when you probed and wrote back the
          sorted data with an indication that it was sorted, further probes to that same partition could
          go a lot faster.

          2) Why do the string lengths vary if they are samples from some process? Are they
          effectively just right-stripped from some fixed length, which would be a guaranteed max length?
          2a) what are the max and min lengths?

          3) what is the distribution of leading characters as they come from the source?
          3a) E.g., is the first character uniformly distributed within its possible Base64 codes?
          3b) Are they uncorrelated timewise? Or e.g. do they arrive in clumps of <<64 possibilities for a time?
          3c) Are the individual strings' characters randomly distributed and uncorrelated within the string?
          3d) You say 9000GB compresses to 700GB. That suggests a lot of correlation and/or uneven distributions.
          Is there a fixed dictionary you could use to repack the strings bit-wise with huffman and/or
          rle compression?
          3d1) What is the significance of the Base64 character boundaries? I.e., would a common prefix of
          8-bit bytes representing sequentially compressed string be good enough, even if the first non-
          matching byte contained some bits that did match and would have been included in base64?

          3d2) Note that compressing during partitioned 4096-bucket storage could save a lot of space
          as well as speed up matching.

          4) 1GB/hr translates to about 10k strings like the your example per second.
          4a) Are they just logged sequentially to a 200GB data store? (cf. 1b)
          4b==2a) Is there a max and/or min length to these strings? (the above are 29 & 30 chars).
          4c) You say they are base 64. That means binary would make (6/8)*200gb = 150GB, and the
          strings (6/8)*30 ~ 22.5 or say 24 for a nice multiple of 8, even without compression.

          Just some thoughts. Might want to check my arithmetic ;-)
          [color=blue]
          >
          >Thanks for your help everyone, coming from a Perl (It's a four letter
          >word to me :) world, I'm very impressed by how helpful all of you are.
          >[/color]
          Some newsgroups are like that. In the past I spent a fair amount of time on the Borland
          Delphi n.g., and that was mostly very friendly and helpful too. I think that's just the
          way people are unless some stupidities get in the way. Renaissance yes! Armageddon no!

          Regards,
          Bengt Richter

          Comment

          • Mañungo

            #20
            Re: Match beginning of two strings

            "Andrew Dalke" <adalke@mindspr ing.com> wrote in message news:<bgi9g8$dc 0$1@slb6.atl.mi ndspring.net>.. .[color=blue]
            > Ravi:[color=green]
            > > Read in both strings.
            > > Check to see if the first character matches.
            > > If yes:
            > > Check halfway through the string and see if that character matches
            > > Repeatedly check halfway until the difference point is found.
            > > Go back through from the difference point backwards and make sure
            > > the characters match from the start to the difference point.
            > >
            > > I timed it, and it seems to be doing about 3.5usec per loop.[/color]
            >
            > There's a lot of overhead for doing that. Have you tried the simple
            >
            > char *s1 = ... the first string ..
            > char *s2 = ... the second string ..
            > n = ... the shorter of the two ..
            > for(i=0; i<n; i++) {
            > if (*s1++ != *s2++) {
            > break;
            > }
            > }
            > return ... the string s1[:n] (or even just the int)
            >
            > Easy to understand, and the CPU is spending almost its whole
            > time doing character tests.
            >
            > Andrew
            > dalke@dalkescie ntific.com[/color]

            First, I'm not a python programmer... but I think is better test int.
            Something like that:

            int *a = (int *) ...first string...;
            int *b = (int *) ...second string...;

            for( i = 0; i < n; i += 4 )
            if( *a++ != *b++ )
            break;
            char *aa = (char *) a;
            char *bb = (char *) b;
            if( *aa++ == *bb++ ) i++;
            if( *aa++ == *bb++ ) i++;
            if( *aa == *bb ) i++;

            and return i.

            Comment

            • Andrew Dalke

              #21
              Re: Match beginning of two strings

              Mañungo:[color=blue]
              > First, I'm not a python programmer... but I think is better test int.
              > Something like that:
              >
              > int *a = (int *) ...first string...;
              > int *b = (int *) ...second string...;[/color]

              Sure. That's an old trick. However, your code assumes your
              strings are word aligned. (Perhaps for Python they are but
              in general you cannot make that assumption about C char *s)

              You also have an off-by-one/off-by-four error. Suppose the
              two strings are "A", that is, an 'A' followed by a NUL. Then

              for (i=0; i<n; i+=4)
              if (*a++ != *b++)
              break

              will compare the four bytes and increment the counters.
              You then do

              char *aa = (char *) a;

              and work with the characters *after* the first four characters
              tested in the int compare.

              Finally, you assume ints are 4 bytes long, which is
              not universal.

              There's something to be said for simplicity. I also
              wonder if modern optimizing compliers could figure
              out some of this automatically, but I don't wonder
              enough to find out.

              Andrew
              dalke@dalkescie ntific.com


              Comment

              Working...