Some optimization tale

  • Stephan Diehl

    Some optimization tale

    A while ago, I posted a recipe about finding a common prefix to a list
    of strings. The recipe itself is quite bad (I have to admit), and I
    didn't know at the time that this problem was already solved in the
    os.path module.
    The interesting part can be found in the comments, as this turned out to
    be a quest for the most efficient algorithm to solve this particular
    problem.
    All of this can be found at

    http://aspn.activestate.com/ASPN/Coo.../Recipe/252177

    What I was most surprised at was the inefficiency of the trivial solutions
    (and that the right algorithm indeed makes a difference).

    If you have read this far, you might be interested in the actual algorithms.
    (If you are interested in the one-liner versions, please have a look at the
    recipe.)
    Here they are:

    -- os.path.commonprefix --------------------------------------------------

    def f1(m):
        "Given a list of pathnames, returns the longest common leading component"
        if not m: return ''
        prefix = m[0]
        for item in m:
            for i in range(len(prefix)):
                if prefix[:i+1] != item[:i+1]:
                    prefix = prefix[:i]
                    if i == 0:
                        return ''
                    break
        return prefix

    ---------------------------------------------------------------------------

    The problem with this algorithm is the copying of all those small strings.
    This can easily be fixed:

    --- optimized os.path.commonprefix ----------------------------------------

    def f2(m):
        "Given a list of pathnames, returns the longest common leading component"
        if not m: return ''
        if len(m) == 1: return m[0]
        prefix = m[0]
        for i in xrange(len(prefix)):
            for item in m[1:]:
                if prefix[i] != item[i]:
                    return prefix[:i]
        return prefix[:i]

    ---------------------------------------------------------------------------

    Now it gets interesting. It turns out that the above algorithm doesn't
    scale well.

    Some anonymous submitter suggested the following:

    ---- by anonymous ---------------------------------------------------------

    def f3(seq):
        if not seq: return ""
        seq.sort()
        s1, s2 = seq[0], seq[-1]
        l = min(len(s1), len(s2))
        if l == 0:
            return ""
        for i in xrange(l):
            if s1[i] != s2[i]:
                return s1[0:i]
        return s1[0:l]

    ---------------------------------------------------------------------------

    It is just not necessary to compare all strings in the list. It is enough to
    sort the list first and then compare the first and the last element: any
    prefix shared by those two (the lexicographically smallest and largest
    strings) is shared by everything in between. Even though the 'sort'
    algorithm is coded in C and is therefore quite fast, the order of runtime
    has changed.

    Michael Dyck then pointed out that instead of using 'sort', 'min' and 'max'
    should be used. While tests suggest that this is true, I have no idea why
    that should be, since finding a minimum or maximum involves some sorting
    anyway (unless we have a quantum computer at hand); my reasoning would be
    that sorting once should be faster than computing both the maximum and the
    minimum.
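
    As an aside, here is a quick way to check this claim directly (a sketch
    constructed for illustration, not part of the original post; the test data
    is made up):

    import random
    import timeit

    # Hypothetical micro-benchmark: sorting the whole list versus two linear
    # scans with min()/max() over the same data.
    words = ['prefix%06d' % random.randrange(10**6) for _ in range(1000)]

    t_sort = timeit.timeit(lambda: sorted(words), number=1000)
    t_minmax = timeit.timeit(lambda: (min(words), max(words)), number=1000)
    print(t_sort, t_minmax)   # two O(n) scans vs. one O(n log n) sort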

    You might have realized that the optimization so far was done on the number
    of strings. There is still another dimension to optimize in, and that is the
    actual string comparison.
    Raymond Hettinger suggests using a binary search:

    ---------------------------------------------------------------------------
    def f4(m):
        "Given a list of pathnames, returns the longest common leading component"
        if not m: return ''
        a, b = min(m), max(m)
        lo, hi = 0, min(len(a), len(b))
        while lo < hi:
            mid = (lo+hi)//2 + 1
            if a[lo:mid] == b[lo:mid]:
                lo = mid
            else:
                hi = mid - 1
        return a[:hi]
    ----------------------------------------------------------------------------


    To give you some idea of the running times:

    f1          f3          f4          # of strings

    0.131058    0.021471    0.012050      2
    0.214896    0.041648    0.012963      4
    0.401236    0.020444    0.014707      8
    0.841738    0.026415    0.018589     16
    1.670606    0.039348    0.029020     32
    3.184446    0.065657    0.044247     64
    6.257635    0.123510    0.072568    128


    Every calculation was done 200 times.
    Furthermore, the test set consists of only two different strings, so the
    binary-search part of Raymond's solution only comes in as a constant factor.

    Anyway, the fastest solution is up to 100 times faster than the trivial
    one.
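
    For reference, a harness along these lines (reconstructed for illustration;
    the exact script is not part of the post) produces this kind of measurement:

    import time

    def bench(func, n, repeats=200):
        # n strings built from two distinct values sharing a long prefix
        data = ['commonprefix_aaaa', 'commonprefix_bbbb'] * (n // 2)
        start = time.time()
        for _ in range(repeats):
            func(list(data))   # pass a copy, since f3 sorts its argument in place
        return time.time() - start

    # for n in (2, 4, 8, 16, 32, 64, 128):
    #     print(['%.6f' % bench(f, n) for f in (f1, f3, f4)], n)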

    Cheers

    Stephan

  • Terry Reedy

    #2
    Re: Some optimization tale


    "Stephan Diehl" <stephan.diehlN OSPAM@gmx.net> wrote in message
    news:bskcb0$36n $00$1@news.t-online.com...[color=blue]
    > All of this can be found at
    > http://aspn.activestate.com/ASPN/Coo.../Recipe/252177
    >
    > What I was most surprised at was the inefficiency of the trivial solutions
    > (and that the right algorithm indeed makes a difference).

    A common surprise. The science of algorithms (including empirical testing)
    gives real benefits.

    > -- os.path.commonprefix --------------------------------------------------
    >
    > def f1(m):
    >     "Given a list of pathnames, returns the longest common leading component"
    >     if not m: return ''
    >     prefix = m[0]

    prefix = m.pop() # avoids comparing prefix to itself as first item

    >     for item in m:
    >         for i in range(len(prefix)):
    >             if prefix[:i+1] != item[:i+1]:

    I am 99% sure that the above is a typo and performance bug that should read
        if prefix[i:i+1] != item[i:i+1]:
    since all previous chars have been found equal in the previous iteration.

    >                 prefix = prefix[:i]
    >                 if i == 0:
    >                     return ''
    >                 break
    >     return prefix

    Perhaps you could retest this with the two suggested changes. It will be
    somewhat faster.

    > The problem with this algorithm is the copying of all those small strings.

    Since Python does not have pointers, comparing characters within strings
    requires pulling out the chars as separate strings. This is why using
    C-coded comparison functions may win even though more comparisons are done.
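
    To illustrate the point with a toy measurement (an aside added here, not
    part of the original post):

    import timeit

    # Per-character indexing creates a new one-character string object for every
    # comparison; a single slice comparison does the character work in C.
    setup = "a = 'x' * 1000; b = 'x' * 999 + 'y'"
    per_char = timeit.timeit(
        "for i in range(len(b)):\n    if a[i] != b[i]: break",
        setup=setup, number=10000)
    one_compare = timeit.timeit("a[:len(b)] == b", setup=setup, number=10000)
    print(per_char, one_compare)   # the single C-level comparison is far faster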

    The reason f1 uses slicing (of len 1 after my correction) instead of
    indexing is to avoid exceptions when len(item) < len(prefix). However, all
    the +1s have a cost (and I suspect slicing does also), so it may pay to
    truncate prefix to the length of item first. The simplest fix for this
    (untested) gives

    def f9(m): # modified f1 == os.path.commonprefix
        "Given a list of strings, returns the longest common prefix"
        if not m: return ''
        prefix = m.pop()
        for item in m:
            prefix = prefix[:len(item)]
            for i in range(len(prefix)):
                if prefix[i] != item[i]:
                    if not i: return ''
                    prefix = prefix[:i]
                    break
        return prefix

    > This can be easily fixed
    > --- optimized os.path.commonprefix ----------------------------------------
    >
    > def f2(m):
    >     "Given a list of pathnames, returns the longest common leading component"
    >     if not m: return ''
    >     if len(m) == 1: return m[0]
    >     prefix = m[0]
    >     for i in xrange(len(prefix)):
    >         for item in m[1:]:
    >             if prefix[i] != item[i]:
    >                 return prefix[:i]
    >     return prefix[:i]

    and easily messed up;-) If len(item) < len(prefix), item[i] throws an
    exception. For this approach to work, prefix should be set as the shortest
    string of m in a preliminary loop. Also, you reslice m several times. Do it
    once before the outer loop.
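
    A sketch of that repair (constructed for illustration, not posted in the
    thread; written in the thread's Python 2 style) might look like this:

    def f2_fixed(m):
        "Given a list of strings, returns the longest common prefix"
        if not m: return ''
        prefix = m[0]
        for item in m:                 # preliminary loop: start from the shortest string
            if len(item) < len(prefix):
                prefix = item
        for i in xrange(len(prefix)):  # item[i] can no longer be out of range
            for item in m:             # no m[1:] reslicing on every outer pass
                if prefix[i] != item[i]:
                    return prefix[:i]
        return prefix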

    > It is just not necessary to compare all strings in the list.

    Every string has to be compared to something at least once.

    > It is enough to sort the list first
    > and then compare the first and the last element.

    Sorting compares all strings in the list to something at least once, and
    most more than once.

    > Even though the 'sort' algorithm is coded in C and is therefore quite fast,
    > the order of runtime has changed.

    The C part is what makes f3 faster. In your timings, 128 is not large
    enough for the nlogn component to be noticeable.

    > Michael Dyck then pointed out that instead of using 'sort', 'min' and 'max'
    > should be used. While tests suggest that this is true, I have no idea why
    > that should be, since finding a minimum or maximum uses some sorting anyway

    No. max and min each do a linear scan. No sorting. But each does at
    least as many character comparisons as modified f1 or f2. The speedup is
    from looping and comparing in C, even though at least twice as many
    compares are done.
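
    For intuition, min() behaves roughly like this pure-Python equivalent (an
    aside, not part of the original post; Python 2 spelling as in the thread):

    def py_min(seq):
        it = iter(seq)
        best = it.next()               # first element is the running minimum
        for item in it:
            if item < best:            # string '<' compares characters in C
                best = item
        return best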

    > You might have realized that the optimization so far was done on the number
    > of strings. There is still another dimension to optimize in and that is the
    > actual string comparison.
    > Raymond Hettinger suggests using a binary search:

    Since this only affects the final comparison of min and max, and not the n
    comparisons done to calculate each, the effect is minimal and constant,
    independent of number of strings.

    Since this compares slices rather than chars in each loop, I wonder whether
    this is really faster than linear scan anyway. I would like to see timing
    of f5 with min/max of f4 combined with linear scan of f3. (Basically, f3
    with sort removed and min/max added.) Since you changed two parts of f3 to
    get f4, we cannot be sure that both changes are each an improvement even
    though the combination of two is.

    def f5(seq):
        if not seq: return ''
        s1 = min(seq)
        s2 = max(seq)
        n = min(len(s1), len(s2))
        if not n: return '' # not required since s1[0:0] == ''
        for i in xrange(n):
            if s1[i] != s2[i]:
                return s1[0:i]
        return s1[0:n]

    Terry J. Reedy



    • Stephan Diehl

      #3
      Re: Some optimization tale

      Terry Reedy wrote:

      [...]
      >
      > and easily messed up;-) If len(item) < len(prefix), item[i] throws an
      > exception. For this approach to work, prefix should be set as the shortest
      > string of m in a preliminary loop. Also, you reslice m several times. Do
      > it once before the outer loop.
      >
      >> It is just not necessary to compare all strings in the list.
      >
      > Every string has to be compared to something at least once.

      You are right, of course. I have to admit that I've been too sloppy in my
      descriptions (and too sloppy in my thinking).

      [...]

      >> Michael Dyck then pointed out that instead of using 'sort', 'min' and 'max'
      >> should be used. While tests suggest that this is true, I have no idea why
      >> that should be, since finding a minimum or maximum uses some sorting anyway
      >
      > No. max and min each do a linear scan. No sorting. But each does at
      > least as many character comparisons as modified f1 or f2. The speedup is
      > from looping and comparing in C, even though at least twice as many
      > compares are done.

      Makes sense.

      >
      >> You might have realized that the optimization so far was done on the number
      >> of strings. There is still another dimension to optimize in and that is the
      >> actual string comparison.
      >> Raymond Hettinger suggests using a binary search:
      >
      > Since this only affects the final comparison of min and max, and not the n
      > comparisons done to calculate each, the effect is minimal and constant,
      > independent of number of strings.
      >
      > Since this compares slices rather than chars in each loop, I wonder
      > whether
      > this is really faster than linear scan anyway. I would like to see timing
      > of f5 with min/max of f4 combined with linear scan of f3. (Basically, f3
      > with sort removed and min/max added.) Since you changed two parts of f3
      > to get f4, we cannot be sure that both changes are each an improvement
      > even though the combination of two is.
      >
      > def f5(seq):
      >     if not seq: return ''
      >     s1 = min(seq)
      >     s2 = max(seq)
      >     n = min(len(s1), len(s2))
      >     if not n: return '' # not required since s1[0:0] == ''
      >     for i in xrange(n):
      >         if s1[i] != s2[i]:
      >             return s1[0:i]
      >     return s1[0:n]

      Your f5 function runs at virtually the same speed as Raymond's version.
      Even with growing string length, there is no discernible difference.

      Cheers

      Stephan


      • Christos TZOTZIOY Georgiou

        #4
        Re: Some optimization tale

        On Sat, 27 Dec 2003 17:36:46 +0100, rumours say that Stephan Diehl
        <stephan.diehlNOSPAM@gmx.net> might have written:

        >A while ago, I've posted a recipe about finding a common prefix to a list
        >of strings. The recipe itself is quite bad (I have to admit), and I
        >didn't know at the time that this problem was already solved in the
        >os.path module.
        >The interesting part can be found in the comments, as this turned out to
        >be a quest for the most efficient algorithm to solve this particular
        >problem.

        You might also want to read:

        http://www.python.org/sf/681780
        --
        TZOTZIOY, I speak England very best,
        Ils sont fous ces Redmontains! --Harddix


        • Stephan Diehl

          #5
          Re: Some optimization tale

          Christos TZOTZIOY Georgiou wrote:

          > On Sat, 27 Dec 2003 17:36:46 +0100, rumours say that Stephan Diehl
          > <stephan.diehlNOSPAM@gmx.net> might have written:
          >
          [...]
          >
          > You might also want to read:
          >
          > http://www.python.org/sf/681780

          Terry's solution is much faster (at least on my test set) and, as an
          additional benefit, is the easiest to understand.



          • Bengt Richter

            #6
            Re: Some optimization tale

            On Sat, 27 Dec 2003 15:23:34 -0500, "Terry Reedy" <tjreedy@udel.edu> wrote:
            [...]

            I wonder about this version for speed (not very tested ;-):

            >>> def f10(m):
            ...     "return longest common prefix of strings in seq"
            ...     if not m: return ''
            ...     prefix = m.pop()
            ...     ssw = str.startswith
            ...     for item in m:
            ...         while not ssw(item, prefix): prefix = prefix[:-1]
            ...         if not prefix: return ''
            ...     return prefix
            ...
            >>> f10('abcd abcd'.split())
            'abcd'
            >>> f10('abcd abce'.split())
            'abc'
            >>> f10('abcd abcd a'.split())
            'a'
            >>> f10('abcd abcd a x'.split())
            ''

            Regards,
            Bengt Richter


            • Christos TZOTZIOY Georgiou

              #7
              Re: Some optimization tale

              On Mon, 29 Dec 2003 17:08:39 +0100, rumours say that Stephan Diehl
              <stephan.diehlNOSPAM@gmx.net> might have written:

              >> You might also want to read:
              >>
              >> http://www.python.org/sf/681780
              >
              >Terry's solution is much faster (at least on my test set) and, as an
              >additional benefit, is the easiest to understand.

              Yep, I believe clarity is essential in the library (and that is why my
              patch was obviously not accepted :) Actually, IIRC (it's been a long time
              since), that code never compares prefixes that have already been found
              equal more than once, and it does not create slices (it used buffer() and
              then switched to startswith() for future compatibility); that is why it's
              a little complicated. The main loop runs math.ceil(math.log(N, 2)) times,
              where N is min([len(x) for x in argument_list]).
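
              For illustration only, here is a simplified reconstruction of that idea
              (this is not the actual SF patch, and unlike the patch it re-verifies the
              already-known common part on each pass): binary-search the prefix length
              and test candidates with str.startswith().

              def common_prefix_bisect(strings):
                  "Return the longest common prefix of a list of strings."
                  if not strings: return ''
                  first = strings[0]             # the answer is a prefix of any one string
                  lo, hi = 0, len(first)         # the answer's length lies in [lo, hi]
                  while lo < hi:
                      mid = (lo + hi + 1) // 2   # candidate prefix length, rounded up
                      candidate = first[:mid]
                      if all(s.startswith(candidate) for s in strings):
                          lo = mid               # still common: try a longer prefix
                      else:
                          hi = mid - 1           # too long: shrink the range
                  return first[:lo]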

              Anyway, perhaps Terry or you should update the SF patch so that
              os.path.commonprefix becomes faster... the point is to benefit the whole
              Python community, right? :)
              --
              TZOTZIOY, I speak England very best,
              Ils sont fous ces Redmontains! --Harddix


              • Stephan Diehl

                #8
                Re: Some optimization tale

                Christos TZOTZIOY Georgiou wrote:

                [...]
                > Anyway, perhaps Terry or you should update the SF patch so that
                > os.path.commonprefix becomes faster... the point is to benefit the whole
                > Python community, right? :)

                In principle, yes :-)
                I'm not too sure if commonprefix is used that often, and in a way that
                would really be worth the effort of patching the existing code. (I guess
                if there had been a real need, it would have been done a long time ago.)
                Another thing is that the speed improvement would be largely due to using
                C-implemented functions instead of pure Python code (o.k., the lines are
                blurry here). In order to understand what's going on, one needs to know
                what kind of functions are C-implemented and why, in that particular
                case, it is a good idea to use them.

