optimization pointers?

Collapse
This topic is closed.
X
X
 
  • Time
  • Show
Clear All
new posts
  • r.e.s.

    optimization pointers?

    Would anyone care to offer pointers on how the following code
    might be optimized? Or alternatives? ...

    ---
    def lz(string):
    """ Return the Lempel-Ziv complexity (LZ78) of string. """

    vocabulary = []
    word = ''
    for char in string:
    word += char
    if word not in vocabulary:
    vocabulary.appe nd(word)
    word = ''
    return len(vocabulary)
    ---

    On my 2 GHz system running PythonWin, the above code averages
    about 1.3 hrs to process a 1 MB string. I'd hoped to compute
    LZ78 for multi-GB files, but it seems infeasible. (?)

    Thanks for any feedback.
    --
    r.e.s.
  • Mel Wilson

    #2
    Re: optimization pointers?

    In article <Bz5Cb.7881$_r6 .5116@newsread1 .news.pas.earth link.net>,
    "r.e.s." <r.s@XXmindspri ng.com> wrote:[color=blue]
    >Would anyone care to offer pointers on how the following code
    >might be optimized? Or alternatives? ...
    >
    >---
    >def lz(string):
    > """ Return the Lempel-Ziv complexity (LZ78) of string. """
    >
    > vocabulary = []
    > word = ''
    > for char in string:
    > word += char
    > if word not in vocabulary:
    > vocabulary.appe nd(word)
    > word = ''
    > return len(vocabulary)
    >---
    >
    >On my 2 GHz system running PythonWin, the above code averages
    >about 1.3 hrs to process a 1 MB string. I'd hoped to compute
    >LZ78 for multi-GB files, but it seems infeasible. (?)[/color]

    One simple improvement is to get rid of

    if an_item not in a_list

    which does a linear search over the list. A dictionary is faster.

    Without a full benchmark, it seems that appending


    def lz_complexity (s):
    voc = {}
    word = ''
    for c in s:
    word += c
    if not voc.get (word, False):
    voc[word] = True
    word = ''
    return len (voc.keys())

    if __name__ == '__main__':
    import time
    text = file ('lzc.py', 'rt').read()
    t0 = time.clock()
    t0 = time.clock() - t0

    c = lz (text)
    t1 = time.clock()
    c = lz (text)
    t1 = time.clock() - t1

    c = lz_complexity (text)
    t2 = time.clock()
    c = lz_complexity (text)
    t2 = time.clock() - t2

    print 'T1:', t1 - t0
    print 'T2:', t2 - t0


    will show about a x6 speedup.

    Avoiding

    a_string += another_string

    is a traditional speedup, but it seems you'll be needing the
    string form for every character processed, so it may not buy
    much in this case.

    Regards. Mel.

    Comment

    • Anton Vredegoor

      #3
      Re: optimization pointers?

      "r.e.s." <r.s@XXmindspri ng.com> wrote:
      [color=blue]
      >Would anyone care to offer pointers on how the following code
      >might be optimized? Or alternatives? ...
      >
      >---
      >def lz(string):
      > """ Return the Lempel-Ziv complexity (LZ78) of string. """
      >
      > vocabulary = []
      > word = ''
      > for char in string:
      > word += char
      > if word not in vocabulary:
      > vocabulary.appe nd(word)
      > word = ''
      > return len(vocabulary)
      >---
      >
      >On my 2 GHz system running PythonWin, the above code averages
      >about 1.3 hrs to process a 1 MB string. I'd hoped to compute
      >LZ78 for multi-GB files, but it seems infeasible. (?)[/color]

      On a 300 mhz celeron the script below takes about ten or twenty
      seconds, the original lz is commented out as it never seems to finish
      for longer strings.

      Anton

      p.s. the file I used is here:


      import sets

      def lz(string):
      """ Return the Lempel-Ziv complexity (LZ78) of string. """

      vocabulary = []
      word = ''
      for char in string:
      word += char
      if word not in vocabulary:
      vocabulary.appe nd(word)
      word = ''
      return len(vocabulary)

      def lzx(string):
      """ Return the Lempel-Ziv complexity (LZ78) of string. """

      vocabulary = sets.Set()
      word = ''
      for char in string:
      word += char
      if word not in vocabulary:
      vocabulary.add( word)
      word = ''
      return len(vocabulary)

      def test():
      fn = '1donq10.txt'
      s = file(fn,'rb').r ead(1000000)
      print lzx(s)
      #print lz(s)


      if __name__=='__ma in__':
      test()





      Comment

      • r.e.s.

        #4
        Re: optimization pointers?

        "Mel Wilson" <mwilson@the-wire.com> wrote ...
        [color=blue]
        > One simple improvement is to get rid of
        >
        > if an_item not in a_list
        >
        > which does a linear search over the list.
        > A dictionary is faster.
        >
        > Without a full benchmark, it seems that[/color]
        [...][color=blue]
        > will show about a x6 speedup.[/color]

        I really appreciate the suggestions -- both yours
        and Anton's.

        After making the changes you indicated, the code
        runs in about 1/500th the time of the original
        (i.e. about 2 sec per MB for strings in RAM). The
        sets idea also speeds things up tremendously, but
        not as much -- it takes about 70% longer than the
        dictionary approach.

        Thanks again.
        --
        r.e.s.



        Comment

        • Anthony McDonald

          #5
          Re: optimization pointers?

          "Mel Wilson" <mwilson@the-wire.com> wrote in message
          news:rzO2/ks/KXcX089yn@the-wire.com...[color=blue]
          >
          > One simple improvement is to get rid of
          >
          > if an_item not in a_list
          >
          > which does a linear search over the list. A dictionary is faster.
          >
          > Without a full benchmark, it seems that appending
          >
          >
          > def lz_complexity (s):
          > voc = {}
          > word = ''
          > for c in s:
          > word += c
          > if not voc.get (word, False):
          > voc[word] = True
          > word = ''
          > return len (voc.keys())
          >
          > if __name__ == '__main__':
          > import time
          > text = file ('lzc.py', 'rt').read()
          > t0 = time.clock()
          > t0 = time.clock() - t0
          >
          > c = lz (text)
          > t1 = time.clock()
          > c = lz (text)
          > t1 = time.clock() - t1
          >
          > c = lz_complexity (text)
          > t2 = time.clock()
          > c = lz_complexity (text)
          > t2 = time.clock() - t2
          >
          > print 'T1:', t1 - t0
          > print 'T2:', t2 - t0
          >
          >
          > will show about a x6 speedup.
          >
          > Avoiding
          >
          > a_string += another_string
          >
          > is a traditional speedup, but it seems you'll be needing the
          > string form for every character processed, so it may not buy
          > much in this case.
          >
          > Regards. Mel.[/color]

          Heres my contribution, removing the string append in favour of slices. Buys
          a little speed thanks to xrange.

          def lz_comp_mine(te xt):
          voc = {}
          st = 0
          for cur in xrange(1,len(te xt)+1):
          if not voc.get(text[st:cur], False):
          voc[text[st:cur]] = True
          st = cur
          return len(voc)

          Anthony McDonald


          Comment

          • Mel Wilson

            #6
            Re: optimization pointers?

            In article <3fd9cdf1$0$697 0$7a628cd7@news .club-internet.fr>,
            "Anthony McDonald" <tonym1972(at)c lub-internet(in)fr> wrote:[color=blue]
            >Heres my contribution, removing the string append in favour of slices. Buys
            >a little speed thanks to xrange.
            >
            >def lz_comp_mine(te xt):
            > voc = {}
            > st = 0
            > for cur in xrange(1,len(te xt)+1):
            > if not voc.get(text[st:cur], False):
            > voc[text[st:cur]] = True
            > st = cur
            > return len(voc)[/color]

            Nice touch! I tried slices and took a huge performance
            hit (almost 3x the list version) but I didn't use `for ... in
            xrange ...`. It must have all been in the while-loop test
            and index incrementing.

            Regards. Mel.

            Comment

            • Anthony McDonald

              #7
              Re: optimization pointers?

              "Mel Wilson" <mwilson@the-wire.com> wrote in message
              news:xIf2/ks/Kvyd089yn@the-wire.com...[color=blue]
              >
              > Nice touch! I tried slices and took a huge performance
              > hit (almost 3x the list version) but I didn't use `for ... in
              > xrange ...`. It must have all been in the while-loop test
              > and index incrementing.
              >
              > Regards. Mel.[/color]

              Thanks.

              My original attempt which incidentally used range was about half a second
              slower than yours with 700K of test data. Just as I was about to close the
              editor window I noticed your return statement.

              return len (voc.keys())

              Which creates a list of keys to then apply len to, but thats an unneeded
              step as len applied directly to a dictionary returns the number of keys. I
              made the change and gained 7 hundreds of a second. Not much, still behind
              yours, but it suggested that chainging range to xrange and avoiding the list
              creation might help. Viola!

              The interesting thing that benchmarking with test data shows, is that the
              difference in speed between our 2 routines is about 0.05secs per 700K
              processed. That difference remains constant at least upto 80Mb of input
              data, implying that slices are no more efficent than string appending, and
              may actually in this case be less efficent.

              Anthony McDonald


              Comment

              • Anton Vredegoor

                #8
                Re: optimization pointers?

                "Anthony McDonald" <tonym1972(at)c lub-internet(in)fr> wrote:
                [color=blue]
                >return len (voc.keys())
                >
                >Which creates a list of keys to then apply len to, but thats an unneeded
                >step as len applied directly to a dictionary returns the number of keys.[/color]

                Below are two more small optimization possibilities. Originally I
                wasn't going to post them because they are way into micro optimization
                country and because the setdefault solution is beautiful but blond. At
                least on some of my machines however they are both faster than the
                solutions offered so far.

                Anton

                def lzx(s):
                word,D = '',{}
                Dget, Dset = D.get, D.__setitem__
                for c in s:
                word += c
                if not Dget(word):
                Dset(word,True)
                word = ''
                return len(D)

                def lzy(s):
                j,D = 0,{}
                func = D.setdefault
                for i in xrange(1,len(s) +1):
                if func(s[j:i],j) is j: j = i
                return len(D)

                Comment

                Working...