need help on generator...

  • Craig Ringer

    #16
    Re: need help on need help on generator...

    On Sat, 2005-01-22 at 10:10 +0100, Alex Martelli wrote:
    > The answer for the current implementation, BTW, is "in between" -- some
    > buffering, but bounded consumption of memory -- but whether that tidbit
    > of pragmatics is part of the file specs, heh, that's anything but clear
    > (just as for other important tidbits of Python pragmatics, such as the
    > facts that list.sort is wickedly fast, 'x in alist' isn't, 'x in adict'
    > IS...).

    A particularly great example when it comes to unexpected buffering
    effects is the file iterator. Take code that reads a header from a file
    using an (implicit) iterator, then tries to read() the rest of the file.
    Taking the example of reading an RFC822-like message into a list of
    headers and a body blob:

    >>> inpath = '/tmp/msg.eml'
    >>> infile = open(inpath)
    >>> headers = []
    >>> for line in infile:
    ...     if not line.strip():
    ...         break
    ...     headers.append(tuple(line.split(':', 1)))
    >>> body = infile.read()


    (By the way, if you ever implement this yourself for real, you should
    probably be hurt - use the 'email' or 'rfc822' modules instead. For one
    thing, reinventing the wheel is rarely a good idea. For another, the
    above code is horrid - in particular it doesn't handle malformed headers
    at all, isn't big on readability/comments, etc.)

    If you run the above code on a saved email message, you'd expect 'body'
    to contain the body of the message, right? Nope. The iterator created
    from the file when you use it in that for loop does internal read-ahead
    for efficiency, and has already read in the entire file or at least a
    chunk more of it than you've read out of the iterator. It doesn't
    attempt to hide this from the programmer, so the file position marker is
    further into the file (possibly at the end on a smaller file) than you'd
    expect given the data you've actually read in your program.
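    Here's a minimal sketch of the effect (the path is hypothetical, and the
    exact tell() value will vary with buffer size and platform):

    >>> infile = open('/tmp/msg.eml')
    >>> first = infile.next()      # read just one line via the iterator
    >>> len(first), infile.tell()  # tell() is typically far past len(first)
    (40, 8192)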

    I'd be interested to know if there's a better solution to this than:

    >>> inpath = '/tmp/msg.eml'
    >>> infile = open(inpath)
    >>> initer = iter(infile)
    >>> headers = []
    >>> for line in initer:
    ...     if not line.strip():
    ...         break
    ...     headers.append(tuple(line.split(':', 1)))
    >>> data = ''.join(x for x in initer)

    because that seems like a pretty ugly hack (and please ignore the
    variable names). Perhaps a way to get the file to seek back to the point
    last read from the iterator when the iterator is destroyed?

    --
    Craig Ringer


    • Francis Girard

      #17
      Re: need help on need help on generator...

      On Saturday 22 January 2005 at 10:10, Alex Martelli wrote:
      > Francis Girard <francis.girard@free.fr> wrote:
      > ...
      > > But besides the fact that generators are either produced with the new
      > > "yield" reserved word or by defining the __new__ method in a class
      > > definition, I don't know much about them.
      >
      > Having __new__ in a class definition has nothing much to do with
      > generators; it has to do with how the class is instantiated when you
      > call it. Perhaps you mean 'next' (and __iter__)? That makes instances
      > of the class iterators, just like iterators are what you get when you
      > call a generator.
      >

      Yes, I meant "next".

      > > In particular, I don't know what Python constructs does generate a
      > > generator.
      >
      > A 'def' of a function whose body uses 'yield', and in 2.4 the new genexp
      > construct.
      >

      Ok. I guess I'll have to update to version 2.4 (from 2.3) to follow the
      discussion.
      > > I know this is now the case for reading lines in a file or with the new
      > > "iterator" package.
      >
      > Nope, besides the fact that the module you're thinking of is named
      > 'itertools': itertools uses a lot of C-coded special types, which are
      > iterators but not generators. Similarly, a file object is an iterator
      > but not a generator.
      >
      > > But what else ?
      >
      > Since you appear to conflate generators and iterators, I guess the iter
      > built-in function is the main one you missed. iter(x), for any x,
      > either raises an exception (if x's type is not iterable) or else returns
      > an iterator.
      >

      You're absolutely right, I take the one for the other and vice versa. If I
      understand correctly, a "generator" produces something over which you can
      iterate with the help of an "iterator". Can you iterate (in the strict sense
      of an "iterator") over something not generated by a "generator" ?

      > > Does Craig Ringer answer mean that list
      > > comprehensions are lazy ?
      >
      > Nope, those were generator expressions.
      >
      > > Where can I find a comprehensive list of all the
      > > lazy constructions built in Python ?
      >
      > That's yet a different question -- at least one needs to add the
      > built-in xrange, which is neither an iterator nor a generator but IS
      > lazy (a historical artefact, admittedly).
      >
      > But fortunately Python's built-ins are not all THAT many, so that's
      > about it.
      >
      > > (I think that to easily distinguish lazy
      > > from strict constructs is an absolute programmer need -- otherwise you
      > > always end up wondering when is it that code is actually executed like in
      > > Haskell).
      >
      > Encapsulation doesn't let you "easily distinguish" issues of
      > implementation. For example, the fact that a file is an iterator (its
      > items being its lines) doesn't tell you if that's internally implemented
      > in a lazy or eager way -- it tells you that you can code afile.next() to
      > get the next line, or "for line in afile:" to loop over them, but does
      > not tell you whether the code for the file object is reading each line
      > just when you ask for it, or whether it reads all lines before and just
      > keeps some state about the next one, or somewhere in between.
      >

      You're right. I was really talking (mistakenly) about lazy evaluation of
      the arguments to a function (i.e. the function begins execution before its
      arguments get evaluated) -- in such a case I think it should be specified
      which arguments are "strict" and which are "lazy" -- but I don't think
      there's such a thing in Python (... well, not yet, as Python gets more and
      more akin to FP).
      > The answer for the current implementation, BTW, is "in between" -- some
      > buffering, but bounded consumption of memory -- but whether that tidbit
      > of pragmatics is part of the file specs, heh, that's anything but clear
      > (just as for other important tidbits of Python pragmatics, such as the
      > facts that list.sort is wickedly fast, 'x in alist' isn't, 'x in adict'
      > IS...).
      >
      >
      > Alex

      Thank you

      Francis Girard
      FRANCE


      • Craig Ringer

        #18
        Re: need help on need help on generator...

        On Sat, 2005-01-22 at 17:46 +0800, I wrote:
        > I'd be interested to know if there's a better solution to this than:
        >
        > >>> inpath = '/tmp/msg.eml'
        > >>> infile = open(inpath)
        > >>> initer = iter(infile)
        > >>> headers = []
        > >>> for line in initer:
        > ...     if not line.strip():
        > ...         break
        > ...     headers.append(tuple(line.split(':', 1)))
        > >>> data = ''.join(x for x in initer)
        >
        > because that seems like a pretty ugly hack (and please ignore the
        > variable names). Perhaps a way to get the file to seek back to the point
        > last read from the iterator when the iterator is destroyed?

        And now, answering my own question (sorry):

        Answer: http://tinyurl.com/6avdt

        so we can slightly simplify the above to:

        >>> inpath = '/tmp/msg.eml'
        >>> infile = open(inpath)
        >>> headers = []
        >>> for line in infile:
        ...     if not line.strip():
        ...         break
        ...     headers.append(tuple(line.split(':', 1)))
        >>> data = ''.join(x for x in infile)

        at the cost of making it less clear what's going on, and of having someone
        later go "duh, why isn't he using read() here instead" -- but I can't seem
        to do much more than that.

        Might it be worth providing a way to have file objects seek back to the
        current position of the iterator when read() etc are called? If not, I
        favour the suggestion in the referenced post - file should probably fail
        noisily, or at least emit a warning.
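        In the meantime, here's a sketch of a user-level wrapper with that
        fail-noisily behaviour (the class name and the choice of IOError are my
        own, and only read() is guarded):

        class StrictFile(object):
            """Sketch: refuse read() after next() until a seek() re-syncs."""
            def __init__(self, f):
                self._f = f
                self._iterated = False
            def __iter__(self):
                return self
            def next(self):
                self._iterated = True
                return self._f.next()
            def seek(self, *args):
                self._iterated = False
                return self._f.seek(*args)
            def read(self, *args):
                if self._iterated:
                    raise IOError("read() after next() without an intervening seek()")
                return self._f.read(*args)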

        What are others' thoughts on this?

        --
        Craig Ringer


        • Alex Martelli

          #19
          Re: need help on need help on generator...

          Francis Girard <francis.girard@free.fr> wrote:
          ...
          > > A 'def' of a function whose body uses 'yield', and in 2.4 the new genexp
          > > construct.
          >
          > Ok. I guess I'll have to update to version 2.4 (from 2.3) to follow the
          > discussion.

          It's worth upgrading even just for the extra speed;-).

          > > Since you appear to conflate generators and iterators, I guess the iter
          > > built-in function is the main one you missed. iter(x), for any x,
          > > either raises an exception (if x's type is not iterable) or else returns
          > > an iterator.
          >
          > You're absolutely right, I take the one for the other and vice versa. If I
          > understand correctly, a "generator" produces something over which you can
          > iterate with the help of an "iterator". Can you iterate (in the strict sense
          > of an "iterator") over something not generated by a "generator" ?

          A generator function (commonly known as a generator), each time you call
          it, produces a generator object AKA a generator-iterator. To wit:
          >>> def f(): yield 23
          ...
          >>> f
          <function f at 0x75fe70>
          >>> x = f()
          >>> x
          <generator object at 0x75aa58>
          >>> type(x)
          <type 'generator'>

          A generator expression (genexp) also has a result which is a generator
          object:
          >>> x = (23 for __ in [0])
          >>> type(x)
          <type 'generator'>


          Iterators need not be generator-iterators, by any means. Generally, the
          way to make sure you have an iterator is to call iter(...) on something;
          if the something was already an iterator, NP, then iter's idempotent:
          >>> iter(x) is x
          True


          That's what "an iterator" means: some object x such that x.next is
          callable without arguments and iter(x) is x.
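          For instance, a quick duck-typing check along those lines (just a
          sketch; the name is_iterator is mine, purely for illustration):

          def is_iterator(x):
              # an iterator has a callable .next and is its own iter()
              try:
                  return callable(getattr(x, 'next', None)) and iter(x) is x
              except TypeError:  # not even iterable
                  return False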

          Since iter(x) tries calling type(x).__iter__(x) [[slight simplification
          here by ignoring custom metaclasses, see recent discussion on python-dev
          as to why this is only 99% accurate, not 100% accurate]], one way to
          code an iterator is as a class. For example:

          class Repeater(object):
              def __iter__(self): return self
              def next(self): return 23

          Any instance of Repeater is an iterator which, as it happens, has just
          the same behavior as itertools.repeat(23), which is also the same
          behavior you get from iterators obtained by calling:

          def generepeat():
              while True: yield 23

          In other words, after:

          a = Repeater()
          b = itertools.repeat(23)
          c = generepeat()

          the behavior of a, b and c is indistinguishable, though you can easily
          tell them apart by introspection -- type(a) != type(b) != type(c).
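          A quick sketch of that equivalence (assuming the Repeater and
          generepeat definitions above):

          from itertools import islice, repeat

          print list(islice(Repeater(), 3))    # [23, 23, 23]
          print list(islice(repeat(23), 3))    # [23, 23, 23]
          print list(islice(generepeat(), 3))  # [23, 23, 23]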


          Python's penchant for duck typing -- behavior matters more, WAY more
          than implementation details such as type() -- means we tend to consider
          a, b and c fully equivalent. Focusing on ``generator'' is at this level
          an implementation detail.


          Most often, iterators (including generator-iterators) are used in a for
          statement (or equivalently a for clause of a listcomp or genexp), which
          is why one normally doesn't think about built-in ``iter'': it's called
          automatically by these ``for'' syntax-forms. In other words,

          for x in <<<whatever>>>:
              ...body...

          is just like:

          __tempiter = iter(<<<whatever>>>)
          while True:
              try: x = __tempiter.next()
              except StopIteration: break
              ...body...

          ((Simplification alert: the ``for'' statement has an optional ``else''
          which this allegedly "just like" form doesn't mimic exactly...))
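          For completeness, a sketch of the expansion that also mimics the
          ``else'' clause (same pseudocode style as above):

          __tempiter = iter(<<<whatever>>>)
          while True:
              try: x = __tempiter.next()
              except StopIteration:
                  ...else-suite...   # runs only if the loop was never broken
                  break
              ...body...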

          > You're right. I was really talking (mistakenly) about lazy evaluation of
          > the arguments to a function (i.e. the function begins execution before its
          > arguments get evaluated) -- in such a case I think it should be specified
          > which arguments are "strict" and which are "lazy" -- but I don't think
          > there's such a thing in Python (... well, not yet, as Python gets more
          > and more akin to FP).

          Python's strict that way. To explicitly make some one argument "lazy",
          sorta, you can put a "lambda:" in front of it at call time, but then you
          have to "call the argument" to get it evaluated; a bit of a kludge.
          There's a PEP out to allow a ``prefix :'' to mean just the same as this
          "lambda:", but even though I co-authored it I don't think it lowers the
          kludge quotient by all that much.
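          A sketch of that "lambda:" kludge (the function and argument names here
          are mine, purely for illustration):

          def pick(use_it, arg):
              # 'arg' is a callable standing in for a lazy argument
              if use_it:
                  return arg()  # evaluated only here, on demand
              return None

          pick(False, lambda: sum(xrange(10**7)))  # the sum is never computed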

          Guido, our beloved BDFL, is currently musing about optional typing of
          arguments, which might perhaps open a tiny little crack towards letting
          some arguments be lazy. I don't think Guido wants to go there, though.

          My prediction is that even Python 3000 will be strict. At least this
          makes some things obvious at each call-site without having to study the
          way a function is defined, e.g., upon seeing
          f(a+b, c*d)
          you don't have to wonder, or study the ``def f'', to find out when the
          addition and the multiplication happen -- they happen before f's body
          gets a chance to run, and thus, in particular, if either operation
          raises an exception, there's nothing f can do about it.

          And that's a misunderstanding I _have_ seen repeatedly even in people
          with a pretty good overall grasp of Python, evidenced in code such as:

              self.assertRaises(SomeError, f(23))

          with astonishment that -- if f(23) does indeed raise SomeError -- this
          exception propagates, NOT caught by assertRaises; and if mistakenly
          f(23) does NOT raise, you typically get a TypeError about None not being
          callable. The way to do the above call, _since Python is strict_, is:

              self.assertRaises(SomeError, f, 23)

          i.e. give assertRaises the function (or other callable) and arguments to
          pass to it -- THIS way, assertRaises performs the call under its own
          control (inside the try clause of a try/except statement) and can and
          does catch and report things appropriately.
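          A minimal runnable sketch of the difference (the test names are mine):

          import unittest

          def f(x):
              raise ValueError(x)

          class T(unittest.TestCase):
              def test_right(self):
                  # assertRaises calls f(23) itself, inside its own try/except
                  self.assertRaises(ValueError, f, 23)
              def test_wrong(self):
                  # f(23) raises here, BEFORE assertRaises ever runs,
                  # so this test errors out instead of passing
                  self.assertRaises(ValueError, f(23))

          if __name__ == '__main__':
              unittest.main()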

          The frequency of this misunderstanding is high enough to prove to me
          that strictness is not FULLY understood by some "intermediate level"
          Pythonistas. However, I doubt the right solution is to complicate
          Python with the ability to have some arguments be strict, and others
          lazy, much as sometimes one might yearn for it. _Maybe_ one could have
          some form that makes ALL arguments lazy, presenting them as an iterator
          to the function itself. But even then the form _should_, I suspect, be
          obvious at the call site, rather than visible only in the "def"...



          Alex


          • Alex Martelli

            #20
            Re: need help on need help on generator...

            Craig Ringer <craig@postnewspapers.com.au> wrote:
            > >>> data = ''.join(x for x in infile)

            Maybe ''.join(infile) is a better way to express this functionality?
            Avoids 2.4 dependency and should be faster as well as more concise.

            > Might it be worth providing a way to have file objects seek back to the
            > current position of the iterator when read() etc are called? If not, I

            It's certainly worth doing a patch and see what the python-dev crowd
            thinks of it, I think; it might make it into 2.5.
            > favour the suggestion in the referenced post - file should probably fail
            > noisily, or at least emit a warning.

            Under what conditions, exactly, would you want such an exception?


            Alex


            • Craig Ringer

              #21
              Re: need help on need help on generator...

              On Sat, 2005-01-22 at 12:20 +0100, Alex Martelli wrote:
              > Craig Ringer <craig@postnewspapers.com.au> wrote:
              >
              > > >>> data = ''.join(x for x in infile)
              >
              > Maybe ''.join(infile) is a better way to express this functionality?
              > Avoids 2.4 dependency and should be faster as well as more concise.

              Thanks - for some reason I hadn't clicked to that. Obvious in hindsight,
              but I just completely missed it.
              > > Might it be worth providing a way to have file objects seek back to the
              > > current position of the iterator when read() etc are called? If not, I
              >
              > It's certainly worth doing a patch and see what the python-dev crowd
              > thinks of it, I think; it might make it into 2.5.

              I'll certainly look into doing so. I'm not dumb enough to say "Sure,
              I'll do that" before looking into the code involved and thinking more
              about what issues could pop up. Still, it's been added to my
              increasingly frightening TODO list.
              > > favour the suggestion in the referenced post - file should probably fail
              > > noisily, or at least emit a warning.
              >
              > Under what conditions, exactly, would you want such an exception?

              When read() or other methods suffering from the same issue are called
              after next() without an intervening seek(). It'd mean another state flag
              for the file to keep track of - but I doubt it'd make any detectable
              difference in performance given that there's disk I/O involved.

              I'd be happier to change the behaviour so that a warning isn't
              necessary, though, and I suspect it can be done without introducing
              backward compatibility issues. Well, so long as nobody is relying on the
              undefined file position after using an iterator - and I'm not dumb
              enough to say nobody would ever do that.

              I've really got myself into hot water now though - I'm going to have to
              read over the `file' source code before impulsively saying anything
              REALLY stupid.

              --
              Craig Ringer


              • Francis Girard

                #22
                Classical FP problem in python : Hamming problem

                Hi,

                First,

                My deepest thanks to Craig Ringer, Alex Martelli, Nick Coghlan and Terry Reedy
                for having generously answered on the "Need help on need help on generator"
                thread. I'm compiling the answers to sketch myself a global picture about
                iterators, generators, iterator-generators and laziness in Python.

                In the meantime, I couldn't resist testing the new Python features for
                laziness on a classical FP problem, i.e. the "Hamming" problem.

                The name of the game is to produce the sequence of integers satisfying the
                following rules :

                (i) The list is in ascending order, without duplicates
                (ii) The list begins with the number 1
                (iii) If the list contains the number x, then it also contains the numbers
                2*x, 3*x, and 5*x
                (iv) The list contains no other numbers.

                The algorithm in FP is quite elegant. Simply suppose that the infinite
                sequence is already produced, then merge the three sequences (2*x, 3*x and
                5*x for each x in the sequence we supposed already produced); this is O(n)
                complexity for n numbers.

                I simply love those algorithms that run after their tails.

                In Haskell, the algorithm translates as follows:

                -- BEGIN SNAP
                -- hamming.hs

                -- Merges two infinite lists
                merge :: (Ord a) => [a] -> [a] -> [a]
                merge (x:xs) (y:ys)
                    | x == y    = x : merge xs ys
                    | x < y     = x : merge xs (y:ys)
                    | otherwise = y : merge (x:xs) ys

                -- Lazily produce the hamming sequence
                hamming :: [Integer]
                hamming
                    = 1 : merge (map (2*) hamming) (merge (map (3*) hamming) (map (5*) hamming))
                -- END SNAP


                In Python, I figured out this implementation :

                -- BEGIN SNAP
                import sys
                from itertools import imap

                ## Merges two infinite lists
                def imerge(xs, ys):
                    x = xs.next()
                    y = ys.next()
                    while True:
                        if x == y:
                            yield x
                            x = xs.next()
                            y = ys.next()
                        elif x < y:
                            yield x
                            x = xs.next()
                        else: # if y < x:
                            yield y
                            y = ys.next()

                ## Lazily produce the hamming sequence
                def hamming():
                    yield 1 ## Initialize the machine
                    for n in imerge(imap(lambda h: 2*h, hamming()),
                                   imerge(imap(lambda h: 3*h, hamming()),
                                          imap(lambda h: 5*h, hamming()))):
                        yield n
                    print "Falling out -- We should never get here !!"

                for n in hamming():
                    sys.stderr.write("%s " % str(n)) ## stderr for unbuffered output
                -- END SNAP


                My goal is not to compare Haskell with Python on a classical FP problem, which
                would be genuine stupidity.

                Nevertheless, while the Haskell version prints the Hamming sequence for as
                long as I can stand it, and with very little memory consumption, the Python
                version only prints :
                >>> hamming.py
                1 2 3 4 5 6 8 9 10 12 15 16 18 20 24 25 27 30 32 36 40 45 48 50 54 60 64 72 75
                80 81 90 96 100 108 120 125 128 135 144 150 160 162 180 192 200 216 225 240
                243 250 256 270 288 300 320 324 360 375 384 400 405 432 450 480 486 500 512
                540 576 600 625 640 648 675 720 729 750 768 800 810 864 900 960 972 1000 1024
                1080 1125 1152 1200 1215 1250 1280 1296 1350 1440 1458 1500 1536 1600 1620
                1728 1800 1875 1920 1944 2000 2025 2048 2160 2187 2250 2304 2400 2430 2500
                2560 2592 2700 2880 2916 3000 3072 3125 3200 3240 3375 3456 3600 3645 3750
                3840 3888 4000 4050 4096 4320 4374 4500 4608 4800 4860 5000 5120 5184 5400
                5625 5760 5832 6000 6075 6144 6250 6400 6480 6561 6750 6912 7200 7290 7500
                7680 7776 8000 8100 8192 8640 8748 9000 9216 9375 9600 9720 10000 10125 10240
                10368 10800 10935 11250 11520 11664 12000 12150 12288 12500 12800 12960 13122
                13500 13824 14400 14580 15000 15360 15552 15625 16000 16200 16384 16875 17280
                17496 18000 18225 18432 18750 19200 19440 19683 20000 20250 20480 20736 21600
                21870 22500 23040 23328 24000 24300 24576 25000 25600 25920 26244 27000 27648
                28125 28800 29160 30000 30375 30720 31104 31250 32000 32400 32768 32805 33750
                34560 34992 36000 36450 36864 37500 38400 38880 39366 40000 40500 40960 41472
                43200 43740 45000 46080 46656 46875 48000 48600 49152 50000 50625 51200 51840
                52488 54000 54675 55296 56250 57600
                Processus arrêté ("process killed")

                After 57600, my machine begins swapping like crazy and I have to kill the
                python process.

                I think I should not have this kind of behaviour, even using recursion, since
                I'm only using lazy constructs all the time. At least, I would expect the
                program to produce many more results before surrendering.

                What's going on ?

                Thank you

                Francis Girard
                FRANCE


                • Jeff Epler

                  #23
                  Re: Classical FP problem in python : Hamming problem

                  Your formulation in Python is recursive (hamming calls hamming()) and I
                  think that's why your program gives up fairly early.

                  Instead, use itertools.tee() [new in Python 2.4, or search the internet
                  for an implementation that works in 2.3] to give a copy of the output
                  sequence to each "multiply by N" function as well as one to be the
                  return value.

                  Here's my implementation, which matched your list early on but
                  effortlessly reached larger values. One large value it printed was
                  6412351813189632 (a Python long) which indeed has only the distinct
                  prime factors 2 and 3. (2**43 * 3**6)

                  Jeff

                  from itertools import tee
                  import sys

                  def imerge(xs, ys):
                      x = xs.next()
                      y = ys.next()
                      while True:
                          if x == y:
                              yield x
                              x = xs.next()
                              y = ys.next()
                          elif x < y:
                              yield x
                              x = xs.next()
                          else:
                              yield y
                              y = ys.next()

                  def hamming():
                      def _hamming(j, k):
                          yield 1
                          hamming = generators[j]
                          for i in hamming:
                              yield i * k
                      generators = []
                      generator = imerge(imerge(_hamming(0, 2), _hamming(1, 3)), _hamming(2, 5))
                      generators[:] = tee(generator, 4)
                      return generators[3]

                  for i in hamming():
                      print i,
                      sys.stdout.flush()



                  • Tim Peters

                    #24
                    Re: Classical FP problem in python : Hamming problem

                    [Francis Girard]
                    > ...
                    > In the meantime, I couldn't resist testing the new Python features for
                    > laziness on a classical FP problem, i.e. the "Hamming" problem.
                    ...
                    > Nevertheless, while the Haskell version prints the Hamming sequence for as
                    > long as I can stand it, and with very little memory consumption, the Python
                    > version only prints :
                    ...
                    > I think I should not have this kind of behaviour, even using recursion, since
                    > I'm only using lazy constructs all the time. At least, I would expect the
                    > program to produce many more results before surrendering.
                    >
                    > What's going on ?

                    You can find an explanation in Lib/test/test_generators.py, which uses
                    this problem as an example; you can also find one efficient way to
                    write it there.


                    • Francis Girard

                      #25
                      Re: Classical FP problem in python : Hamming problem

                      Ok, I think that the bottom line is this :

                      For all the algorithms that run after their tail in an FP way, like the
                      Hamming problem or the Fibonacci sequence (but unlike the Sieve of
                      Eratosthenes -- there's a subtle difference), i.e. all those algorithms that
                      typically rely upon recursion to get the beginning of the generated elements
                      in order to generate the next one ...

                      ... so for this family of algorithms, we have, somehow, to abandon recursion,
                      and to explicitly maintain one or many lists of what has been generated.

                      One solution for this is suggested in "test_generators.py". But I think that
                      this solution is evil as it looks like recursion is used but it is not, and it
                      depends heavily on how the m235 function (for example) is invoked.
                      Furthermore, it is _NOT_ memory efficient as claimed : it leaks ! It
                      internally maintains a list of generated results that grows forever, while it
                      is useless to maintain results that have been "consumed". Just for a gross
                      estimate, on my system, percentage of memory usage for this program grows
                      rapidly, reaching 21.6 % after 5 minutes. CPU usage varies between 31%-36%.
                      Ugly and inefficient.

                      The solution of Jeff Epler is far more elegant. The "itertools.tee" function
                      does just what we want. It internally maintains a list of what has been
                      generated and deletes the "consumed" elements as the algorithm unrolls. To
                      follow with my gross estimate, memory usage grows from 1.2% to 1.9% after 5
                      minutes (probably only affected by the growing size of the huge integers).
                      CPU usage varies between 27%-29%.
                      Beautiful and efficient.
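                      A tiny sketch of what tee gives you -- two independent views of one
                      underlying iterator (islice is used here just to take a few items):

                      from itertools import tee, islice

                      a, b = tee(iter(xrange(10)))
                      print list(islice(a, 3))  # [0, 1, 2]
                      print list(islice(b, 5))  # [0, 1, 2, 3, 4] -- b is unaffected by a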

                      You might think that we shouldn't be that fussy about memory usage when
                      generating 100-digit numbers, but we're talking about a whole family of
                      useful FP algorithms ; and the best way to implement them should be
                      established.

                      For this family of algorithms, itertools.tee is the way to go.

                      I think that the semi-tutorial in "test_generators.py" should be updated to
                      use "tee". Or, at least, a severe warning comment should be written.

                      Thank you,

                      Francis Girard
                      FRANCE



                      • Francis Girard

                        #26
                        Re: Classical FP problem in python : Hamming problem

                        The following implementation is even more telling, as it makes self-evident
                        and almost mechanical how to translate algorithms that run after their tail
                        from recursion to "tee" usage :

                        *** BEGIN SNAP
                        from itertools import tee, imap
                        import sys

                        def imerge(xs, ys):
                            x = xs.next()
                            y = ys.next()
                            while True:
                                if x == y:
                                    yield x
                                    x = xs.next()
                                    y = ys.next()
                                elif x < y:
                                    yield x
                                    x = xs.next()
                                else:
                                    yield y
                                    y = ys.next()

                        def hamming():
                            def _hamming():
                                yield 1
                                hamming2 = hammingGenerators[0]
                                hamming3 = hammingGenerators[1]
                                hamming5 = hammingGenerators[2]
                                for n in imerge(imap(lambda h: 2*h, iter(hamming2)),
                                               imerge(imap(lambda h: 3*h, iter(hamming3)),
                                                      imap(lambda h: 5*h, iter(hamming5)))):
                                    yield n
                            hammingGenerators = tee(_hamming(), 4)
                            return hammingGenerators[3]

                        for i in hamming():
                            print i,
                            sys.stdout.flush()
                        *** END SNAP

                        Here's an implementation of the Fibonacci sequence that uses "tee" :

                        *** BEGIN SNAP
                        from itertools import tee
                        import sys

                        def fib():
                            def _fib():
                                yield 1
                                yield 1
                                curGen = fibGenerators[0]
                                curAheadGen = fibGenerators[1]
                                curAheadGen.next()
                                while True:
                                    yield curGen.next() + curAheadGen.next()
                            fibGenerators = tee(_fib(), 3)
                            return fibGenerators[2]

                        for n in fib():
                            print n,
                            sys.stdout.flush()
                        *** END SNAP

                        Francis Girard
                        FRANCE




                        • Tim Peters

                          #27
                          Re: Classical FP problem in python : Hamming problem

                          [Francis Girard]
                          > For all the algorithms that run after their tail in an FP way, like the
                          > Hamming problem or the Fibonacci sequence (but unlike the Sieve of
                          > Eratosthenes -- there's a subtle difference), i.e. all those algorithms that
                          > typically rely upon recursion to get the beginning of the generated elements
                          > in order to generate the next one ...
                          >
                          > ... so for this family of algorithms, we have, somehow, to abandon recursion,
                          > and to explicitly maintain one or many lists of what has been generated.
                          >
                          > One solution for this is suggested in "test_generators.py". But I think that
                          > this solution is evil as it looks like recursion is used but it is not, and it
                          > depends heavily on how the m235 function (for example) is invoked.

                          Well, yes -- "Heh. Here's one way to get a shared list, complete with
                          an excruciating namespace renaming trick" was intended to warn you in
                          advance that it wasn't pretty <wink>.
                          > Furthermore, it is _NOT_ memory efficient as claimed : it leaks !

                          Yes. But there are two solutions to the problem in that file, and the
                          second one is in fact extremely memory-efficient compared to the first
                          one. "Efficiency" here was intended in a relative sense.
                          > It internally maintains a list of generated results that grows forever,
                          > while it is useless to maintain results that have been "consumed". Just
                          > for a gross estimate, on my system, percentage of memory usage for this
                          > program grows rapidly, reaching 21.6 % after 5 minutes. CPU usage varies
                          > between 31%-36%. Ugly and inefficient.

                          Try the first solution in the file for a better understanding of what
                          "inefficient" means <wink>.
                          > The solution of Jeff Epler is far more elegant. The "itertools.tee"
                          > function does just what we want. It internally maintains a list of what
                          > has been generated and deletes the "consumed" elements as the algorithm
                          > unrolls. To follow with my gross estimate, memory usage grows from 1.2%
                          > to 1.9% after 5 minutes (probably only affected by the growing size of
                          > the huge integers). CPU usage varies between 27%-29%.
                          > Beautiful and efficient.

                          Yes, it is better. tee() didn't exist when generators (and
                          test_generators.py) were written, so of course nothing in the test
                          file uses them.
                          > You might think that we shouldn't be that fussy about memory usage when
                          > generating 100-digit numbers, but we're talking about a whole family of
                          > useful FP algorithms ; and the best way to implement them should be
                          > established.

                          Possibly -- there really aren't many Pythonistas who care about this.
                          > For this family of algorithms, itertools.tee is the way to go.
                          >
                          > I think that the semi-tutorial in "test_generators.py" should be updated to
                          > use "tee". Or, at least, a severe warning comment should be written.

                          Please submit a patch. The purpose of that file is to test
                          generators, so you should add a third way of doing it, not replace the
                          two ways already there. It should also contain prose explaining why
                          the third way is better (just as there's prose now explaining why the
                          second way is better than the first).


                          • Francis Girard

                            #28
                            Re: Classical FP problem in python : Hamming problem

                            Ok I'll submit the patch with the prose pretty soon.
                            Thank you
                            Francis Girard
                            FRANCE



                            • Michael Spencer

                              #29
                              Re: Classical FP problem in python : Hamming problem

                              Francis Girard wrote:
                              > The following implementation is even more telling, as it makes self-evident
                              > and almost mechanical how to translate algorithms that run after their tail
                              > from recursion to "tee" usage :
                              >

                              Thanks, Francis and Jeff for raising a fascinating topic. I've enjoyed trying
                              to get my head around both the algorithm and your non-recursive implementation.

                              Here's a version of your implementation that uses a helper class to make the
                              algorithm itself prettier.

                              from itertools import tee, imap

                              def hamming():
                                  def _hamming():
                                      yield 1
                                      for n in imerge(2 * hamming, imerge(3 * hamming, 5 * hamming)):
                                          yield n
                                  hamming = Tee(_hamming())
                                  return iter(hamming)

                              class Tee(object):
                                  """Provides an independent iterator (using tee) on every
                                  iteration request. Also implements lazy iterator arithmetic."""
                                  def __init__(self, iterator):
                                      self.iter = tee(iterator, 1)[0]
                                  def __iter__(self):
                                      return self.iter.__copy__()
                                  def __mul__(self, number):
                                      return imap(lambda x: x * number, self.__iter__())
                                  # needed so that "2 * hamming" (int on the left) works too
                                  __rmul__ = __mul__

                              def imerge(xs, ys):
                                  x = xs.next()
                                  y = ys.next()
                                  while True:
                                      if x == y:
                                          yield x
                                          x = xs.next()
                                          y = ys.next()
                                      elif x < y:
                                          yield x
                                          x = xs.next()
                                      else: # if y < x:
                                          yield y
                                          y = ys.next()
                              >>> hg = hamming()
                              >>> for i in range(10000):
                              ...     n = hg.next()
                              ...     if i % 1000 == 0: print i, n
                              ...
                              0 1
                              1000 51840000
                              2000 8100000000
                              3000 279936000000
                              4000 4707158941350
                              5000 50960793600000
                              6000 409600000000000
                              7000 2638827906662400
                              8000 14332723200000000
                              9000 68024448000000000


                              Regards

                              Michael


                              • Nick Craig-Wood

                                #30
                                Re: Classical FP problem in python : Hamming problem

                                Francis Girard <francis.girard@free.fr> wrote:
                                > def hamming():
                                >     def _hamming():
                                >         yield 1
                                >         hamming2 = hammingGenerators[0]
                                >         hamming3 = hammingGenerators[1]
                                >         hamming5 = hammingGenerators[2]
                                >         for n in imerge(imap(lambda h: 2*h, iter(hamming2)),
                                >                        imerge(imap(lambda h: 3*h, iter(hamming3)),
                                >                               imap(lambda h: 5*h, iter(hamming5)))):
                                >             yield n
                                >     hammingGenerators = tee(_hamming(), 4)
                                >     return hammingGenerators[3]

                                If you are after readability, you might prefer this...

                                def hamming():
                                    def _hamming():
                                        yield 1
                                        for n in imerge(imap(lambda h: 2*h, iter(hamming2)),
                                                       imerge(imap(lambda h: 3*h, iter(hamming3)),
                                                              imap(lambda h: 5*h, iter(hamming5)))):
                                            yield n
                                    hamming2, hamming3, hamming5, result = tee(_hamming(), 4)
                                    return result
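                                For instance, taking the first few values (a sketch, assuming the
                                imerge and imap definitions quoted above are in scope):

                                from itertools import islice
                                print list(islice(hamming(), 15))
                                # -> [1, 2, 3, 4, 5, 6, 8, 9, 10, 12, 15, 16, 18, 20, 24]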

                                PS interesting thread - never heard of Hamming sequences before!
                                --
                                Nick Craig-Wood <nick@craig-wood.com> -- http://www.craig-wood.com/nick

