Re: dict generator question

  • Gerard flanagan

    Re: dict generator question

    Simon Mullis wrote:
    Hi,
    >
    Let's say I have an arbitrary list of minor software versions of an
    imaginary software product:
    >
    l = [ "1.1.1.1", "1.2.2.2", "1.2.2.3", "1.3.1.2", "1.3.4.5"]
    >
    I'd like to create a dict with major_version : count.
    >
    (So, in this case:
    >
    dict_of_counts = { "1.1" : "1",
    "1.2" : "2",
    "1.3" : "2" }
    >
    [...]
    data = [ "1.1.1.1", "1.2.2.2", "1.2.2.3", "1.3.1.2", "1.3.4.5"]

    from itertools import groupby

    datadict = \
    dict((k, len(list(g))) for k,g in groupby(data, lambda s: s[:3]))
    print datadict




  • George Sakkis

    #2
    Re: dict generator question

    On Sep 18, 11:43 am, Gerard flanagan <grflana...@gmail.com> wrote:
    Simon Mullis wrote:
    Hi,
    >
    Let's say I have an arbitrary list of minor software versions of an
    imaginary software product:
    >
    l = [ "1.1.1.1", "1.2.2.2", "1.2.2.3", "1.3.1.2", "1.3.4.5"]
    >
    I'd like to create a dict with major_version : count.
    >
    (So, in this case:
    >
    dict_of_counts = { "1.1" : "1",
    "1.2" : "2",
    "1.3" : "2" }
    >
    [...]
    data = [ "1.1.1.1", "1.2.2.2", "1.2.2.3", "1.3.1.2", "1.3.4.5"]
    >
    from itertools import groupby
    >
    datadict = \
    dict((k, len(list(g))) for k,g in groupby(data, lambda s: s[:3]))
    print datadict
    Note that this works correctly only if the versions are already sorted
    by major version.

    George
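To make the caveat above concrete, here is a small sketch (written for Python 3, so `print` is a function; the unsorted list and the helper name `count_by_prefix` are invented for illustration). `groupby` only merges adjacent items with equal keys, so on unsorted input the same key can show up in several runs, and later runs overwrite earlier counts in the dict.

```python
from itertools import groupby

def count_by_prefix(versions):
    """Count versions per 3-character prefix, as in the code quoted above."""
    return dict((k, len(list(g))) for k, g in groupby(versions, lambda s: s[:3]))

# Hypothetical unsorted input: the two "1.2" entries are not adjacent.
unsorted_data = ["1.2.2.2", "1.1.1.1", "1.2.2.3", "1.3.1.2", "1.3.4.5"]

wrong = count_by_prefix(unsorted_data)          # "1.2" is seen as two runs of 1
right = count_by_prefix(sorted(unsorted_data))  # sorting makes equal keys adjacent

print(wrong)
print(right)
```

With the unsorted list the count for "1.2" comes out as 1 instead of 2, because the second run replaces the first.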


    • Gerard flanagan

      #3
      Re: dict generator question

      George Sakkis wrote:
      On Sep 18, 11:43 am, Gerard flanagan <grflana...@gmail.com> wrote:
      >Simon Mullis wrote:
      >>Hi,
      >>Let's say I have an arbitrary list of minor software versions of an
      >>imaginary software product:
      >>l = [ "1.1.1.1", "1.2.2.2", "1.2.2.3", "1.3.1.2", "1.3.4.5"]
      >>I'd like to create a dict with major_version : count.
      >>(So, in this case:
      >>dict_of_counts = { "1.1" : "1",
      >> "1.2" : "2",
      >> "1.3" : "2" }
      >[...]
      >data = [ "1.1.1.1", "1.2.2.2", "1.2.2.3", "1.3.1.2", "1.3.4.5"]
      >>
      >from itertools import groupby
      >>
      >datadict = \
      > dict((k, len(list(g))) for k,g in groupby(data, lambda s: s[:3]))
      >print datadict
      >
      Note that this works correctly only if the versions are already sorted
      by major version.
      >
      Yes, I should have mentioned it. Here's a fuller example below. There
      may be better ways of sorting version numbers, but this is what I do.


      data = [ "1.2.2.2", "1.2.2.3", "1.3.1.2", "1.1.1.1", "1.3.14.5",
               "1.3.21.6" ]

      from itertools import groupby
      import re

      RXBUILDSORT = re.compile(r'\d+|[a-zA-Z]')

      def versionsort(s):
          key = []
          for part in RXBUILDSORT.findall(s.lower()):
              try:
                  key.append(int(part))
              except ValueError:
                  key.append(ord(part))
          return tuple(key)

      data.sort(key=versionsort)
      print data

      datadict = \
          dict((k, len(list(g))) for k,g in groupby(data, lambda s: s[:3]))
      print datadict
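One thing to watch with the `s[:3]` grouping key is that it assumes single-digit major and minor components. A hedged sketch of an alternative for Python 3 follows; the helper names `version_key` and `major_version` and the sample data are illustrative, and it assumes every component is purely numeric.

```python
from itertools import groupby

def version_key(version):
    """Sort key: '1.10.2.3' -> (1, 10, 2, 3). Assumes purely numeric parts."""
    return tuple(int(part) for part in version.split("."))

def major_version(version):
    """Grouping key: '1.10.2.3' -> '1.10'."""
    return ".".join(version.split(".", 2)[:2])

# Hypothetical data with a two-digit component, which s[:3] would cut to "1.1".
data = ["1.10.0.1", "1.2.2.2", "1.1.1.1", "1.2.2.3", "1.10.3.4"]
data.sort(key=version_key)

counts = {k: len(list(g)) for k, g in groupby(data, major_version)}
print(counts)
```

Sorting on the integer tuple keeps "1.10" after "1.2", and grouping on the split-and-join key keeps "1.1" and "1.10" apart.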





      • bearophileHUGS@lycos.com

        #4
        Re: dict generator question

        Gerard flanagan:
        data.sort()
        datadict = \
            dict((k, len(list(g))) for k,g in groupby(data, lambda s:
                '.'.join(s.split('.',2)[:2])))
        That code may run correctly, but it's quite unreadable, while good
        Python programmers value high readability. So the right thing to do is
        to split that line into parts, giving meaningful names, and maybe even
        add comments.

        len(list(g)) looks like a good job for my little leniter() function
        (or better, just an extension to the semantics of len), which some
        people here judged useless a while ago, while I use it often in both
        Python and D ;-)

        Bye,
        bearophile
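As a rough illustration of the readability advice above, the one-liner could be split along these lines (Python 3; the names `major_version`, `count_items` and `counts_by_major` are made up for this sketch and are not from the thread):

```python
from itertools import groupby

def major_version(version):
    """Return the first two dotted components, e.g. '1.2.2.3' -> '1.2'."""
    major, minor = version.split(".", 2)[:2]
    return major + "." + minor

def count_items(iterable):
    """Count the items of an iterable without building a list."""
    return sum(1 for _ in iterable)

data = ["1.1.1.1", "1.2.2.2", "1.2.2.3", "1.3.1.2", "1.3.4.5"]
data.sort()  # groupby needs equal keys to be adjacent

counts_by_major = {}
for major, group in groupby(data, key=major_version):
    counts_by_major[major] = count_items(group)

print(counts_by_major)
```

Each step now has a name, at the cost of a few more lines.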


        • MRAB

          #5
          Re: dict generator question

          On Sep 19, 2:01 pm, bearophileH...@lycos.com wrote:
          Gerard flanagan:
          >
          data.sort()
          datadict = \
          dict((k, len(list(g))) for k,g in groupby(data, lambda s:
               '.'.join(s.split('.',2)[:2])))
          >
          That code may run correctly, but it's quite unreadable, while good
          Python programmers value high readability. So the right thing to do is
          to split that line into parts, giving meaningful names, and maybe even
          add comments.
          >
          len(list(g))) looks like a good job for my little leniter() function
          (or better just an extension to the semantics of len) that time ago
          some people here have judged as useless, while I use it often in both
          Python and D ;-)
          >
          Extending len() to support iterables sounds like a good idea, except
          that it could be misleading when:

          len(file(path))

          returns the number of lines and /not/ the length in bytes as you might
          first think! :-)
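A small sketch of the ambiguity described above, for Python 3 (where `file()` no longer exists and `open()` is used instead). The temporary file and its contents are invented for the example.

```python
import os
import tempfile

# Write a small throwaway file: 3 lines, 12 bytes in total.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False, newline="") as tmp:
    tmp.write("ab\ncdef\nghi\n")
    path = tmp.name

with open(path) as handle:
    line_count = sum(1 for _ in handle)   # what an iterating len() would report

byte_count = os.path.getsize(path)        # what "length of a file" often means

os.remove(path)
print(line_count, byte_count)
```

The two numbers differ, which is the point: counting by iteration gives lines, not bytes.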

          Anyway, here's another possible implementation using bags (multisets):

          def major_version(version_string):
              "convert '1.2.3.2' to '1.2'"
              return '.'.join(version_string.split('.')[:2])

          versions = ["1.1.1.1", "1.2.2.2", "1.2.2.3", "1.3.1.2", "1.3.4.5"]

          bag_of_versions = bag(major_version(x) for x in versions)
          dict_of_counts = dict(bag_of_versions.items())

          Here's my implementation of the bag class in Python (sorry about the
          length):

          class bag(object):
              def __init__(self, iterable = None):
                  self._counts = {}
                  if isinstance(iterable, dict):
                      for x, n in iterable.items():
                          if not isinstance(n, int):
                              raise TypeError()
                          if n < 0:
                              raise ValueError()
                          self._counts[x] = n
                  elif iterable:
                      for x in iterable:
                          try:
                              self._counts[x] += 1
                          except KeyError:
                              self._counts[x] = 1
              def __and__(self, other):
                  new_counts = {}
                  for x, n in other._counts.items():
                      try:
                          new_counts[x] = min(self._counts[x], n)
                      except KeyError:
                          pass
                  result = bag()
                  result._counts = new_counts
                  return result
              def __iand__(self, other):
                  # takes 'other' and returns self, as in-place operators must
                  new_counts = {}
                  for x, n in other._counts.items():
                      try:
                          new_counts[x] = min(self._counts[x], n)
                      except KeyError:
                          pass
                  self._counts = new_counts
                  return self
              def __or__(self, other):
                  new_counts = self._counts.copy()
                  for x, n in other._counts.items():
                      try:
                          new_counts[x] = max(new_counts[x], n)
                      except KeyError:
                          new_counts[x] = n
                  result = bag()
                  result._counts = new_counts
                  return result
              def __ior__(self, other):
                  for x, n in other._counts.items():
                      try:
                          self._counts[x] = max(self._counts[x], n)
                      except KeyError:
                          self._counts[x] = n
                  return self
              def __len__(self):
                  return sum(self._counts.values())
              def __list__(self):
                  # not a special method Python calls; list(b) uses __iter__
                  result = []
                  for x, n in self._counts.items():
                      result.extend([x] * n)
                  return result
              def __repr__(self):
                  return "bag([%s])" % ", ".join(", ".join([repr(x)] * n) for x,
                      n in self._counts.items())
              def __iter__(self):
                  for x, n in self._counts.items():
                      for i in range(n):
                          yield x
              def keys(self):
                  return self._counts.keys()
              def values(self):
                  return self._counts.values()
              def items(self):
                  return self._counts.items()
              def __add__(self, other):
                  # NOTE: overridden by the second __add__ defined further down
                  for x, n in other.items():
                      self._counts[x] = self._counts.get(x, 0) + n
              def __contains__(self, x):
                  return x in self._counts
              def add(self, x):
                  try:
                      self._counts[x] += 1
                  except KeyError:
                      self._counts[x] = 1
              def __add__(self, other):
                  new_counts = self._counts.copy()
                  for x, n in other.items():
                      try:
                          new_counts[x] += n
                      except KeyError:
                          new_counts[x] = n
                  result = bag()
                  result._counts = new_counts
                  return result
              def __sub__(self, other):
                  new_counts = self._counts.copy()
                  for x, n in other.items():
                      try:
                          new_counts[x] -= n
                          if new_counts[x] < 1:
                              del new_counts[x]
                      except KeyError:
                          pass
                  result = bag()
                  result._counts = new_counts
                  return result
              def __iadd__(self, other):
                  for x, n in other.items():
                      try:
                          self._counts[x] += n
                      except KeyError:
                          self._counts[x] = n
                  return self
              def __isub__(self, other):
                  for x, n in other.items():
                      try:
                          self._counts[x] -= n
                          if self._counts[x] < 1:
                              del self._counts[x]
                      except KeyError:
                          pass
                  return self
              def clear(self):
                  self._counts = {}
              def count(self, x):
                  return self._counts.get(x, 0)
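For readers on a later Python: the standard library has since gained `collections.Counter` (Python 2.7 and 3.1 onward), which covers much the same ground as the bag class. A minimal sketch of the same counting with it follows; the extra `Counter` operands in the last two lines are made-up values to show the multiset operators.

```python
from collections import Counter

def major_version(version_string):
    "convert '1.2.3.2' to '1.2'"
    return '.'.join(version_string.split('.')[:2])

versions = ["1.1.1.1", "1.2.2.2", "1.2.2.3", "1.3.1.2", "1.3.4.5"]

bag_of_versions = Counter(major_version(x) for x in versions)
dict_of_counts = dict(bag_of_versions)

# Counter also supports multiset-style operators, much like the bag class.
combined = bag_of_versions + Counter({"1.1": 2})           # counts are added
common = bag_of_versions & Counter({"1.2": 1, "1.9": 4})   # minimum of counts

print(dict_of_counts)
```

This is not a drop-in replacement for every method of the class above, only a sketch of the overlapping part.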


          • Steven D'Aprano

            #6
            Re: dict generator question

            On Fri, 19 Sep 2008 17:00:56 -0700, MRAB wrote:
            Extending len() to support iterables sounds like a good idea, except
            that it could be misleading when:
            >
            len(file(path))
            >
            returns the number of lines and /not/ the length in bytes as you might
            first think!
            Extending len() to support iterables sounds like a good idea, except that
            it's not.

            Here are two iterables:


            def yes(): # like the Unix yes command
                while True:
                    yield "y"

            def rand(total):
                "Return random numbers up to a given total."
                from random import random
                tot = 0.0
                while tot < total:
                    x = random()
                    yield x
                    tot += x


            What should len(yes()) and len(rand(100)) return?



            --
            Steven


            • bearophileHUGS@lycos.com

              #7
              Re: dict generator question

              MRAB:
              except that it could be misleading when:
              len(file(path))
              returns the number of lines and /not/ the length in bytes as you might
              first think! :-)
              Well, file(...) returns an iterable of lines, so its len is the number
              of lines :-)
              I think I am able to always remember this fact.

              Anyway, here's another possible implementation using bags (multisets):
              This function looks safer/faster:

              def major_version(version_string):
                  "convert '1.2.3.2' to '1.2'"
                  return '.'.join(version_string.strip().split('.', 2)[:2])

              Another version:

              import re
              from collections import defaultdict

              patt = re.compile(r"^(\d+\.\d+)")

              dict_of_counts = defaultdict(int)
              for ver in versions:
                  dict_of_counts[patt.match(ver).group(1)] += 1

              print dict_of_counts

              Bye,
              bearophile


              • Miles

                #8
                Re: dict generator question

                On Fri, Sep 19, 2008 at 9:51 PM, Steven D'Aprano
                <steve@remove-this-cybersource.com.au> wrote:
                Extending len() to support iterables sounds like a good idea, except that
                it's not.
                >
                Here are two iterables:
                >
                >
                def yes(): # like the Unix yes command
                    while True:
                        yield "y"
                >
                def rand(total):
                    "Return random numbers up to a given total."
                    from random import random
                    tot = 0.0
                    while tot < total:
                        x = random()
                        yield x
                        tot += x
                >
                >
                What should len(yes()) and len(rand(100)) return?
                Clearly, len(yes()) would never return, and len(rand(100)) would
                return a random integer not less than 101.

                -Miles
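The answer above can be checked empirically with a short Python 3 sketch. `islice` is used so that `yes()` can be sampled without looping forever, and counting is done by iteration since `len()` does not accept generators. The sample size of 20 runs is arbitrary.

```python
from itertools import islice
from random import random

def yes():  # like the Unix yes command
    while True:
        yield "y"

def rand(total):
    "Return random numbers up to a given total."
    tot = 0.0
    while tot < total:
        x = random()
        yield x
        tot += x

# yes() never ends, so only a bounded slice of it can be inspected.
first_five = list(islice(yes(), 5))

# Each value is below 1.0, so at least 101 of them are needed to reach 100.
lengths = [sum(1 for _ in rand(100)) for _ in range(20)]

print(first_five, min(lengths), max(lengths))
```

The counts vary from run to run, but none of them should fall below 101.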


                • bearophileHUGS@lycos.com

                  #9
                  Re: dict generator question

                  Steven D'Aprano:
                  >Extending len() to support iterables sounds like a good idea, except that it's not.<
                  The Python language has lately shifted toward more and more use of
                  lazy iterables (see range being lazy by default, etc.). So they are now quite
                  common. So extending len() to make it act like leniter() too is a way
                  to adapt a basic Python construct to the changes of the other parts of
                  the language.

                  In languages like Haskell you can count how many items a lazy sequence
                  has. But those sequences are generally immutable, so they can be
                  accessed many times, so len(iterable) doesn't exhaust them like in
                  Python. So in Python it's less useful.


                  This is a common situation where I care only about the length of the
                  group g:
                  [leniter(g) for h,g in groupby(iterable)]

                  There are other situations where I may be interested only in how many
                  items there are:
                  leniter(ifilter(predicate, iterable))
                  leniter(el for el in iterable if predicate(el))

                  For my usage I have written a version of the itertools module in D (a
                  lot of work, but the result is quite useful and flexible, even if I
                  miss the generator/iterator syntax a lot), and later I have written a
                  len() able to count the length of lazy iterables too (if the given
                  variable has a length attribute/property then it returns that value),
                  and I have found that it's useful often enough (almost as often as
                  string.xsplit()). But in Python there is less need for a len() that
                  counts lazy iterables too because you can use the following syntax
                  that isn't bad (and isn't available in D):

                  [sum(1 for x in g) for h,g in groupby(iterable)]
                  sum(1 for x in ifilter(predicate, iterable))
                  sum(1 for el in iterable if predicate(el))

                  So you and Python designers may choose to not extend the semantics of
                  len() for various good reasons, but you will have a hard time
                  convincing me it's a useless capability :-)

                  Bye,
                  bearophile
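For anyone trying the `sum(1 for ...)` idiom today, here is a sketch of the three forms in Python 3, where `itertools.ifilter` is gone and the built-in `filter` is lazy. The `numbers` list and the `is_even` predicate are placeholders chosen for the example.

```python
from itertools import groupby

def is_even(n):
    """Example predicate (an assumption for this sketch)."""
    return n % 2 == 0

numbers = [1, 1, 2, 2, 2, 3, 4, 4]

# Length of each run of equal adjacent values.
run_lengths = [sum(1 for _ in g) for _, g in groupby(numbers)]

# Two ways of counting the items that satisfy the predicate.
even_count_filter = sum(1 for _ in filter(is_even, numbers))
even_count_genexp = sum(1 for n in numbers if is_even(n))

print(run_lengths, even_count_filter, even_count_genexp)
```

None of the three builds an intermediate list.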


                  • Steven D'Aprano

                    #10
                    Re: dict generator question

                    On Mon, 22 Sep 2008 04:21:12 -0700, bearophileHUGS wrote:
                    Steven D'Aprano:
                    >
                    >>Extending len() to support iterables sounds like a good idea, except
                    >>that it's not.<
                    >
                    Python language lately has shifted toward more and more usage of lazy
                    iterables (see range lazy by default, etc). So they are now quite
                    common. So extending len() to make it act like leniter() too is a way to
                    adapt a basic Python construct to the changes of the other parts of the
                    language.
                    I'm sorry, I don't recognise leniter(). Did I miss something?

                    In languages like Haskell you can count how many items a lazy sequence
                    has. But those sequences are generally immutable, so they can be
                    accessed many times, so len(iterable) doesn't exhaust them like in
                    Python. So in Python it's less useful.
                    In Python, xrange() is a lazy sequence that isn't exhausted, but that's a
                    special case: it actually has a __len__ method, and presumably the length
                    is calculated from the xrange arguments, not by generating all the items
                    and counting them. How would you count the number of items in a generic
                    lazy sequence without actually generating the items first?
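The xrange point can be seen directly in Python 3, where `range` plays the role of `xrange`. This sketch contrasts it with a generator, which has no `__len__`; the bounds are arbitrary.

```python
# A lazy range knows its length from start/stop/step, without iterating.
big = range(0, 10**9, 3)
big_length = len(big)

# A generator over the same values does not define __len__ at all.
gen = (n for n in big)
gen_has_len = hasattr(gen, "__len__")

try:
    len(gen)
    len_failed = False
except TypeError:
    len_failed = True

print(big_length, gen_has_len, len_failed)
```

`len(big)` returns immediately, while `len(gen)` raises `TypeError` rather than walking the generator.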

                    This is a common situation where I can only care of the len of the g
                    group:
                    [leniter(g) for h,g in groupby(iterable)]
                    >
                    There are other situations where I may be interested only in how many
                    items there are:
                    leniter(ifilter(predicate, iterable))
                    leniter(el for el in iterable if predicate(el))
                    >
                    For my usage I have written a version of the itertools module in D (a
                    lot of work, but the result is quite useful and flexible, even if I miss
                    the generator/iterator syntax a lot), and later I have written a len()
                    able to count the length of lazy iterables too (if the given variable
                    has a length attribute/property then it returns that value),
                    I'm not saying that no iterables can accurately predict how many items
                    they will produce. If they can, then len() should support iterables with
                    a __len__ attribute. But in general there's no way of predicting how many
                    items the iterable will produce without iterating over it, and len()
                    shouldn't do that.
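Later Python versions (3.4 onward) added `operator.length_hint`, which sits somewhere between the two positions here: it reports a length when the object can estimate one cheaply and falls back to a default otherwise, without consuming the iterable. A sketch, offered as a side note rather than as anything proposed in this thread:

```python
from operator import length_hint

items = [10, 20, 30]

# A list iterator can report how many items it has left.
it = iter(items)
hint_before = length_hint(it)
next(it)
hint_after = length_hint(it)

# A generator cannot predict its length, so the default (0) is returned.
gen = (x for x in items)
hint_gen = length_hint(gen)

# The generator was not consumed by asking for the hint.
remaining = list(gen)

print(hint_before, hint_after, hint_gen, remaining)
```

The hint is only an estimate by contract, so it is not a substitute for an exact count.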

                    and I have
                    found that it's useful often enough (almost as the string.xsplit()). But
                    in Python there is less need for a len() that counts lazy iterables too
                    because you can use the following syntax that isn't bad (and isn't
                    available in D):
                    >
                    [sum(1 for x in g) for h,g in groupby(iterable)]
                    sum(1 for x in ifilter(predicate, iterable))
                    sum(1 for el in iterable if predicate(el))
                    I think the idiom sum(1 for item in iterable) is, in general, a mistake.
                    For starters, it doesn't work for arbitrary iterables, only sequences
                    (lazy or otherwise) and your choice of variable name may fool people into
                    thinking they can pass a use-once iterator to your code and have it work.
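One reading of this concern is that counting by iteration uses up a one-shot iterator, so nothing is left for the code that follows. A minimal Python 3 sketch of that effect, with a made-up list:

```python
numbers = [3, 1, 4, 1, 5]

# Counting a one-shot iterator consumes it.
it = iter(numbers)
count = sum(1 for _ in it)
leftover = list(it)          # nothing remains after counting

# A sequence such as a list can be counted and then iterated again.
count_again = sum(1 for _ in numbers)
still_there = list(numbers)

print(count, leftover, count_again, still_there)
```

The count itself is correct in both cases; the difference is whether the data survives it.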

                    Secondly, it's not clear what sum(1 for item in iterable) does without
                    reading over it carefully. Since you're generating the entire length
                    anyway, len(list(iterable)) is more readable and almost as efficient for
                    most practical cases.

                    As things stand now, list(iterable) is a "dangerous" operation, as it may
                    consume arbitrarily huge resources. But len() isn't[1], because len()
                    doesn't operate on arbitrary iterables. This is a good thing.

                    So you and Python designers may choose to not extend the semantics of
                    len() for various good reasons, but you will have a hard time convincing
                    me it's a useless capability :-)
                    I didn't say that knowing the length of iterators up front was useless.
                    Sometimes it may be useful, but it is rarely (never?) essential.





                    [1] len(x) may call x.__len__() which might do anything. But the expected
                    semantics of __len__ is that it is expected to return an int, and do it
                    quickly with minimal effort. Methods that do something else are an abuse
                    of __len__ and should be treated as a bug.

                    --
                    Steven


                    • bearophileHUGS@lycos.com

                      #11
                      Re: dict generator question

                      Steven D'Aprano:
                      >I'm sorry, I don't recognise leniter(). Did I miss something?<
                      I have removed the docstring/doctests:

                      def leniter(iterator):
                          if hasattr(iterator, "__len__"):
                              return len(iterator)
                          nelements = 0
                          for _ in iterator:
                              nelements += 1
                          return nelements
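A quick usage sketch of the function above under Python 3. The function body is repeated so the snippet runs on its own, and the inputs are invented.

```python
from itertools import groupby

def leniter(iterator):
    """Return len() when available, otherwise count by iterating."""
    if hasattr(iterator, "__len__"):
        return len(iterator)
    nelements = 0
    for _ in iterator:
        nelements += 1
    return nelements

list_length = leniter([1, 2, 3])                       # uses __len__
gen_length = leniter(ch for ch in "abcd")              # counts by iterating
run_lengths = [leniter(g) for _, g in groupby("aabccc")]

print(list_length, gen_length, run_lengths)
```

Note that the generator passed in is exhausted once it has been counted.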

                      >it doesn't work for arbitrary iterables, only sequences (lazy or otherwise)<
                      I don't quite understand this.

                      >Since you're generating the entire length anyway, len(list(iterable)) is more readable and almost as efficient for most practical cases.<
                      I don't agree: len(list()) creates an actual list, with a lot of GC
                      activity.
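The memory side of this disagreement can be measured roughly with `tracemalloc` (Python 3.4+). This is only a sketch: the helper name `peak_bytes` and the size `N` are arbitrary, it looks at peak allocation rather than running time, and the exact figures depend on the interpreter.

```python
import tracemalloc

def peak_bytes(func):
    """Run func() and return the peak traced memory in bytes."""
    tracemalloc.start()
    func()
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return peak

N = 200000

# Materialise every item in a list, then take its length.
peak_list = peak_bytes(lambda: len(list(x for x in range(N))))

# Count items one at a time without storing them.
peak_sum = peak_bytes(lambda: sum(1 for _ in range(N)))

print(peak_list, peak_sum)
```

The list version has to hold all the items at once, so its peak is much higher; whether that matters in practice depends on how large the iterable is.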

                      >But the expected semantics of __len__ is that it is expected to return an int, and do it quickly with minimal effort. Methods that do something else are an abuse of __len__ and should be treated as a bug.<
                      I see. In the past I have read similar positions in discussions
                      regarding API of data structures in D, so this may be right, and this
                      fault may be enough to kill my proposal. But I'll keep using
                      leniter().

                      Bye,
                      bearophile
