File to dict

  • mrkafk@gmail.com

    File to dict

    Hello everyone,

    I have written this small utility function for transforming a legacy
    file into a Python dict:


    def lookupdmo(domain):
        lines = open('/etc/virtual/domainowners','r').readlines()
        lines = [ [y.lstrip().rstrip() for y in x.split(':')] for x in
                  lines]
        lines = [ x for x in lines if len(x) == 2 ]
        d = dict()
        for line in lines:
            d[line[0]]=line[1]
        return d[domain]

    The /etc/virtual/domainowners file contains colon-separated
    entries:
    domain1.tld: owner1
    domain2.tld: own2
    domain3.another : somebody
    ....

    Now, the above lookupdmo function works. However, it's rather tedious
    to transform files into dicts this way and I have quite a lot of such
    files to transform (like custom 'passwd' files for virtual email
    accounts etc).

    Is there a cleverer, more Pythonic way of parsing files like
    this? Say, I would like to transform a file containing entries like
    the following into a list of lists, with colons treated as
    separators, i.e. this:

    tm:$1$aaaa$bbbb:1010:6::/home/owner1/imap/domain1.tld/tm:/sbin/nologin

    would get transformed into this:

    [ ['tm', '$1$aaaa$bbbb', '1010', '6', '', '/home/owner1/imap/domain1.tld/tm',
    '/sbin/nologin'], [...], [...] ]

  • Chris

    #2
    Re: File to dict

    On Dec 7, 1:31 pm, mrk...@gmail.com wrote:
    > Hello everyone,
    >
    > I have written this small utility function for transforming a legacy
    > file into a Python dict:
    >
    > def lookupdmo(domain):
    >     lines = open('/etc/virtual/domainowners','r').readlines()
    >     lines = [ [y.lstrip().rstrip() for y in x.split(':')] for x in
    >               lines]
    >     lines = [ x for x in lines if len(x) == 2 ]
    >     d = dict()
    >     for line in lines:
    >         d[line[0]]=line[1]
    >     return d[domain]
    >
    > The /etc/virtual/domainowners file contains colon-separated
    > entries:
    > domain1.tld: owner1
    > domain2.tld: own2
    > domain3.another : somebody
    > ...
    >
    > Now, the above lookupdmo function works. However, it's rather tedious
    > to transform files into dicts this way and I have quite a lot of such
    > files to transform (like custom 'passwd' files for virtual email
    > accounts etc).
    >
    > Is there a cleverer, more Pythonic way of parsing files like
    > this? Say, I would like to transform a file containing entries like
    > the following into a list of lists, with colons treated as
    > separators, i.e. this:
    >
    > tm:$1$aaaa$bbbb:1010:6::/home/owner1/imap/domain1.tld/tm:/sbin/nologin
    >
    > would get transformed into this:
    >
    > [ ['tm', '$1$aaaa$bbbb', '1010', '6', '', '/home/owner1/imap/domain1.tld/tm',
    > '/sbin/nologin'], [...], [...] ]

    For the first one, you are parsing the entire file every time you want
    to look up just one domain. If it is reused several times while your
    code runs, you could store it instead, so that it's just a simple
    lookup away, e.g.:

    _domain_dict = dict()
    def generate_dict(input_file):
        finput = open(input_file, 'rb')
        global _domain_dict
        for each_line in enumerate(finput):
            line = each_line.strip().split(':')
            if len(line)==2: _domain_dict[line[0]] = line[1]

        finput.close()

    def domain_lookup(domain_name):
        global _domain_dict
        try:
            return _domain_dict[domain_name]
        except KeyError:
            return 'Unknown.Domain'


    Your second parsing example would be a simple case of:

    finput = open('input_file.ext', 'rb')
    results_list = []
    for each_line in enumerate(finput.readlines()):
        results_list.append( each_line.strip().split(':') )
    finput.close()


    • Duncan Booth

      #3
      Re: File to dict

      mrkafk@gmail.com wrote:
      > def lookupdmo(domain):
      >     lines = open('/etc/virtual/domainowners','r').readlines()
      >     lines = [ [y.lstrip().rstrip() for y in x.split(':')] for x in
      >               lines]
      >     lines = [ x for x in lines if len(x) == 2 ]
      >     d = dict()
      >     for line in lines:
      >         d[line[0]]=line[1]
      >     return d[domain]

      Just some minor points without changing the basis of what you have done
      here:

      Don't bother with 'readlines', file objects are directly iterable.
      Why are you calling both lstrip and rstrip? The strip method strips
      whitespace from both ends for you.

      It is usually a good idea with code like this to limit the split method to
      a single split in case there is more than one colon on the line: i.e.
      x.split(':',1)

      When you have a sequence whose elements are sequences with two elements
      (which is what you have here), you can construct a dict directly from the
      sequence.
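
The points above can be seen in a short sketch. The sample lines are made up for illustration (the last one deliberately has extra colons in the value); they are not taken from the real domainowners file.

```python
# Hypothetical sample lines; the last one has extra colons in the value.
sample = [
    "domain1.tld: owner1\n",
    "domain3.another : somebody\n",
    "domain4.tld: owner:with:colons\n",
]

# strip() removes whitespace from both ends, so lstrip().rstrip() is redundant.
assert "  owner1 \n".strip() == "  owner1 \n".lstrip().rstrip() == "owner1"

# Without a limit, the last line splits into four pieces, not two.
unlimited = sample[2].split(":")
# With a limit of 1, everything after the first colon stays together.
limited = sample[2].split(":", 1)

# dict() accepts a sequence of two-element sequences directly.
pairs = [[part.strip() for part in line.split(":", 1)] for line in sample]
owners = dict(pairs)
```

With an unlimited split, the `len(x) == 2` filter in the original function would silently drop the last line; with the limit it is kept as one key and one value.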

      But why do you construct a dict from that input data simply to throw it
      away? If you only want 1 domain from the file just pick it out of the list.
      If you want to do multiple lookups build the dict once and keep it around.

      So something like the following (untested code):

      from __future__ import with_statement

      def loaddomainowners():
          with open('/etc/virtual/domainowners','r') as infile:
              pairs = [ line.split(':', 1) for line in infile if ':' in line ]
              pairs = [ (domain.strip(), owner.strip())
                        for (domain,owner) in pairs ]
              return dict(pairs)

      DOMAINOWNERS = loaddomainowners()

      def lookupdmo(domain):
          return DOMAINOWNERS[domain]


      • Matt Nordhoff

        #4
        Re: File to dict

        Chris wrote:
        > For the first one you are parsing the entire file everytime you want
        > to lookup just one domain. If it is something reused several times
        > during your code execute you could think of rather storing it so it's
        > just a simple lookup away, for eg.
        >
        > _domain_dict = dict()
        > def generate_dict(input_file):
        >     finput = open(input_file, 'rb')
        >     global _domain_dict
        >     for each_line in enumerate(finput):
        >         line = each_line.strip().split(':')
        >         if len(line)==2: _domain_dict[line[0]] = line[1]
        >
        >     finput.close()
        >
        > def domain_lookup(domain_name):
        >     global _domain_dict
        >     try:
        >         return _domain_dict[domain_name]
        >     except KeyError:
        >         return 'Unknown.Domain'

        What about this?

        _domain_dict = dict()
        def generate_dict(input_file):
            global _domain_dict
            # If it's already been run, do nothing. You might want to change
            # this.
            if _domain_dict:
                return
            fh = open(input_file, 'rb')
            try:
                for line in fh:
                    line = line.strip().split(':', 1)
                    if len(line) == 2:
                        _domain_dict[line[0]] = line[1]
            finally:
                fh.close()

        def domain_lookup(domain_name):
            return _domain_dict.get(domain_name)

        I changed generate_dict to do nothing if it's already been run. (You
        might want it to run again with a fresh dict, or throw an error or
        something.)

        I removed enumerate() because it's unnecessary (and wrong -- you were
        trying to split a tuple of (index, line)).

        I also changed the split to only split once, like Duncan Booth suggested.

        The try-finally is to ensure that the file is closed if an exception is
        thrown for some reason.

        domain_lookup doesn't need to declare _domain_dict as global because
        it's not assigning to it. .get() returns None if the key doesn't exist,
        so now the function returns None. You might want to use a different
        value or throw an exception (use _domain_dict[domain_name] and not catch
        the KeyError if it doesn't exist, perhaps).
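
A small sketch of the lookup choices described above, for illustration only. The function names and the sample entry are made up; only `_domain_dict` comes from the code in this post.

```python
# Module-level cache; the sample entry is hypothetical.
_domain_dict = {"domain1.tld": "owner1"}

def lookup_or_none(domain_name):
    # Reading a module-level name needs no `global` declaration.
    # get() returns None for a missing key.
    return _domain_dict.get(domain_name)

def lookup_or_default(domain_name, default="unknown"):
    # get() also accepts an explicit fallback value.
    return _domain_dict.get(domain_name, default)

def lookup_or_raise(domain_name):
    # Plain indexing raises KeyError for a missing key.
    return _domain_dict[domain_name]

def reset_cache():
    # Rebinding the module-level name is the case that does need `global`.
    global _domain_dict
    _domain_dict = {}
```

Which of the three lookups fits best depends on whether a missing domain is an expected case or an error in the calling code.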

        Other than that, I just reformatted it and renamed variables, because I
        do that. :-P

        • Matt Nordhoff

          #5
          Re: File to dict

          Duncan Booth wrote:
          > Just some minor points without changing the basis of what you have done
          > here:
          >
          > Don't bother with 'readlines', file objects are directly iterable.
          > Why are you calling both lstrip and rstrip? The strip method strips
          > whitespace from both ends for you.
          >
          > It is usually a good idea with code like this to limit the split method to
          > a single split in case there is more than one colon on the line: i.e.
          > x.split(':',1)
          >
          > When you have a sequence whose elements are sequences with two elements
          > (which is what you have here), you can construct a dict directly from the
          > sequence.
          >
          > But why do you construct a dict from that input data simply to throw it
          > away? If you only want 1 domain from the file just pick it out of the list.
          > If you want to do multiple lookups build the dict once and keep it around.
          >
          > So something like the following (untested code):
          >
          > from __future__ import with_statement
          >
          > def loaddomainowners():
          >     with open('/etc/virtual/domainowners','r') as infile:
          >         pairs = [ line.split(':', 1) for line in infile if ':' in line ]
          >         pairs = [ (domain.strip(), owner.strip())
          >                   for (domain,owner) in pairs ]
          >         return dict(pairs)
          >
          > DOMAINOWNERS = loaddomainowners()
          >
          > def lookupdmo(domain):
          >     return DOMAINOWNERS[domain]

          Using two list comprehensions means you construct two lists, which sucks
          if it's a large file.

          Also, you could pass the list comprehension (or better yet a generator
          expression) directly to dict() without saving it to a variable:

          with open('/etc/virtual/domainowners','r') as fh:
              return dict(line.strip().split(':', 1) for line in fh)

          (Argh, that doesn't .strip() the key and value, which means it won't
          work, but it's so simple and elegant and I'm tired enough that I'm not
          going to add that. :-P Just use another genexp. Makes for a line
          complicated enough that it could be turned into a for loop, though.)
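
One possible way to add the missing strip with a second generator expression, as a sketch only. The function name `load_pairs` is hypothetical, and `io.StringIO` stands in for the real file.

```python
import io

def load_pairs(lines):
    # Inner generator: split each line once on the first colon,
    # skipping lines that contain no colon at all.
    split_lines = (line.split(":", 1) for line in lines if ":" in line)
    # Outer generator: strip both halves. dict() consumes it lazily,
    # so no intermediate list is built.
    return dict((key.strip(), value.strip()) for key, value in split_lines)

# Stand-in for the real file, including a blank line.
sample = io.StringIO("domain1.tld: owner1\ndomain3.another : somebody\n\n")
owners = load_pairs(sample)
```

Splitting the work across two named steps keeps each line short; whether that reads better than a plain for loop is a matter of taste.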

          • Chris

            #6
            Re: File to dict

            Ta Matt, wasn't paying attention to what I typed. :)
            And didn't know that about .get() and not having to declare the
            global.
            Thanks for my mandatory new thing for the day ;)


            • Bruno Desthuilliers

              #7
              Re: File to dict

              mrkafk@gmail.com wrote:
              > Hello everyone,
              (snip)
              > Say, I would like to transform a file containing entries like
              > the following into a list of lists with colons treated as
              > separators, i.e. this:
              >
              > tm:$1$aaaa$bbbb:1010:6::/home/owner1/imap/domain1.tld/tm:/sbin/nologin
              >
              > would get transformed into this:
              >
              > [ ['tm', '$1$aaaa$bbbb', '1010', '6', '', '/home/owner1/imap/domain1.tld/tm',
              > '/sbin/nologin'], [...], [...] ]

              The csv module is your friend.
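
A minimal sketch of what that could look like for the passwd-style example, assuming the fields never contain the delimiter. `io.StringIO` stands in for the real file, and the second sample line is made up.

```python
import csv
import io

# Stand-in for a passwd-style file; the second line is hypothetical.
sample = io.StringIO(
    "tm:$1$aaaa$bbbb:1010:6::/home/owner1/imap/domain1.tld/tm:/sbin/nologin\n"
    "jk:$1$cccc$dddd:1011:6::/home/owner2/imap/domain2.tld/jk:/sbin/nologin\n"
)

# QUOTE_NONE keeps any quote characters in the data as literal text.
reader = csv.reader(sample, delimiter=":", quoting=csv.QUOTE_NONE)
rows = [row for row in reader]
```

The empty field between the two adjacent colons comes back as an empty string. For the domainowners file, the fields would still need a `strip()` afterwards, since the reader does not remove whitespace before the delimiter.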


              • Matt Nordhoff

                #8
                Re: File to dict

                Chris wrote:
                > Ta Matt, wasn't paying attention to what I typed. :)
                > And didn't know that about .get() and not having to declare the
                > global.
                > Thanks for my mandatory new thing for the day ;)

                :-)

                • mrkafk@gmail.com

                  #9
                  Re: File to dict


                  Duncan Booth wrote:
                  > Just some minor points without changing the basis of what you have done
                  > here:

                  All good points, thanks. Phew, there's nothing like peer review for
                  your code...

                  > But why do you construct a dict from that input data simply to throw it
                  > away?

                  Because comparing strings for equality in a loop is writing C in
                  Python, and that's exactly what I'm trying to unlearn.

                  The proper way to do it is to produce a dictionary and look up a value
                  using a key.

                  > If you only want 1 domain from the file just pick it out of the list.

                  for item in list:
                      if item == 'searched.domain':
                          return item...

                  Yuck.

                  > with open('/etc/virtual/domainowners','r') as infile:
                  >     pairs = [ line.split(':', 1) for line in infile if ':' in line ]

                  Didn't think about doing it this way. Good point. Thx


                  • mrkafk@gmail.com

                    #10
                    Re: File to dict


                    > The csv module is your friend.

                    (slapping forehead) why the Holy Grail didn't I think about this? That
                    should be much simpler than using SimpleParse or SPARK.

                    Thx Bruno & everyone.


                    • Marc 'BlackJack' Rintsch

                      #11
                      Re: File to dict

                      On Fri, 07 Dec 2007 04:44:25 -0800, mrkafk wrote:
                      > Duncan Booth wrote:
                      >> But why do you construct a dict from that input data simply to throw it
                      >> away?
                      >
                      > Because comparing strings for equality in a loop is writing C in
                      > Python, and that's exactly what I'm trying to unlearn.
                      >
                      > The proper way to do it is to produce a dictionary and look up a value
                      > using a key.
                      >
                      >> If you only want 1 domain from the file just pick it out of the list.
                      >
                      > for item in list:
                      >     if item == 'searched.domain':
                      >         return item...
                      >
                      > Yuck.

                      I guess Duncan's point wasn't the construction of the dictionary but the
                      throw it away part. If you don't keep it, the loop above is even more
                      efficient than building a dictionary with *all* lines of the file, just to
                      pick one value afterwards.

                      Ciao,
                      Marc 'BlackJack' Rintsch
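
For a one-off lookup, the scan-and-stop approach described above might look like the sketch below. The function name `find_owner` and the choice to return None for a missing domain are assumptions; `io.StringIO` stands in for the real file.

```python
import io

def find_owner(lines, domain):
    # Scan line by line and return at the first match.
    # No dictionary is built and later lines are never parsed.
    for line in lines:
        key, sep, value = line.partition(":")
        if sep and key.strip() == domain:
            return value.strip()
    return None  # assumption: a missing domain yields None

sample = io.StringIO(
    "domain1.tld: owner1\ndomain2.tld: own2\ndomain3.another : somebody\n"
)
owner = find_owner(sample, "domain2.tld")
```

If the same file is consulted many times, building the dict once and keeping it around remains the better trade.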


                      • Bruno Desthuilliers

                        #12
                        Re: File to dict

                        mrkafk@gmail.com wrote:
                        >
                        >> The csv module is your friend.
                        >
                        > (slapping forehead) why the Holy Grail didn't I think about this?

                        If that can make you feel better, a few years ago, I spent two days
                        writing my own (SquaredWheel(tm) of course) csv reader/writer... before
                        realizing there was such a thing as the csv module :-/

                        Should have known better...


                        • mrkafk@gmail.com

                          #13
                          Re: File to dict

                          > I guess Duncan's point wasn't the construction of the dictionary but the
                          > throw it away part. If you don't keep it, the loop above is even more
                          > efficient than building a dictionary with *all* lines of the file, just to
                          > pick one value afterwards.

                          Sure, but I have two options here, neither of them nice: either "write C
                          in Python" or do it in an inefficient and still elaborate way.

                          Anyway, I found my nirvana at last:

                          >>> def shelper(line):
                          ...     return x.replace(' ','').strip('\n').split(':',1)
                          ...
                          >>> ownerslist = [ shelper(x)[1] for x in it if len(shelper(x)) == 2 and shelper(x)[0] == domain ]
                          >>> ownerslist
                          ['da2']


                          Python rulez. :-)





                          • mrkafk@gmail.com

                            #14
                            Re: File to dict

                            > >>> def shelper(line):
                            > ...     return x.replace(' ','').strip('\n').split(':',1)

                            Argh, typo, should be def shelper(x) of course.
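
With the parameter name corrected, the list comprehension still calls the helper up to three times per line. A sketch of one way to parse each line only once is below; the sample data and the value of `domain` are made up, and `it` is assumed to be an iterable of lines, as in the session above.

```python
import io

def shelper(x):
    # Same helper as above, with the parameter name corrected.
    return x.replace(' ', '').strip('\n').split(':', 1)

domain = 'domain2.tld'  # hypothetical lookup key
it = io.StringIO("domain1.tld: owner1\ndomain2.tld: da2\nbroken line\n")

# Parse each line once in an inner generator, then filter the results.
parsed = (shelper(x) for x in it)
ownerslist = [p[1] for p in parsed if len(p) == 2 and p[0] == domain]
```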



                            • Neil Cerutti

                              #15
                              Re: File to dict

                              On 2007-12-07, Duncan Booth <duncan.booth@invalid.invalid> wrote:
                              > from __future__ import with_statement
                              >
                              > def loaddomainowners():
                              >     with open('/etc/virtual/domainowners','r') as infile:

                              I've been thinking I have to use contextlib.closing for
                              auto-closing files. Is that not so?

                              --
                              Neil Cerutti
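
As far as I know, file objects have supported the with statement directly since Python 2.5, so contextlib.closing should not be needed for them; it is meant for objects that offer close() but are not context managers themselves. The sketch below shows both cases. The `Resource` class and the temporary file are made up for the example.

```python
import contextlib
import os
import tempfile

# Write a small throwaway file so the example is self-contained.
fd, path = tempfile.mkstemp()
with os.fdopen(fd, "w") as tmp:
    tmp.write("domain1.tld: owner1\n")

# A file object is its own context manager: it is closed on leaving the block.
with open(path, "r") as infile:
    first = infile.readline()

class Resource(object):
    # Hypothetical object that has close() but no __enter__/__exit__.
    def __init__(self):
        self.closed = False

    def close(self):
        self.closed = True

# contextlib.closing adapts such an object for use in a with statement.
with contextlib.closing(Resource()) as res:
    pass

os.remove(path)
```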
