efficient data loading with Python, is that possible?

  • igor.tatarinov@gmail.com

    efficient data loading with Python, is that possible?

    Hi, I am pretty new to Python and trying to use it for a relatively
    simple problem of loading a 5 million line text file and converting it
    into a few binary files. The text file has a fixed format (like a
    punchcard). The columns contain integer, real, and date values. The
    output files are the same values in binary. I have to parse the values
    and write the binary tuples out into the correct file based on a given
    column. It's a little more involved but that's not important.

    I have a C++ prototype of the parsing code and it loads a 5 Mline file
    in about a minute. I was expecting the Python version to be 3-4 times
    slower and I can live with that. Unfortunately, it's 20 times slower
    and I don't see how I can fix that.

    The fundamental difference is that in C++, I create a single object (a
    line buffer) that's reused for each input line, and column values are
    extracted straight from that buffer without creating new string
    objects. In Python, new objects must be created and destroyed by the
    million, which must incur serious memory management overhead.

    Correct me if I am wrong, but:

    1) for line in file: ...
    will create a new string object for every input line

    2) line[start:end]
    will create a new string object as well

    3) int(time.mktime(time.strptime(s, "%m%d%y%H%M%S")))
    will create 10 objects (since struct_time has 8 fields)

    4) a simple test: line[i:j] + line[m:n] in hash
    creates 3 strings and there is no way to avoid that.
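
    (For point 4, a tuple key can at least avoid building the
    concatenated string; a minimal sketch with made-up indices and a
    made-up lookup dict:)

        d = {('ABC   ', 'DEF   '): 1}    # hypothetical lookup table
        line = 'ABC   DEF   rest of the line'
        key = (line[0:6], line[6:12])    # two strings plus one tuple;
        found = key in d                 # no concatenated third string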

    I thought arrays would help but I can't load an array without creating
    a string first: ar(line, start, end) is not supported.
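
    (One partial workaround, for what it's worth: the struct module can
    pull every column out of a fixed-width line with one C-level call.
    It still builds one string per column, but it replaces N slice
    expressions with a single call. A minimal sketch, Python 2 as in the
    rest of this thread, with made-up field widths:)

        import struct

        # Hypothetical layout: a 10-char integer, a 6-char code and a
        # 12-char timestamp, 28 chars per record in all.
        REC_FORMAT = '10s6s12s'
        REC_LEN = struct.calcsize(REC_FORMAT)  # 28

        line = '0000000042CODE  071213120000\n'
        anInt, aCode, aDate = struct.unpack(REC_FORMAT, line[:REC_LEN])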

    I hope I am missing something. I really like Python but if there is no
    way to process data efficiently, that seems to be a problem.

    Thanks,
    igor

  • George Sakkis

    #2
    Re: efficient data loading with Python, is that possible?

    On Dec 12, 5:48 pm, igor.tatari...@gmail.com wrote:
    > I have a C++ prototype of the parsing code and it loads a 5 Mline
    > file in about a minute. I was expecting the Python version to be
    > 3-4 times slower and I can live with that. Unfortunately, it's 20
    > times slower and I don't see how I can fix that.
    [snip]

    20 times slower because of garbage collection sounds kinda fishy.
    Posting some actual code usually helps; it's hard to tell for sure
    otherwise.

    George


    • igor.tatarinov@gmail.com

      #3
      Re: efficient data loading with Python, is that possible?

      On Dec 12, 4:03 pm, John Machin <sjmac...@lexicon.net> wrote:
      > Inside your function
      > [you are doing all this inside a function, not at global level in
      > a script, aren't you?], do this:
      >     from time import mktime, strptime # do this ONCE
      >     ...
      >     blahblah = int(mktime(strptime(s, "%m%d%y%H%M%S")))
      >
      > It would help if you told us what platform, what version of
      > Python, how much memory, how much swap space, ...
      >
      > Cheers,
      > John
      I am using a global 'from time import ...'. I will try to do that
      within the function and see if it makes a difference.

      The computer I am using has 8G of RAM. It's a Linux dual-core AMD or
      something like that. Python 2.4

      Here is some of my code. Tell me what's wrong with it :)

      def loadFile(inputFile, loader):
          # .zip files don't work with zlib
          f = popen('zcat ' + inputFile)
          for line in f:
              loader.handleLine(line)
          ...

      In the Loader class:

      def handleLine(self, line):
          # filter out 'wrong' lines
          if not self._dataFormat(line): return

          # add a new output record
          rec = self.result.addRecord()

          for col in self._dataFormat.colFormats:
              value = parseValue(line, col)
              rec[col.attr] = value

      And here is parseValue (will using a hash-based dispatch make it
      much faster?):

      def parseValue(line, col):
          s = line[col.start:col.end+1]
          # no switch in python
          if col.format == ColumnFormat.DATE:
              return Format.parseDate(s)
          if col.format == ColumnFormat.UNSIGNED:
              return Format.parseUnsigned(s)
          if col.format == ColumnFormat.STRING:
              # and-or trick (no x ? y : z in Python 2.4)
              return not col.strip and s or rstrip(s)
          if col.format == ColumnFormat.BOOLEAN:
              return s == col.arg and 'Y' or 'N'
          if col.format == ColumnFormat.PRICE:
              return Format.parseUnsigned(s)/100.

      And here is Format.parseDate() as an example:

      def parseDate(s):
          # missing (infinite) value?
          if s.startswith('999999') or s.startswith('000000'): return -1
          return int(mktime(strptime(s, "%y%m%d")))

      Hopefully, this should be enough to tell what's wrong with my code.

      Thanks again,
      igor


      • John Machin

        #4
        Re: efficient data loading with Python, is that possible?

        On Dec 13, 11:44 am, igor.tatari...@gmail.com wrote:
        > Here is some of my code. Tell me what's wrong with it :)
        > [snip: the code quoted in full in the previous post]
        > Hopefully, this should be enough to tell what's wrong with my
        > code.

        I have to go out now, so here's a quick overview: too many goddam dots
        and too many goddam method calls.
        1. Do:
               colfmt = col.format # ONCE
               if colfmt == ...
        2. No switch, so put the most frequent format at the top.
        3. What is ColumnFormat? What is Format? I think you have gone
        class-crazy, and there's more overhead than working code ...
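
        (A minimal sketch of points 1 and 2 applied to the parseValue
        posted above; the default-argument bindings are a common trick
        for turning repeated global/attribute lookups into fast local
        lookups, and which format is most frequent is a guess:)

            def parseValue(line, col,
                          UNSIGNED=ColumnFormat.UNSIGNED,  # bound once,
                          DATE=ColumnFormat.DATE,          # at def time
                          parseUnsigned=Format.parseUnsigned,
                          parseDate=Format.parseDate):
                colfmt = col.format  # ONE attribute lookup, reused below
                s = line[col.start:col.end+1]
                if colfmt == UNSIGNED:   # assumed most frequent
                    return parseUnsigned(s)
                if colfmt == DATE:
                    return parseDate(s)
                # ... remaining formats as before ...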

        Cheers,
        John


        • Steven D'Aprano

          #5
          Re: efficient data loading with Python, is that possible?

          On Wed, 12 Dec 2007 14:48:03 -0800, igor.tatarinov wrote:
          > [snip] I have to parse the values and write the binary
          > tuples out into the correct file based on a given column.
          > It's a little more involved but that's not important.
          I suspect that this actually is important, and that your slowdown has
          everything to do with the stuff you dismiss and nothing to do with
          Python's object model or execution speed.

          > I have a C++ prototype of the parsing code and it loads a 5
          > Mline file in about a minute. I was expecting the Python
          > version to be 3-4 times slower and I can live with that.
          > Unfortunately, it's 20 times slower and I don't see how I
          > can fix that.
          I've run a quick test on my machine with a mere 1GB of RAM, reading the
          entire file into memory at once, and then doing some quick processing on
          each line:

          >>> def make_big_file(name, size=5000000):
          ...     fp = open(name, 'w')
          ...     for i in xrange(size):
          ...         fp.write('here is a bunch of text with a newline\n')
          ...     fp.close()
          ...
          >>> make_big_file('BIG')
          >>>
          >>> def test(name):
          ...     import time
          ...     start = time.time()
          ...     fp = open(name, 'r')
          ...     for line in fp.readlines():
          ...         line = line.strip()
          ...         words = line.split()
          ...     fp.close()
          ...     return time.time() - start
          ...
          >>> test('BIG')
          22.53150200843811

          Twenty-two seconds to read five million lines and split them
          into words. I suggest the other nineteen minutes and forty-odd
          seconds your code is taking has something to do with your code
          and not Python's execution speed.

          Of course, I wouldn't normally read all 5M lines into memory in one big
          chunk. Replace the code

          for line in fp.readlines():

          with

          for line in fp:

          and the time drops from 22 seconds to 16.



          --
          Steven


          • DouhetSukd

            #6
            Re: efficient data loading with Python, is that possible?

            Back about 8 yrs ago, on PC hardware, I was reading twin 5 MB
            files and doing a 'fancy' diff between the 2, in about 60
            seconds. Granted, your file is likely bigger, but so is
            modern hardware, and 20 mins does seem a bit high.

            Can't talk about the rest of your code, but some parts of it
            may be optimized.

            def parseValue(line, col):
                s = line[col.start:col.end+1]
                # no switch in python
                if col.format == ColumnFormat.DATE:
                    return Format.parseDate(s)
                if col.format == ColumnFormat.UNSIGNED:
                    return Format.parseUnsigned(s)

            How about taking the big if clause out? That would require making all
            the formatters into functions, rather than in-lining some of them, but
            it may clean things up.

            # Prebuild a lookup of functions vs. expected formats...
            # This is done once.
            # Remember, you have to position this dict's computation
            # _after_ all the Format.parseXXX declarations. Don't worry,
            # Python _will_ complain if you don't.

            dict_format_func = {ColumnFormat.DATE: Format.parseDate,
                                ColumnFormat.UNSIGNED: Format.parseUnsigned,
                                ...
                               }

            def parseValue(line, col):
                s = line[col.start:col.end+1]
                # get applicable function, apply it to s
                return dict_format_func[col.format](s)

            Also...

            if col.format == ColumnFormat.STRING:
                # and-or trick (no x ? y : z in Python 2.4)
                return not col.strip and s or rstrip(s)

            Watch out! 'col.strip' here is not the result of stripping
            the column, it is the strip _function_ itself, bound to the
            col object, so it will always be true. I get caught by those
            things all the time :-(
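
            (A quick interpreter session shows the trap, using a plain
            string's strip method in place of col.strip:)

                >>> bool(''.strip)    # the bound method object: truthy
                True
                >>> bool(''.strip())  # calling it on '' gives '', falsy
                False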

            I agree that taking out the dot.dot.dots would help, but I
            wouldn't expect it to matter that much, unless it was in an
            incredibly tight loop.

            It might be that.

            if s.startswith('999999') or s.startswith('000000'): return -1

            would be better as...

            # Outside of the loop, define a set of values for which you
            # want to return -1.
            set_return = set(['999999', '000000'])

            # Look up the first 6 chars in your set.
            def parseDate(s):
                if s[0:6] in set_return:
                    return -1
                return int(mktime(strptime(s, "%y%m%d")))

            Bottom line: Python built-in data objects, such as dictionaries and
            sets, are very much optimized. Relying on them, rather than writing a
            lot of ifs and doing weird data structure manipulations in Python
            itself, is a good approach to try. Try to build those objects outside
            of your main processing loops.

            Cheers

            Douhet-did-suck


            • Steven D'Aprano

              #7
              Re: efficient data loading with Python, is that possible?

              On Wed, 12 Dec 2007 16:44:01 -0800, igor.tatarinov wrote:
              > Here is some of my code. Tell me what's wrong with it :)
              >
              > def loadFile(inputFile, loader):
              >     # .zip files don't work with zlib

              Pardon?

              >     f = popen('zcat ' + inputFile)
              >     for line in f:
              >         loader.handleLine(line)
              Do you really need to compress the file? Five million lines isn't a lot.
              It depends on the length of each line, naturally, but I'd be surprised if
              it were more than 100MB.
              > ...
              >
              > In the Loader class:
              > def handleLine(self, line):
              >     # filter out 'wrong' lines
              >     if not self._dataFormat(line): return

              Who knows what the _dataFormat() method does? How complicated is it? Why
              is it a private method?

              >     # add a new output record
              >     rec = self.result.addRecord()

              Who knows what this does? How complicated is it?

              >     for col in self._dataFormat.colFormats:
              Hmmm... a moment ago, _dataFormat seemed to be a method, or at least a
              callable. Now it has grown a colFormats attribute. Complicated and
              confusing.

              >         value = parseValue(line, col)
              >         rec[col.attr] = value
              >
              > And here is parseValue (will using a hash-based dispatch
              > make it much faster?):
              Possibly, but not enough to reduce 20 minutes to one or two.

              But you know something? Your code looks like a bad case of
              over-generalisation. I assume it's a translation of your
              C++ code -- no wonder it takes an entire minute to process
              the file! (Oh lord, did I just say that???) Object-oriented
              programming is a useful tool, but sometimes you don't need
              a HyperDispatcherLoaderManagerCreator, you just need a
              hammer.

              In your earlier post, you gave the data specification:

              "The text file has a fixed format (like a punchcard). The columns contain
              integer, real, and date values. The output files are the same values in
              binary."

              Easy-peasy. First, some test data:


              fp = open('BIG', 'w')
              for i in xrange(5000000):
                  anInt = i % 3000
                  aBool = ['TRUE', 'YES', '1', 'Y', 'ON',
                           'FALSE', 'NO', '0', 'N', 'OFF'][i % 10]
                  aFloat = ['1.12', '-3.14', '0.0', '7.42'][i % 4]
                  fp.write('%s %s %s\n' % (anInt, aBool, aFloat))
                  if i % 45000 == 0:
                      # Write a comment and a blank line.
                      fp.write('# this is a comment\n \n')
              fp.close()



              Now let's process it:


              import struct

              # Define converters for each type of value to binary.
              def fromBool(s):
                  """String to boolean byte."""
                  s = s.upper()
                  if s in ('TRUE', 'YES', '1', 'Y', 'ON'):
                      return struct.pack('b', True)
                  elif s in ('FALSE', 'NO', '0', 'N', 'OFF'):
                      return struct.pack('b', False)
                  else:
                      raise ValueError('not a valid boolean')

              def fromInt(s):
                  """String to integer bytes."""
                  return struct.pack('l', int(s))

              def fromFloat(s):
                  """String to floating point bytes."""
                  return struct.pack('f', float(s))


              # Assume three fields...
              DEFAULT_FORMAT = [fromInt, fromBool, fromFloat]

              # And three files...
              OUTPUT_FILES = ['ints.out', 'bools.out', 'floats.out']


              def process_line(s, format=DEFAULT_FORMAT):
                  s = s.strip()
                  fields = s.split() # I assume the fields are whitespace separated
                  assert len(fields) == len(format)
                  return [f(x) for (x, f) in zip(fields, format)]

              def process_file(infile, outfiles=OUTPUT_FILES):
                  out = [open(f, 'wb') for f in outfiles]
                  for line in file(infile, 'r'):
                      # ignore leading/trailing whitespace and comments
                      line = line.strip()
                      if line and not line.startswith('#'):
                          fields = process_line(line)
                          # now write the fields to the files
                          for x, fp in zip(fields, out):
                              fp.write(x)
                  for f in out:
                      f.close()



              And now let's use it and see how long it takes:

              >>> import time
              >>> s = time.time(); process_file('BIG'); time.time() - s
              129.58465385437012


              Naturally if your converters are more complex (e.g. date-time), or if you
              have more fields, it will take longer to process, but then I've made no
              effort at all to optimize the code.



              --
              Steven.


              • bearophileHUGS@lycos.com

                #8
                Re: efficient data loading with Python, is that possible?

                igor:
                > The fundamental difference is that in C++, I create a
                > single object (a line buffer) that's reused for each
                > input line and column values are extracted straight
                > from that buffer without creating new string objects.
                > In python, new objects must be created and destroyed
                > by the million which must incur serious memory
                > management overhead.
                Python indeed creates many objects (as I think Tim once
                said, "it allocates memory at a ferocious rate"), but its
                memory management is quite efficient. And you may use the
                JIT Psyco (currently 1000 times more useful than PyPy,
                despite sadly not being developed anymore), which in some
                situations avoids data copying (for example, in slices).
                Python is designed for string processing, and in my
                experience string-processing Psyco programs may be faster
                than similar not-optimized-to-death C++/D programs (you
                can see that with manually crafted code, or from ShedSkin,
                which is often slower than Psyco at string processing).
                But in every language I know, to gain performance you
                need to know the language, and Python isn't C++, so other
                kinds of tricks are necessary.
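
                (For reference, wiring Psyco in took only a couple of
                lines; a minimal sketch, Python 2 only:)

                    import psyco
                    psyco.full()   # JIT-specialize everything it can
                    # or target just the hot function instead, e.g.:
                    # psyco.bind(parseValue)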

                The following advice is useful too:

                DouhetSukd:
                > Bottom line: Python built-in data objects, such as
                > dictionaries and sets, are very much optimized. Relying
                > on them, rather than writing a lot of ifs and doing
                > weird data structure manipulations in Python itself, is
                > a good approach to try. Try to build those objects
                > outside of your main processing loops.

                Bye,
                bearophile


                • Neil Cerutti

                  #9
                  Re: efficient data loading with Python, is that possible?

                  On 2007-12-13, igor.tatarinov@gmail.com <igor.tatarinov@gmail.com> wrote:
                  > Here is some of my code. Tell me what's wrong with
                  > it :)
                  > [snip: the code quoted in full in post #3]

                  An inefficient parsing technique is probably to blame.
                  You first inspect the line to make sure it is valid,
                  then you inspect it (number of column types) times to
                  discover what data type it contains, and then you
                  inspect it *again* to finally translate it.

                  > And here is parseValue (will using a hash-based
                  > dispatch make it much faster?)

                  Not much.

                  You should be able to validate, recognize and translate all in
                  one pass. Get pyparsing to help, if need be.

                  What does your data look like?

                  --
                  Neil Cerutti


                  • Chris Gonnerman

                    #10
                    Re: [Python] Re: efficient data loading with Python, is that possible?

                    Neil Cerutti wrote:
                    > An inefficient parsing technique is probably to
                    > blame. You first inspect the line to make sure it
                    > is valid, then you inspect it (number of column
                    > types) times to discover what data type it
                    > contains, and then you inspect it *again* to
                    > finally translate it.

                    I was thinking just that. It is much more "pythonic" to simply attempt
                    to convert the values in whatever fashion they are supposed to be
                    converted, and handle errors in data format by means of exceptions.
                    IMO, of course. In the "trivial" case, where there are no errors in the
                    data file, this is a heck of a lot faster.
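
                    (A minimal sketch of that EAFP style against igor's
                    fixed-width columns; the column layout and converters
                    here are made up for illustration:)

                        # Hypothetical layout: (start, end, converter)
                        # triples, one per fixed-width column.
                        COLUMNS = [(0, 10, int), (10, 18, float)]

                        def handleLine(line, columns=COLUMNS):
                            try:
                                # Convert every field in one pass; no
                                # separate validation step.
                                return [conv(line[start:end])
                                        for (start, end, conv) in columns]
                            except ValueError:
                                # Only malformed lines pay for handling.
                                return None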

                    -- Chris.
