Anyone happen to have optimization hints for this loop?

Collapse
This topic is closed.
X
X
 
  • Time
  • Show
Clear All
new posts
  • dp_pearce

    Anyone happen to have optimization hints for this loop?

    I have some code that takes data from an Access database and processes
    it into text files for another application. At the moment, I am using
    a number of loops that are pretty slow. I am not a hugely experienced
    python user so I would like to know if I am doing anything
    particularly wrong or that can be hugely improved through the use of
    another method.

    Currently, all of the values that are to be written to file are pulled
    from the database and into a list called "domainVa". These values
    represent 3D data and need to be written to text files using line
    breaks to seperate 'layers'. I am currently looping through the list
    and appending a string, which I then write to file. This list can
    regularly contain upwards of half a million values...

    count = 0
    dmntString = ""
    for z in range(0, Z):
    for y in range(0, Y):
    for x in range(0, X):
    fraction = domainVa[count]
    dmntString += " "
    dmntString += fraction
    count = count + 1
    dmntString += "\n"
    dmntString += "\n"
    dmntString += "\n***\n

    dmntFile = open(dmntFilena me, 'wt')
    dmntFile.write( dmntString)
    dmntFile.close( )

    I have found that it is currently taking ~3 seconds to build the
    string but ~1 second to write the string to file, which seems wrong (I
    would normally guess the CPU/Memory would out perform disc writing
    speeds).

    Can anyone see a way of speeding this loop up? Perhaps by changing the
    data format? Is it wrong to append a string and write once, or should
    hold a file open and write at each instance?

    Thank you in advance for your time,

    Dan
  • writeson

    #2
    Re: Anyone happen to have optimization hints for this loop?

    On Jul 9, 12:04 pm, dp_pearce <dp_pea...@hotm ail.comwrote:
    I have some code that takes data from an Access database and processes
    it into text files for another application. At the moment, I am using
    a number of loops that are pretty slow. I am not a hugely experienced
    python user so I would like to know if I am doing anything
    particularly wrong or that can be hugely improved through the use of
    another method.
    >
    Currently, all of the values that are to be written to file are pulled
    from the database and into a list called "domainVa". These values
    represent 3D data and need to be written to text files using line
    breaks to seperate 'layers'. I am currently looping through the list
    and appending a string, which I then write to file. This list can
    regularly contain upwards of half a million values...
    >
    count = 0
    dmntString = ""
    for z in range(0, Z):
        for y in range(0, Y):
            for x in range(0, X):
                fraction = domainVa[count]
                dmntString += "  "
                dmntString += fraction
                count = count + 1
            dmntString += "\n"
        dmntString += "\n"
    dmntString += "\n***\n
    >
    dmntFile     = open(dmntFilena me, 'wt')
    dmntFile.write( dmntString)
    dmntFile.close( )
    >
    I have found that it is currently taking ~3 seconds to build the
    string but ~1 second to write the string to file, which seems wrong (I
    would normally guess the CPU/Memory would out perform disc writing
    speeds).
    >
    Can anyone see a way of speeding this loop up? Perhaps by changing the
    data format? Is it wrong to append a string and write once, or should
    hold a file open and write at each instance?
    >
    Thank you in advance for your time,
    >
    Dan
    Hi Dan,

    Looking at the code sample you sent, you could do some clever stuff
    making dmntString a list rather than a string and appending everywhere
    you're doing a +=. Then at the end you build the string your write to
    the file one time with a dmntFile.write( ''.join(dmntLis t). But I think
    the more straightforward thing would be to replace all the dmntString
    += ... lines in the loops with a dmntFile.write( whatever), you're just
    constantly adding onto the file in various ways.

    I think the slowdown you're seeing your code as written comes from
    Python string being immutable. Every time you perform a dmntString
    += ... in the loops you're creating a new dmntString, copying in the
    contents of the old, plus the appended content. And if your list can
    reach a half a million items, well that's a TON of string create,
    string copy operations.

    Hope you find this helpful,
    Doug

    Comment

    • Casey

      #3
      Re: Anyone happen to have optimization hints for this loop?

      On Jul 9, 12:04 pm, dp_pearce <dp_pea...@hotm ail.comwrote:
      I have some code that takes data from an Access database and processes
      it into text files for another application. At the moment, I am using
      a number of loops that are pretty slow. I am not a hugely experienced
      python user so I would like to know if I am doing anything
      particularly wrong or that can be hugely improved through the use of
      another method.
      >
      Currently, all of the values that are to be written to file are pulled
      from the database and into a list called "domainVa". These values
      represent 3D data and need to be written to text files using line
      breaks to seperate 'layers'. I am currently looping through the list
      and appending a string, which I then write to file. This list can
      regularly contain upwards of half a million values...
      >
      count = 0
      dmntString = ""
      for z in range(0, Z):
          for y in range(0, Y):
              for x in range(0, X):
                  fraction = domainVa[count]
                  dmntString += "  "
                  dmntString += fraction
                  count = count + 1
              dmntString += "\n"
          dmntString += "\n"
      dmntString += "\n***\n
      >
      dmntFile     = open(dmntFilena me, 'wt')
      dmntFile.write( dmntString)
      dmntFile.close( )
      >
      I have found that it is currently taking ~3 seconds to build the
      string but ~1 second to write the string to file, which seems wrong (I
      would normally guess the CPU/Memory would out perform disc writing
      speeds).
      >
      Can anyone see a way of speeding this loop up? Perhaps by changing the
      data format? Is it wrong to append a string and write once, or should
      hold a file open and write at each instance?
      >
      Thank you in advance for your time,
      >
      Dan
      Maybe try something like this ...

      count = 0
      dmntList = []
      for z in xrange(Z):
      for y in xrange(Y):
      dmntList.extend ([" "+domainVa[count+x] for x in xrange(X)])
      dmntList.append ("\n")
      count += X
      dmntList.append ("\n")
      dmntList.append ("\n***\n")
      dmntString = ''.join(dmntLis t)

      Comment

      • Paul Hankin

        #4
        Re: Anyone happen to have optimization hints for this loop?

        On Jul 9, 5:04 pm, dp_pearce <dp_pea...@hotm ail.comwrote:
        count = 0
        dmntString = ""
        for z in range(0, Z):
            for y in range(0, Y):
                for x in range(0, X):
                    fraction = domainVa[count]
                    dmntString += "  "
                    dmntString += fraction
                    count = count + 1
                dmntString += "\n"
            dmntString += "\n"
        dmntString += "\n***\n
        >
        dmntFile     = open(dmntFilena me, 'wt')
        dmntFile.write( dmntString)
        dmntFile.close( )
        Can anyone see a way of speeding this loop up?
        I'd consider writing it like this:

        def dmntGenerator() :
        count = 0
        for z in xrange(Z):
        for y in xrange(Y):
        for x in xrange(X):
        yield ' '
        yield domainVa[count]
        count += 1
        yield '\n'
        yield '\n'
        yield '\n***\n'

        You can make the string using ''.join:

        dmntString = ''.join(dmntGen erator())

        But if you don't need the string, just write straight to the file:

        for part in dmntGenerator() :
        dmntFile.write( part)

        This is likely to be a lot faster as no large string is produced.

        --
        Paul Hankin

        Comment

        • bruno.desthuilliers@gmail.com

          #5
          Re: Anyone happen to have optimization hints for this loop?

          On 9 juil, 18:04, dp_pearce <dp_pea...@hotm ail.comwrote:
          I have some code that takes data from an Access database and processes
          it into text files for another application. At the moment, I am using
          a number of loops that are pretty slow. I am not a hugely experienced
          python user so I would like to know if I am doing anything
          particularly wrong or that can be hugely improved through the use of
          another method.
          >
          Currently, all of the values that are to be written to file are pulled
          from the database and into a list called "domainVa". These values
          represent 3D data and need to be written to text files using line
          breaks to seperate 'layers'. I am currently looping through the list
          and appending a string, which I then write to file. This list can
          regularly contain upwards of half a million values...
          >
          count = 0
          dmntString = ""
          for z in range(0, Z):
          for y in range(0, Y):
          for x in range(0, X):
          fraction = domainVa[count]
          dmntString += " "
          dmntString += fraction
          count = count + 1
          dmntString += "\n"
          dmntString += "\n"
          dmntString += "\n***\n
          >
          dmntFile = open(dmntFilena me, 'wt')
          dmntFile.write( dmntString)
          dmntFile.close( )
          >
          I have found that it is currently taking ~3 seconds to build the
          string but ~1 second to write the string to file, which seems wrong (I
          would normally guess the CPU/Memory would out perform disc writing
          speeds).
          Not necessarily - when the dataset becomes too big and your process
          has eaten all available RAM, your OS starts swapping, and then it's
          getting worse than disk IO. IOW, for large datasets (for a definition
          of 'large' depending on the available resources), you're sometimes
          better doing direct disk access - which BTW are usually buffered by
          the OS.
          Can anyone see a way of speeding this loop up? Perhaps by changing the
          data format?
          Almost everyone told you to use a list and str.join()... Which used to
          be a sound advice wrt/ both performances and readability, but nowadays
          "only" makes your code more readable (and pythonic...) :

          bruno@bibi ~ $ python
          Python 2.5.1 (r251:54863, Apr 6 2008, 17:20:35)
          [GCC 4.1.2 (Gentoo 4.1.2 p1.0.2)] on linux2
          Type "help", "copyright" , "credits" or "license" for more information.
          >>from timeit import Timer
          >>def dostr():
          .... s = ''
          .... for i in xrange(10000):
          .... s += ' ' + str(i)
          ....
          >>def dolist():
          .... s = []
          .... for i in xrange(10000):
          .... s.append(str(i) )
          .... s = ' '.join(s)
          ....
          >>tstr = Timer("dostr", "from __main__ import dostr")
          >>tlist = Timer("dolist", "from __main__ import dolist")
          >>tlist.timeit( 10000000)
          1.4280490875244 141
          >>tstr.timeit(1 0000000)
          1.4347598552703 857
          >>>
          The list + str.join version is only marginaly faster... But you should
          consider this solution even if doesn't that change much to perfs -
          readabilty counts, too//
          Is it wrong to append a string and write once, or should
          hold a file open and write at each instance?
          Is it really a matter of one XOR the other ? Perhaps you should try a
          midway solution, ie building not-too-big chunks as lists, and writing
          them to the (opened file) ? This would avoids possible swap and reduce
          disk IO. I suggest you try this approach with different list-size /
          write ratios, using the timeit module (and eventually the "top"
          program on unix or it's equivalent if you're on another platform to
          check memory/CPU usage) to find out which ratio works best for a
          representative sample of your input data. That's at least what I'd
          do...

          HTH

          Comment

          • dp_pearce

            #6
            Re: Anyone happen to have optimization hints for this loop?

            Thank you so much for all your advice. I have learnt a lot.

            In the end, the solution was perhaps self evident. Why try and build a
            huge string AND THEN write it to file when you can just write it to
            file? Writing this much data directly to file completed in ~1.5
            seconds instead of the 3-4 seconds that any of the .join methods
            produced.

            Interestingly, I found that my tests agreed with Bruno. There wasn't a
            huge change in speed between a lot of the other methods, other than
            them looking a lot tidier.

            Again, many thanks.

            Dan

            Comment

            Working...