marshal.dumps quadratic growth and marshal.dump not allowing file-like objects

  • bkustel@gmail.com

    marshal.dumps quadratic growth and marshal.dump not allowing file-like objects

    I'm stuck on a problem where I want to use marshal for serialization
    (yes, yes, I know (c)Pickle is normally recommended here). I favor
    marshal for speed for the types of data I use.

    However, marshal.dumps() seems to have a quadratic performance
    problem for large objects, which I assume comes from growing its
    memory buffer in constant increments. This causes a nasty slowdown
    when marshaling large objects. I thought I would get around this by
    passing a cStringIO.StringIO object to marshal.dump() instead, but I
    quickly learned this is not supported (only true file objects are).

    Any ideas about how to get around the marshal quadratic issue? Any
    hope for a fix for that on the horizon? Thanks for any information.
  • TheSaint

    #2
    Re: marshal.dumps quadratic growth and marshal.dump not allowing file-like objects

    On 16:04, Sunday 15 June 2008 bkustel@gmail.com wrote:
    > cStringIO.StringIO object to marshal.dump() instead but I quickly
    > learned this is not supported (only true file objects are supported).
    >
    > Any ideas about how to get around the marshal quadratic issue? Any
    > hope for a fix for that on the horizon?

    If you zip the cStringIO.StringIO object, would it be possible?

    --
    Mailsweeper Home : http://it.geocities.com/call_me_not_now/index.html


    • Peter Otten

      #3
      Re: marshal.dumps quadratic growth and marshal.dump not allowing file-like objects

      bkustel@gmail.com wrote:
      > I'm stuck on a problem where I want to use marshal for serialization
      > (yes, yes, I know (c)Pickle is normally recommended here). I favor
      > marshal for speed for the types of data I use.
      >
      > However it seems that marshal.dumps() for large objects has a
      > quadratic performance issue which I'm assuming is that it grows its
      > memory buffer in constant increments. This causes a nasty slowdown for
      > marshaling large objects. I thought I would get around this by passing
      > a cStringIO.StringIO object to marshal.dump() instead but I quickly
      > learned this is not supported (only true file objects are supported).
      >
      > Any ideas about how to get around the marshal quadratic issue? Any
      > hope for a fix for that on the horizon? Thanks for any information.

      Here's how marshal resizes the string:

          newsize = size + size + 1024;
          if (newsize > 32*1024*1024) {
              newsize = size + 1024*1024;
          }

      Maybe you can split your large objects and marshal multiple objects to keep
      the size below the 32MB limit.
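
A minimal sketch of that splitting idea, assuming the large object is a list whose items are all marshal-friendly types. It is written for Python 3 rather than the Python 2.5 discussed in this thread, and the helper names `dump_in_chunks` and `load_chunks` as well as the default `chunk_size` are made up for illustration:

```python
import marshal


def dump_in_chunks(items, chunk_size=10000):
    """Marshal a list as several independent byte strings.

    Each slice is serialized on its own, so no single marshal buffer
    has to grow to the size of the whole data set.
    """
    return [
        marshal.dumps(items[i:i + chunk_size])
        for i in range(0, len(items), chunk_size)
    ]


def load_chunks(blobs):
    """Rebuild the original list from the marshaled slices."""
    result = []
    for blob in blobs:
        result.extend(marshal.loads(blob))
    return result


data = list(range(25000))
blobs = dump_in_chunks(data)
restored = load_chunks(blobs)
```

The right chunk size depends on how large each item is once serialized; the value above is only a placeholder.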

      Peter


      • Raymond Hettinger

        #4
        Re: marshal.dumps quadratic growth and marshal.dump not allowing file-like objects

        On Jun 15, 1:04 am, bkus...@gmail.com wrote:
        > However it seems that marshal.dumps() for large objects has a
        > quadratic performance issue which I'm assuming is that it grows its
        > memory buffer in constant increments.

        Looking at the source in http://svn.python.org/projects/pytho...thon/marshal.c
        it looks like the relevant fragment is in w_more():

            . . .
            size = PyString_Size(p->str);
            newsize = size + size + 1024;
            if (newsize > 32*1024*1024) {
                newsize = size + 1024*1024;
            }
            if (_PyString_Resize(&p->str, newsize) != 0) {
            . . .

        When more space is needed, the resize operation over-allocates by
        double the previous need plus 1K. This should give amortized O(1)
        performance just like list.append().

        However, when that strategy requests more than 32MB, the resizing
        becomes less aggressive and grows only in 1MB blocks, giving your
        observed nasty quadratic behavior.

        Raymond
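
To make the difference between the growth rules concrete, here is a rough simulation in Python. It is only a sketch: it assumes the worst case where every resize copies the whole buffer, the 50-byte starting size is an assumption, and the names `grow_linear`, `grow_capped_doubling` and `simulate` are invented for this example rather than taken from marshal.c.

```python
KB = 1024
MB = 1024 * KB


def grow_linear(size):
    """Fixed 1K increment: newsize = size + 1024."""
    return size + KB


def grow_capped_doubling(size):
    """Doubling plus 1K, falling back to 1MB steps above 32MB."""
    newsize = size + size + KB
    if newsize > 32 * MB:
        newsize = size + MB
    return newsize


def simulate(target, grow, start=50):
    """Return (number of resizes, bytes copied) needed to reach target bytes.

    Assumes each resize copies the entire current buffer (worst case).
    """
    size, resizes, copied = start, 0, 0
    while size < target:
        copied += size
        size = grow(size)
        resizes += 1
    return resizes, copied


lin_resizes, lin_copied = simulate(10 * MB, grow_linear)
dbl_resizes, dbl_copied = simulate(10 * MB, grow_capped_doubling)
print("linear: ", lin_resizes, "resizes,", lin_copied, "bytes copied")
print("doubling:", dbl_resizes, "resizes,", dbl_copied, "bytes copied")
```

For a 10MB target the linear rule needs on the order of ten thousand resizes, while the doubling rule needs about a dozen, which is where the quadratic versus amortized-constant difference comes from.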


        • John Machin

          #5
          Re: marshal.dumps quadratic growth and marshal.dump not allowing file-like objects

          On Jun 15, 7:47 pm, Peter Otten <__pete...@web.de> wrote:
          > bkus...@gmail.com wrote:
          > > I'm stuck on a problem where I want to use marshal for serialization
          > > (yes, yes, I know (c)Pickle is normally recommended here). I favor
          > > marshal for speed for the types of data I use.
          > >
          > > However it seems that marshal.dumps() for large objects has a
          > > quadratic performance issue which I'm assuming is that it grows its
          > > memory buffer in constant increments. This causes a nasty slowdown for
          > > marshaling large objects. I thought I would get around this by passing
          > > a cStringIO.StringIO object to marshal.dump() instead but I quickly
          > > learned this is not supported (only true file objects are supported).
          > >
          > > Any ideas about how to get around the marshal quadratic issue? Any
          > > hope for a fix for that on the horizon? Thanks for any information.
          >
          > Here's how marshal resizes the string:
          >
          > newsize = size + size + 1024;
          > if (newsize > 32*1024*1024) {
          >     newsize = size + 1024*1024;
          > }
          >
          > Maybe you can split your large objects and marshal multiple objects to keep
          > the size below the 32MB limit.

          But that change went into the svn trunk on 11-May-2008; perhaps the OP
          is using a production release which would have the previous version,
          which is merely "newsize = size + 1024;".

          Do people really generate 32MB pyc files, or is stopping doubling at
          32MB just a safety valve in case someone/something runs amok?

          Cheers,
          John


          • Peter Otten

            #6
            Re: marshal.dumps quadratic growth and marshal.dump not allowing file-like objects

            John Machin wrote:
            > > Here's how marshal resizes the string:
            > >
            > > newsize = size + size + 1024;
            > > if (newsize > 32*1024*1024) {
            > >     newsize = size + 1024*1024;
            > > }
            > >
            > > Maybe you can split your large objects and marshal multiple objects to
            > > keep the size below the 32MB limit.
            >
            > But that change went into the svn trunk on 11-May-2008; perhaps the OP
            > is using a production release which would have the previous version,
            > which is merely "newsize = size + 1024;".

            That is indeed much worse. Depending on what the OP means by "large objects"
            the problem may be fixed in subversion then.

            > Do people really generate 32MB pyc files, or is stopping doubling at
            > 32MB just a safety valve in case someone/something runs amok?

            A 32MB pyc would correspond to a module of roughly the same size. So
            someone/something runs amok in either case.

            Peter


            • Christian Heimes

              #7
              Re: marshal.dumps quadratic growth and marshal.dump not allowing file-like objects

              Raymond Hettinger wrote:
              > When more space is needed, the resize operation over-allocates by
              > double the previous need plus 1K. This should give amortized O(1)
              > performance just like list.append().
              >
              > However, when that strategy requests more than 32MB, the resizing
              > becomes less aggressive and grows only in 1MB blocks, giving your
              > observed nasty quadratic behavior.

              The marshal code has been revamped in Python 2.6. The old code in Python
              2.5 uses a linear growth strategy:

                  size = PyString_Size(p->str);
                  newsize = size + 1024;
                  if (_PyString_Resize(&p->str, newsize) != 0) {
                      p->ptr = p->end = NULL;
                  }

              Anyway marshal should not be used by user code to serialize objects.
              It's only meant for Python byte code. Please use the pickle/cPickle
              module instead.
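
For comparison, a small sketch of the pickle route with an in-memory file-like object, which is what the original poster wanted from marshal.dump(). This is Python 3 code (io.BytesIO stands in for the thread's cStringIO.StringIO, and pickle for cPickle), and the `record` payload is invented for the example. As with any pickle use, only load data from sources you trust.

```python
import io
import pickle

# Hypothetical payload made of plain built-in types.
record = {"ids": list(range(1000)), "name": "example", "ratio": 0.5}

# pickle.dump() accepts any object with a write() method,
# so an in-memory buffer works as well as a real file.
buf = io.BytesIO()
pickle.dump(record, buf, protocol=pickle.HIGHEST_PROTOCOL)

# Rewind and read the object back from the same buffer.
buf.seek(0)
restored = pickle.load(buf)
```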

              Christian


              • bkustel@gmail.com

                #8
                Re: marshal.dumps quadratic growth and marshal.dump not allowing file-like objects

                On Jun 15, 3:16 am, John Machin <sjmac...@lexicon.net> wrote:
                > But that change went into the svn trunk on 11-May-2008; perhaps the OP
                > is using a production release which would have the previous version,
                > which is merely "newsize = size + 1024;".
                >
                > Do people really generate 32MB pyc files, or is stopping doubling at
                > 32MB just a safety valve in case someone/something runs amok?

                Indeed. I (the OP) am using a production release which has the 1K
                linear growth. I am seeing the problems with ~5MB and ~10MB sizes.
                Apparently this will be improved greatly in Python 2.6, at least up to
                the 32MB limit.

                Thanks all for responding.


                • John Machin

                  #9
                  Re: marshal.dumps quadratic growth and marshal.dump not allowing file-like objects

                  On Jun 16, 1:08 am, bkus...@gmail.com wrote:
                  > On Jun 15, 3:16 am, John Machin <sjmac...@lexicon.net> wrote:
                  > > But that change went into the svn trunk on 11-May-2008; perhaps the OP
                  > > is using a production release which would have the previous version,
                  > > which is merely "newsize = size + 1024;".
                  > >
                  > > Do people really generate 32MB pyc files, or is stopping doubling at
                  > > 32MB just a safety valve in case someone/something runs amok?
                  >
                  > Indeed. I (the OP) am using a production release which has the 1K
                  > linear growth.
                  > I am seeing the problems with ~5MB and ~10MB sizes.
                  > Apparently this will be improved greatly in Python 2.6, at least up to
                  > the 32MB limit.

                  Apparently you intend to resist good advice and persist [accidental
                  pun!] with marshal -- how much slower is cPickle for various sizes of
                  data? What kinds of objects are you persisting?
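
One way to answer that question for a given data set is to time both serializers directly. The following is a hedged sketch in Python 3 (where pickle already uses the C implementation that cPickle provided in Python 2); the payload shape built by `make_payload` and the helper name `time_dumps` are assumptions for illustration, not the original poster's data.

```python
import marshal
import pickle
import timeit


def make_payload(n):
    """Build a list of n small tuples made only of marshal-friendly types."""
    return [(i, float(i), str(i)) for i in range(n)]


def time_dumps(n, repeat=3):
    """Return the best-of-`repeat` wall time in seconds for each serializer."""
    payload = make_payload(n)
    return {
        "marshal": min(timeit.repeat(
            lambda: marshal.dumps(payload), number=1, repeat=repeat)),
        "pickle": min(timeit.repeat(
            lambda: pickle.dumps(payload, pickle.HIGHEST_PROTOCOL),
            number=1, repeat=repeat)),
    }


if __name__ == "__main__":
    for n in (10_000, 100_000):
        timings = time_dumps(n)
        print(n, "items:", timings)
```

The numbers depend heavily on the Python version and on the kinds of objects involved, so results from this payload should not be generalized.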


                  • Raymond Hettinger

                    #10
                    Re: marshal.dumps quadratic growth and marshal.dump not allowing file-like objects

                    On Jun 15, 8:08 am, bkus...@gmail.com wrote:
                    > Indeed. I (the OP) am using a production release which has the 1K
                    > linear growth.
                    > I am seeing the problems with ~5MB and ~10MB sizes.
                    > Apparently this will be improved greatly in Python 2.6, at least up to
                    > the 32MB limit.

                    I've just fixed this for Py2.5.3 and Py2.6. No more quadratic
                    behavior.


                    Raymond


                    • Aaron Watters

                      #11
                      Re: marshal.dumps quadratic growth and marshal.dump not allowing file-like objects

                      > Anyway marshal should not be used by user code to serialize objects.
                      > It's only meant for Python byte code. Please use the pickle/cPickle
                      > module instead.
                      >
                      > Christian

                      Just for yucks let me point out that marshal has
                      no real security concerns of interest to the non-paranoid,
                      whereas pickle is a security disaster waiting to happen
                      unless you are extremely cautious... yet again.
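
A small illustration of the pickle concern, as a sketch in Python 3. The class name `Sneaky` is made up, and the payload is deliberately harmless: it only evaluates an arithmetic expression. The point is that unpickling can call an arbitrary callable chosen by whoever produced the bytes, whereas marshal refuses to serialize arbitrary instances at all. Note that the marshal documentation also says the module is not intended to be secure against maliciously constructed data, so neither format should be loaded from untrusted sources.

```python
import marshal
import pickle


class Sneaky:
    """An object whose pickle tells the loader to call eval()."""

    def __reduce__(self):
        # (callable, args): pickle.loads() will call eval("6 * 7").
        return (eval, ("6 * 7",))


payload = pickle.dumps(Sneaky())
result = pickle.loads(payload)  # runs eval() while loading

# marshal only handles a fixed set of built-in types and rejects
# arbitrary instances with ValueError.
try:
    marshal.dumps(Sneaky())
    marshal_refused = False
except ValueError:
    marshal_refused = True
```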

                      Sorry, I know even a monkey learns after 3 times...

                      -- Aaron Watters
