Possible to set cpython heap size?

  • Andy Watson

    Possible to set cpython heap size?

    I have an application that scans and processes a bunch of text files.
    The content I'm pulling out and holding in memory is at least 200MB.

    I'd love to be able to tell the CPython virtual machine that I need a
    heap of, say, 300MB up front rather than have it grow as needed. I've
    had a scan through the archives of comp.lang.python and the python
    docs but cannot find a way to do this. Is it possible to configure
    the PVM this way?

    Much appreciated,
    Andy
    --

  • Diez B. Roggisch

    #2
    Re: Possible to set cpython heap size?

    Andy Watson wrote:
    > I have an application that scans and processes a bunch of text files.
    > The content I'm pulling out and holding in memory is at least 200MB.
    >
    > I'd love to be able to tell the CPython virtual machine that I need a
    > heap of, say, 300MB up front rather than have it grow as needed. I've
    > had a scan through the archives of comp.lang.python and the python
    > docs but cannot find a way to do this. Is it possible to configure
    > the PVM this way?

    Why do you want that? And no, it is not possible. And to be honest: I have
    no idea why e.g. the JVM allows for this.

    Diez


    • Andy Watson

      #3
      Re: Possible to set cpython heap size?

      > Why do you want that? And no, it is not possible. And to be honest: I have
      > no idea why e.g. the JVM allows for this.
      >
      > Diez

      The reason why is simply that I know roughly how much memory I'm going
      to need, and cpython seems to be taking a fair amount of time
      extending its heap as I read in content incrementally.

      Ta,
      Andy
      --


      • Diez B. Roggisch

        #4
        Re: Possible to set cpython heap size?

        Andy Watson wrote:
        >> Why do you want that? And no, it is not possible. And to be honest: I have
        >> no idea why e.g. the JVM allows for this.
        >
        > The reason why is simply that I know roughly how much memory I'm going
        > to need, and cpython seems to be taking a fair amount of time
        > extending its heap as I read in content incrementally.

        I'm not an expert in python malloc schemes, I know that _some_ things are
        heavily optimized, but I'm not aware that it does some clever
        self-management of heap in the general case. Which would be complicated in
        the presence of arbitrary C extensions anyway.


        However, I'm having doubts that your observation is correct. A simple

        python -m timeit -n 1 -r 1 "range(50000000)"
        1 loops, best of 1: 2.38 sec per loop

        will create a python-process of half a gig ram - for a split-second - and I
        don't consider 2.38 seconds a fair amount of time for heap allocation.

        When I used a 4 times larger argument, my machine began swapping. THEN
        things became ugly - but I don't see how preallocation will help there...

        Diez


        • Irmen de Jong

          #5
          Re: Possible to set cpython heap size?

          Andy Watson wrote:
          >> Why do you want that? And no, it is not possible. And to be honest: I have
          >> no idea why e.g. the JVM allows for this.
          >>
          >> Diez
          >
          > The reason why is simply that I know roughly how much memory I'm going
          > to need, and cpython seems to be taking a fair amount of time
          >                                            ^^^^^
          > extending its heap as I read in content incrementally.

          First make sure this is really the case.
          It may be that you are just using an inefficient algorithm.
          In my experience allocating extra heap memory is hardly ever
          noticeable. Unless your system is out of physical RAM and has
          to swap.

          --Irmen
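
          Irmen's "check the algorithm first" advice is worth taking literally:
          a common culprit in this kind of file-scanning code is building one
          big string by repeated concatenation instead of ''.join(). A minimal
          sketch for timing the two approaches (the chunk count and sizes are
          made up for illustration):

```python
# Compare quadratic repeated concatenation against the linear
# ''.join() idiom. Chunk count and size here are arbitrary.
import timeit

def build_by_concat(chunks):
    s = ''
    for c in chunks:
        s = s + c            # may copy the accumulated string each time
    return s

def build_by_join(chunks):
    return ''.join(chunks)   # one pass, one final allocation

chunks = ['x' * 10000] * 500     # ~5MB of text in 10kB pieces

print('concat: %.3fs' % timeit.timeit(lambda: build_by_concat(chunks), number=3))
print('join:   %.3fs' % timeit.timeit(lambda: build_by_join(chunks), number=3))
```

          (CPython can sometimes resize the left operand in place, so the gap
          is not always dramatic; measuring on your own data is the point.)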


          • Chris Mellon

            #6
            Re: Possible to set cpython heap size?

            On 22 Feb 2007 09:52:49 -0800, Andy Watson <aldcwatson@gmail.com> wrote:
            >> Why do you want that? And no, it is not possible. And to be honest: I have
            >> no idea why e.g. the JVM allows for this.
            >>
            >> Diez
            >
            > The reason why is simply that I know roughly how much memory I'm going
            > to need, and cpython seems to be taking a fair amount of time
            > extending its heap as I read in content incrementally.

            To my knowledge, no modern OS actually commits any memory at all to a
            process until it is written to. Pre-extending the heap would either a)
            do nothing, because it'd be essentially a no-op, or b) take at least
            as long as doing it incrementally (because Python would need to
            fill up all that space with objects), without giving you any actual
            performance gain when you fill the object space "for real".

            In Java, as I understand it, having a fixed-size heap allows some
            optimizations in the garbage collector. Python's GC model is different
            and, as far as I know, is unlikely to benefit from this.
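
            Chris's commit-on-write point can be illustrated with an anonymous
            mmap (a sketch; the exact accounting is OS-specific): reserving
            256MB of address space succeeds immediately, and the OS only backs
            a page with real memory once something is written to it.

```python
# Reserve a large block of address space without committing memory.
# Anonymous mappings (fd = -1) are typically backed by zero pages that
# the OS commits lazily, on first write.
import mmap

SIZE = 256 * 1024 * 1024        # 256MB of address space

m = mmap.mmap(-1, SIZE)         # reserved, but mostly not yet committed
m[0:5] = b'hello'               # touching a page is what commits it
print(m[0:5])                   # b'hello'
m.close()
```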


            • Jussi Salmela

              #7
              Re: Possible to set cpython heap size?

              Andy Watson wrote:
              > I have an application that scans and processes a bunch of text files.
              > The content I'm pulling out and holding in memory is at least 200MB.
              >
              > I'd love to be able to tell the CPython virtual machine that I need a
              > heap of, say, 300MB up front rather than have it grow as needed. I've
              > had a scan through the archives of comp.lang.python and the python
              > docs but cannot find a way to do this. Is it possible to configure
              > the PVM this way?
              >
              > Much appreciated,
              > Andy

              Others have already suggested swap as a possible cause of slowness. I've
              been playing with my portable (dual Intel T2300 @ 1.66 GHz; 1 GB of RAM;
              Win XP; Python Scripter IDE) using the following code:

              #=======================
              import datetime

              '''
              # Create 10 files with sizes 1MB, ..., 10MB
              for i in range(1, 11):
                  print 'Writing: ' + 'Bytes_' + str(i*1000000)
                  f = open('Bytes_' + str(i*1000000), 'w')
                  f.write(str(i-1)*i*1000000)
                  f.close()
              '''

              # Read the files 5 times, concatenating the contents
              # into one HUGE string
              now_1 = datetime.datetime.now()
              s = ''
              for count in range(5):
                  for i in range(1, 11):
                      print 'Reading: ' + 'Bytes_' + str(i*1000000)
                      f = open('Bytes_' + str(i*1000000), 'r')
                      s = s + f.read()
                      f.close()
                      print 'Size of s is', len(s)
              print 's[274999999] = ' + s[274999999]
              now_2 = datetime.datetime.now()
              print now_1
              print now_2
              raw_input('???')
              #=======================

              The part at the start that is commented out is the part I used to create
              the 10 files. The second part prints the following output (abbreviated):

              Reading: Bytes_1000000
              Size of s is 1000000
              Reading: Bytes_2000000
              Size of s is 3000000
              Reading: Bytes_3000000
              Size of s is 6000000
              Reading: Bytes_4000000
              Size of s is 10000000
              Reading: Bytes_5000000
              Size of s is 15000000
              Reading: Bytes_6000000
              Size of s is 21000000
              Reading: Bytes_7000000
              Size of s is 28000000
              Reading: Bytes_8000000
              Size of s is 36000000
              Reading: Bytes_9000000
              Size of s is 45000000
              Reading: Bytes_10000000
              Size of s is 55000000
              <snip>
              Reading: Bytes_9000000
              Size of s is 265000000
              Reading: Bytes_10000000
              Size of s is 275000000
              s[274999999] = 9
              2007-02-22 20:23:09.984000
              2007-02-22 20:23:21.515000

              As can be seen, creating a 275 MB string by reading the parts from the
              files took less than 12 seconds. I think this is fast enough, but others
              might disagree! ;)

              Using the Win Task Manager I can see the process grow to a little
              less than 282 MB when it reaches the raw_input call, and drop to less
              than 13 MB a little after I've given some input, apparently as a result
              of PyScripter doing a GC.

              Your situation (hardware, file sizes etc.) may differ so that my
              experiment does not correspond to it, but this was my 2 cents' worth!

              HTH,
              Jussi


              • Andy Watson

                #8
                Re: Possible to set cpython heap size?

                On Feb 22, 10:53 am, a bunch of folks wrote:
                > Memory is basically free.

                This is true if you are simply scanning a file into memory. However,
                I'm storing the contents in some in-memory data structures and doing
                some data manipulation. This is my speculation:

                Several small objects per scanned line get allocated, and then
                unreferenced. If the heap is relatively small, GC has to do some work
                in order to make space for subsequent scan results. At some point, it
                realises it cannot keep up and has to extend the heap. At this point,
                VM and physical memory is committed, since it needs to be used. And
                this keeps going on. At some point, GC will take a good deal of time
                to compact the heap, since I am loading in so much data and creating
                a lot of smaller objects.

                If I could have a heap that is larger and does not need to be
                dynamically extended, then the Python GC could work more efficiently.

                Interesting discussion.
                Cheers,
                Andy
                --


                • Chris Mellon

                  #9
                  Re: Possible to set cpython heap size?

                  On 22 Feb 2007 11:28:52 -0800, Andy Watson <aldcwatson@gmail.com> wrote:
                  > On Feb 22, 10:53 am, a bunch of folks wrote:
                  >
                  >> Memory is basically free.
                  >
                  > This is true if you are simply scanning a file into memory. However,
                  > I'm storing the contents in some in-memory data structures and doing
                  > some data manipulation. This is my speculation:
                  >
                  > Several small objects per scanned line get allocated, and then
                  > unreferenced. If the heap is relatively small, GC has to do some work
                  > in order to make space for subsequent scan results. At some point, it
                  > realises it cannot keep up and has to extend the heap. At this point,
                  > VM and physical memory is committed, since it needs to be used. And
                  > this keeps going on. At some point, GC will take a good deal of time
                  > to compact the heap, since I am loading in so much data and creating
                  > a lot of smaller objects.
                  >
                  > If I could have a heap that is larger and does not need to be
                  > dynamically extended, then the Python GC could work more efficiently.

                  I haven't even looked at Python memory management internals since 2.3,
                  and not in detail then, so I'm sure someone will correct me in the
                  case that I am wrong.

                  However, I believe that this is almost exactly how CPython GC does not
                  work. CPython is refcounted with a generational GC for cycle
                  detection. There's a memory pool that is used for object allocation
                  (more than one, I think, for different types of objects) and those can
                  be extended but they are not, to my knowledge, compacted.

                  If you're creating the same small objects for each scanned line, and
                  especially if they are tuples or new-style objects with __slots__,
                  then the memory use for those objects should be more or less constant.
                  Your memory growth is probably related to the information you're
                  saving, not to your scanned objects, and since those are long-lived
                  objects I simply don't see how heap pre-allocation could be helpful
                  there.
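
                  Chris's __slots__ remark as a sketch (the record class and its
                  fields are invented for illustration): a slotted class drops
                  the per-instance __dict__, so each parsed-line record has a
                  small, fixed footprint.

```python
# Per-line record objects: with __slots__ the instance has no __dict__,
# so its size is fixed and small; without it, every instance carries a
# dict as well. (Record fields here are hypothetical.)
import sys

class PlainRecord(object):
    def __init__(self, line_no, text):
        self.line_no = line_no
        self.text = text

class SlimRecord(object):
    __slots__ = ('line_no', 'text')
    def __init__(self, line_no, text):
        self.line_no = line_no
        self.text = text

plain = PlainRecord(1, 'some scanned line')
slim = SlimRecord(1, 'some scanned line')

print(sys.getsizeof(plain) + sys.getsizeof(plain.__dict__))  # object + its dict
print(sys.getsizeof(slim))                                   # no dict to pay for
```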


                  • Martin v. Löwis

                    #10
                    Re: Possible to set cpython heap size?

                    Andy Watson wrote:
                    > I have an application that scans and processes a bunch of text files.
                    > The content I'm pulling out and holding in memory is at least 200MB.
                    >
                    > I'd love to be able to tell the CPython virtual machine that I need a
                    > heap of, say, 300MB up front rather than have it grow as needed. I've
                    > had a scan through the archives of comp.lang.python and the python
                    > docs but cannot find a way to do this. Is it possible to configure
                    > the PVM this way?

                    You can configure your operating system. On Unix, do 'ulimit -m 200000'.

                    Regards,
                    Martin
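
                    The same kind of cap can also be set from inside a running
                    process on Unix with the resource module (a sketch; note
                    that, like ulimit, this limits total address space rather
                    than preallocating anything, and the 4GB figure is an
                    arbitrary example):

```python
# Programmatic equivalent of a shell 'ulimit' (Unix only): cap the
# process's address space. The 4GB cap is an arbitrary example value.
import resource

soft, hard = resource.getrlimit(resource.RLIMIT_AS)

four_gb = 4 * 1024 * 1024 * 1024
# Never raise the soft limit above its current value or the hard limit.
new_soft = four_gb if soft == resource.RLIM_INFINITY else min(soft, four_gb)
resource.setrlimit(resource.RLIMIT_AS, (new_soft, hard))

print(resource.getrlimit(resource.RLIMIT_AS))
```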


                    • Andrew MacIntyre

                      #11
                      Re: Possible to set cpython heap size?

                      Chris Mellon wrote:
                      > On 22 Feb 2007 11:28:52 -0800, Andy Watson <aldcwatson@gmail.com> wrote:
                      >> On Feb 22, 10:53 am, a bunch of folks wrote:
                      >>
                      >>> Memory is basically free.
                      >>
                      >> This is true if you are simply scanning a file into memory. However,
                      >> I'm storing the contents in some in-memory data structures and doing
                      >> some data manipulation. This is my speculation:
                      >>
                      >> Several small objects per scanned line get allocated, and then
                      >> unreferenced. If the heap is relatively small, GC has to do some work
                      >> in order to make space for subsequent scan results. At some point, it
                      >> realises it cannot keep up and has to extend the heap. At this point,
                      >> VM and physical memory is committed, since it needs to be used. And
                      >> this keeps going on. At some point, GC will take a good deal of time
                      >> to compact the heap, since I am loading in so much data and creating
                      >> a lot of smaller objects.
                      >>
                      >> If I could have a heap that is larger and does not need to be
                      >> dynamically extended, then the Python GC could work more efficiently.
                      >
                      > I haven't even looked at Python memory management internals since 2.3,
                      > and not in detail then, so I'm sure someone will correct me in the
                      > case that I am wrong.
                      >
                      > However, I believe that this is almost exactly how CPython GC does not
                      > work. CPython is refcounted with a generational GC for cycle
                      > detection. There's a memory pool that is used for object allocation
                      > (more than one, I think, for different types of objects) and those can
                      > be extended but they are not, to my knowledge, compacted.
                      >
                      > If you're creating the same small objects for each scanned line, and
                      > especially if they are tuples or new-style objects with __slots__,
                      > then the memory use for those objects should be more or less constant.
                      > Your memory growth is probably related to the information you're
                      > saving, not to your scanned objects, and since those are long-lived
                      > objects I simply don't see how heap pre-allocation could be helpful
                      > there.

                      Python's internal memory management is split:
                      - allocations up to 256 bytes (the majority of objects) are handled by
                        a custom allocator, which uses 256kB arenas malloc()ed from the OS on
                        demand. With 2.5 some additional work was done to allow returning
                        completely empty arenas to the OS; 2.3 and 2.4 don't return arenas at
                        all.
                      - all allocations over 256 bytes, including container objects that are
                        extended beyond 256 bytes, are made by malloc().

                      I can't recall off-hand whether the free-list structures for ints (and
                      floats?) use the Python allocator or direct malloc(); as the free-lists
                      don't release any entries, I suspect not.

                      The maximum allocation size and arena size used by the Python allocator
                      are hard-coded for algorithmic and performance reasons, and cannot
                      practically be changed, especially at runtime. No active compaction
                      takes place in arenas, even with GC. The only time object data is
                      relocated between arenas is when an object is resized.

                      If Andy Watson is creating loads of objects that aren't being managed
                      by Python's allocator (by being larger than 256 bytes, or in a type
                      free-list), then the platform malloc() behaviour applies. Some platform
                      allocators can be tuned via environment variables and the like, in which
                      case review of the platform documentation is indicated.

                      Some platform allocators are notorious for poor behaviour in certain
                      circumstances, and coalescing blocks while deallocating is one
                      particularly nasty problem for code that creates and destroys lots
                      of small variably sized objects.

                      --
                      -------------------------------------------------------------------------
                      Andrew I MacIntyre               "These thoughts are mine alone..."
                      E-mail: andymac@bullseye.apana.org.au (pref) | Snail: PO Box 370
                              andymac@pcug.org.au (alt)           |        Belconnen ACT 2616
                      Web:    http://www.andymac.org/             |        Australia


                      • Tony Nelson

                        #12
                        Re: Possible to set cpython heap size?

                        In article <1172172532.503432.223650@v33g2000cwv.googlegroups.com>,
                        "Andy Watson" <aldcwatson@gmail.com> wrote:

                        > ...
                        > If I could have a heap that is larger and does not need to be
                        > dynamically extended, then the Python GC could work more efficiently.
                        > ...

                        GC! If you're allocating lots of objects and holding on to them, GC
                        will run frequently, but won't find anything to free. Maybe you want to
                        turn off GC, at least some of the time? See the gc module, esp.
                        set_threshold().

                        Note that the cyclic GC is only really a sort of safety net for
                        reference loops, as normally objects are free'd when their last
                        reference is lost.
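
                        A sketch of what that looks like in practice (the
                        parse_line helper is invented): disable the cyclic
                        collector around the bulk load, since reference
                        counting alone already frees non-cyclic garbage, then
                        restore it afterwards; or raise the thresholds instead
                        of disabling outright.

```python
# Bulk-load with the cyclic GC paused. Reference counting still frees
# ordinary garbage; only cycle detection is suspended. parse_line is a
# made-up stand-in for the real per-line processing.
import gc

def parse_line(line):
    return (len(line), line.strip())

def load(lines):
    gc.disable()
    try:
        return [parse_line(l) for l in lines]
    finally:
        gc.enable()          # always restore the collector

records = load(['alpha\n', 'beta\n'])
print(records)               # [(6, 'alpha'), (5, 'beta')]
print(gc.isenabled())        # True

# Gentler alternative: make generation-0 collections much rarer.
# gc.set_threshold(100000)
```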
                        ____________________________________________________________________
                        TonyN.:'       *firstname*nlsnews@georgea*lastname*.com
                              '        <http://www.georgeanelson.com/>
