Processing huge datasets

  • Anders Søndergaard

    Processing huge datasets

    Hi,

    I'm trying to process a large filesystem (+20 million files) and keep the
    directories along with summarized information about the files (sizes,
    modification times, newest file and the like) in an instance hierarchy
    in memory. I read the information from a Berkeley Database.

    I'm keeping it in a Left-Child-Right-Sibling instance structure, that I
    operate on recursively.

    First I banged my head on the recursion limit, which could luckily be
    adjusted.
    Now I simply get MemoryError.
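
    (The adjustment is just raising the interpreter's default limit,
    something along these lines:)

        import sys
        sys.setrecursionlimit(100000)  # the default is 1000, far too low for a deep tree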

    Is there a clever way of processing huge datasets in Python?
    How would a smart Python programmer approach the problem?

    I'm looking at rewriting the code to operate on parts of the hierarchy at a
    time and store the processed data structure in another Berkeley DB so I can
    query that afterwards. But I'd really prefer keeping all things in memory
    due to the huge performance gain.

    Any pointers?

    Cheers, Anders
  • Aahz

    #2
    Re: Processing huge datasets

    In article <79Knc.15953$k4.322398@news1.nokia.com>,
    Anders Søndergaard <anders.soendergaard@nokia.com> wrote:
    >
    >I'm trying to process a large filesystem (+20 million files) and keep the
    >directories along with summarized information about the files (sizes,
    >modification times, newest file and the like) in an instance hierarchy
    >in memory. I read the information from a Berkeley Database.
    >
    >I'm keeping it in a Left-Child-Right-Sibling instance structure, that I
    >operate on recursively.
    >
    >First I banged my head on the recursion limit, which could luckily be
    >adjusted.

    Well, Don't Do That. ;-)

    I don't understand the data structure you're describing; you'll either
    need to choose something more appropriate for recursive processing or
    switch to iterative processing (probably using generators).
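
    For what it's worth, here's a minimal sketch of what iterative
    (generator-based) traversal of a left-child/right-sibling tree might look
    like, using an explicit stack instead of recursion (the node layout here
    is made up for illustration):

        class Node:
            """LCRS node: first child + next sibling (illustrative layout)."""
            __slots__ = ('name', 'child', 'sibling', 'size')

            def __init__(self, name, size=0):
                self.name = name
                self.child = None      # first child
                self.sibling = None    # next sibling
                self.size = size

        def iter_nodes(root):
            """Yield every node in the tree without any recursion."""
            stack = [root]
            while stack:
                node = stack.pop()
                yield node
                if node.sibling is not None:
                    stack.append(node.sibling)
                if node.child is not None:
                    stack.append(node.child)

        # e.g. total volume:  sum(n.size for n in iter_nodes(root))
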
    --
    Aahz (aahz@pythoncraft.com) <*> http://www.pythoncraft.com/

    Adopt A Process -- stop killing all your children!


    • Thomas Guettler

      #3
      Re: Processing huge datasets

      On Mon, 10 May 2004 12:00:03 +0000, Anders Søndergaard wrote:
      > Hi,
      >
      > I'm trying to process a large filesystem (+20 million files) and keep the
      > directories along with summarized information about the files (sizes,
      > modification times, newest file and the like) in an instance hierarchy
      > in memory. I read the information from a Berkeley Database.
      >
      > I'm keeping it in a Left-Child-Right-Sibling instance structure, that I
      > operate on recursively.
      >
      > First I banged my head on the recursion limit, which could luckily be
      > adjusted.
      > Now I simply get MemoryError.
      >
      > Is there a clever way of processing huge datasets in Python?
      > How would a smart Python programmer approach the problem?

      Hi Anders,

      I use ZODB.
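
      For illustration only, a rough sketch of what keeping the per-directory
      summaries in ZODB could look like (assuming a reasonably recent ZODB;
      the class, file and key names are made up):

          import transaction
          from ZODB import DB
          from ZODB.FileStorage import FileStorage
          from persistent import Persistent
          from BTrees.OOBTree import OOBTree

          class DirSummary(Persistent):
              """Summary record for one directory (illustrative fields)."""
              def __init__(self, n_files=0, volume=0, newest=0):
                  self.n_files = n_files
                  self.volume = volume
                  self.newest = newest

          db = DB(FileStorage('summaries.fs'))
          conn = db.open()
          root = conn.root()
          if 'dirs' not in root:
              root['dirs'] = OOBTree()          # path -> DirSummary
          root['dirs']['/export/home'] = DirSummary(42, 123456, 1084190403)
          transaction.commit()
          conn.close()
          db.close()

      Only the objects you actually touch get loaded into memory, so the whole
      hierarchy never has to fit in RAM at once.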


      HTH,
      Thomas


      • Dennis Lee Bieber

        #4
        Re: Processing huge datasets

        On Mon, 10 May 2004 12:00:03 GMT, Anders Søndergaard
        <anders.soendergaard@nokia.com> declaimed the following in
        comp.lang.python:
        > Hi,
        >
        > I'm trying to process a large filesystem (+20 million files) and keep the
        > directories along with summarized information about the files (sizes,
        > modification times, newest file and the like) in an instance hierarchy
        > in memory. I read the information from a Berkeley Database.
        >
        Assuming the modification time is a 32-bit integer, and sizes
        are 32-bit integers, you've got 8 bytes per file right there... or
        160MB just for the timestamp/size for your 20 million files... Add in
        overhead for the data structures themselves (are you also keeping file
        names? Even a fixed 8.3 format -- no count/terminator/"." -- will add
        another 220MB, giving 380MB without overhead).
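
        A quick back-of-the-envelope check of those figures (decimal megabytes,
        rough numbers only):

            n_files = 20 * 10**6
            per_file = 4 + 4                     # 32-bit timestamp + 32-bit size
            print(n_files * per_file // 10**6)   # 160 MB for timestamps/sizes alone
            print(n_files * 11 // 10**6)         # another 220 MB for fixed 8.3 names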

        --
        > ============================================================== <
        > wlfraed@ix.netcom.com   | Wulfraed  Dennis Lee Bieber  KD6MOG  <
        > wulfraed@dm.net         |        Bestiaria Support Staff       <
        > ============================================================== <
        > Home Page: <http://www.dm.net/~wulfraed/>                      <
        > Overflow Page: <http://wlfraed.home.netcom.com/>               <


        • Terry Reedy

          #5
          Re: Processing huge datasets


          "Anders Søndergaard" <anders.soender gaard@nokia.com > wrote in message
          news:79Knc.1595 3$k4.322398@new s1.nokia.com...[color=blue]
          > I'm trying to process a large filesystem (+20 million files) and keep the
          > directories along with summarized information about the files (sizes,
          > modification times, newest file and the like) in an instance hierarchy
          > in memory. I read the information from a Berkeley Database.

          > Is there a clever way of processing huge datasets in Python?
          > How would a smart Python programmer approach the problem?

          I would start with 2 gigs of RAM, which would allow about 90 bytes per
          entry after allowing 200 megs for the OS and interpreter. Even that might
          not be enough.
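
          If the budget really is about 90 bytes per entry, ordinary instances
          with a per-instance __dict__ will eat it quickly; __slots__ is one way
          to trim the overhead. A minimal sketch (field names made up for
          illustration):

              class FileEntry(object):
                  # No per-instance __dict__; only the named slots are stored.
                  __slots__ = ('mtime', 'size')

                  def __init__(self, mtime, size):
                      self.mtime = mtime
                      self.size = size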

          tjr





          • Anders S. Jensen

            #6
            Re: Processing huge datasets

            Aahz <aahz@pythoncraft.com> wrote:
            > Well, Don't Do That. ;-)
            >
            > I don't understand the data structure you're describing; you'll either
            > need to choose something more appropriate for recursive processing or
            > switch to iterative processing (probably using generators).

            Let me explain the problem a bit more in detail.
            (Sorry for confusing things by using my private email address)

            I'm trying to produce a system that will analyze a filesystem
            for 'dead leaves', large data consumption, per-UID volume consumption,
            trend analysis, etc.

            To do that I first save each directory along with the sum of C, M and A
            timestamps for the files in the directory (all in epoch secs.), the number
            of files, the volume of files, a list of UIDs, a list of GIDs, a dict of
            volume per UID, a dict of volume per GID, and the newest file.

            Then I build a hierarchy of instances, each of which knows its parent,
            children and siblings.
            Each object is populated with the summarized file information.
            When that is done, I traverse the hierarchy from the bottom up,
            accumulating average C, M and A times, Volumes and number of files.

            This hierarchy allows me to instantly query, say, the average
            modification time for any given point in the directory structure and
            below. That'll show where files that haven't been modified in a long time
            hide, and how much space they take, among other things.

            The LCRS tree lends itself very well for recursion in terms of beauty
            and elegance of the code. However keeping that amount of data in memory
            obviously just doesn't fly.

            I'm at a point where the system actually just *barely* might work. I'll
            know tomorrow when it's done. But the system might be used to work on
            much larger filesystems, and then the party is over.

            I'm looking for the best suited way of attacking the problem. Is it
            a memory-mapped file? A Berkeley DB? A specially crafted metadata
            filesystem? (That would be fun, but probably overkill. :) Or something
            completely different?
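
            As a very rough sketch of the "keep it on disk, keyed by path" direction
            I'm considering, the per-directory summaries could go into the standard
            shelve module instead of a raw Berkeley DB (the file name and record
            fields below are made up):

                import shelve

                # One pickled summary record per directory, keyed by path.
                summaries = shelve.open('dir_summaries.db', flag='c')
                summaries['/export/home'] = {
                    'n_files': 42,
                    'volume': 123456,
                    'sum_mtime': 21768198426,
                    'newest': 1084190403,
                }

                record = summaries['/export/home']
                print(record['volume'])
                summaries.close()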

            The processing might be switched to iterative, but that's a 'minor' concern.
            The main problem is how to handle the data in the fastest possible way.

            Thanks for your pointers!

            Cheers,
            Anders


            • William Park

              #7
              Re: Processing huge datasets

              Anders S. Jensen <doozer@freakout.dk> wrote:
              > I'm trying to produce a system that will analyze a filesystem for
              > 'dead leaves', large data consumption, per-UID volume consumption,
              > trend analysis, etc.
              >
              > To do that I first save each directory along with the sum of C, M and A
              > timestamps for the files in the directory (all in epoch secs.), the number
              > of files, the volume of files, a list of UIDs, a list of GIDs, a dict of
              > volume per UID, a dict of volume per GID, and the newest file.
              >
              > Then I build a hierarchy of instances, each of which knows its parent,
              > children and siblings. Each object is populated with the summarized
              > file information. When that is done, I traverse the hierarchy from
              > the bottom up, accumulating average C, M and A times, Volumes and
              > number of files.
              >
              > This hierarchy allows me to instantly query, say, the average
              > modification time for any given point in the directory structure and
              > below. That'll show where files that haven't been modified in a long
              > time hide, and how much space they take, among other things.
              >
              > The LCRS tree lends itself very well for recursion in terms of beauty
              > and elegance of the code. However keeping that amount of data in
              > memory obviously just doesn't fly.
              >
              > I'm at a point where the system actually just *barely* might work.
              > I'll know tomorrow when it's done. But the system might be used to
              > work on much larger filesystems, and then the party is over.
              >
              > I'm looking for the best suited way of attacking the problem. Is it
              > a memory-mapped file? A Berkeley DB? A specially crafted metadata
              > filesystem? (That would be fun, but probably overkill. :) Or something
              > completely different?

              How about a good old text file in the filesystem, i.e.
              /some/dir/.timestamp
              ? This way, you don't use up memory, you get a hierarchical data structure
              automatically, no need to mess with a database interface, you get a simple
              search mechanism, etc. And, most of all, no need to ask other people who
              have absolutely no idea what you're talking about... :-)
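
              For illustration, a tiny sketch of that per-directory summary-file idea
              (the file name and record format are just examples):

                  import os

                  def write_summaries(top):
                      """Drop a small '.summary' file into every directory under top."""
                      for dirpath, dirnames, filenames in os.walk(top):
                          total, newest, count = 0, 0, 0
                          for name in filenames:
                              if name == '.summary':
                                  continue        # skip summaries from a previous run
                              st = os.stat(os.path.join(dirpath, name))
                              total += st.st_size
                              newest = max(newest, st.st_mtime)
                              count += 1
                          out = open(os.path.join(dirpath, '.summary'), 'w')
                          out.write('%d %d %d\n' % (count, total, newest))
                          out.close()
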
              >
              > The processing might be switched to iterative, but that's a 'minor'
              > concern. The main problem is how to handle the data in the fastest
              > possible way.
              >
              > Thanks for your pointers!
              >
              > Cheers, Anders

              --
              William Park, Open Geometry Consulting, <opengeometry@yahoo.ca>
              Linux solution/training/migration, Thin-client
