10GB XML Blows out Memory, Suggestions?

  • axwack@gmail.com

    #16
    Re: 10GB XML Blows out Memory, Suggestions?

    The file is an XML dump from Goldmine. I have built a document parser
    that allows for the population of data from Goldmine into SugarCRM. The
    client's data set is 10 GB.

    Felipe Almeida Lessa wrote:
    > On Tue, 2006-06-06 at 13:56 +0000, Paul McGuire wrote:
    > > (just can't open it up like a text file)
    >
    > Who'll open a 10 GiB file anyway?
    >
    > --
    > Felipe.

    • Fredrik Lundh

      #17
      Re: 10GB XML Blows out Memory, Suggestions?

      gregarican wrote:

      > 10 gigs? Wow, even using SAX I would imagine that you would be pushing
      > the limits of reasonable performance.

      depends on how you define "reasonable", of course. modern computers are
      quite fast:

      > dir data.xml

      2006-06-06 21:35 1 002 000 015 data.xml
      1 File(s) 1 002 000 015 bytes

      > more test.py
      from xml.etree import cElementTree as ET
      import time

      t0 = time.time()

      for event, elem in ET.iterparse("data.xml"):
          if elem.tag == "item":
              elem.clear()

      print time.time() - t0

      gives me timings between 27.1 and 49.1 seconds over 5 runs.

      (Intel Dual Core T2300, slow laptop disks, 1000000 XML "item" elements
      averaging 1000 bytes each, bundled cElementTree, peak memory usage 33 MB.
      your mileage may vary.)

      </F>
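
On a current Python 3, the same streaming idea can be sketched roughly as below. This is an illustration under assumptions, not the code that was timed above: it uses `xml.etree.ElementTree` (recent releases no longer ship a separate cElementTree module), the helper name `count_items` is made up, and a small in-memory document stands in for data.xml. It also clears the root after each item, because `elem.clear()` on its own leaves the emptied elements attached to the root.

```python
import io
import xml.etree.ElementTree as ET


def count_items(source, tag="item"):
    """Stream through `source` and count `tag` elements without keeping the tree."""
    count = 0
    # Request "start" events as well, so the very first event hands us the root.
    context = iter(ET.iterparse(source, events=("start", "end")))
    _, root = next(context)
    for event, elem in context:
        if event == "end" and elem.tag == tag:
            count += 1
            # Detach finished children from the root so memory use stays flat.
            root.clear()
    return count


# Tiny stand-in for data.xml; a real run would pass the file name instead.
sample = io.BytesIO(b"<items>" + b"<item><name>x</name></item>" * 1000 + b"</items>")
print(count_items(sample))
```

Whatever processing each record needs would go where the counter is incremented, before the element is thrown away.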

      • axwack@gmail.com

        #18
        Re: 10GB XML Blows out Memory, Suggestions?

        Paul,

        This is interesting. Unfortunately, I have no control over the XML
        output. The file is from Goldmine. However, you have given me an
        idea...

        Is it possible to read an XML document in compressed format?

        Paul McGuire wrote:
        > <axwack@gmail.com> wrote in message
        > news:1149594519.098115.8980@u72g2000cwu.googlegroups.com...
        > > I wrote a program that takes an XML file into memory using Minidom. I
        > > found out that the XML document is 10gb.
        > >
        > > I clearly need SAX or something else?
        > >
        >
        > You clearly need something instead of XML.
        >
        > This sounds like a case where a prototype, which worked for the developer's
        > simple test data set, blows up in the face of real user/production data.
        > XML adds lots of overhead for nested structures, when in fact, the actual
        > meat of the data can be relatively small. Note also that this XML overhead
        > is directly related to the verbosity of the XML designer's choice of tag
        > names, and whether the designer was predisposed to using XML elements over
        > attributes. Imagine a record structure for a 3D coordinate point (described
        > here in no particular coding language):
        >
        > struct ThreeDimPoint:
        >     xValue : integer,
        >     yValue : integer,
        >     zValue : integer
        >
        > Directly translated to XML gives:
        >
        > <ThreeDimPoint>
        >     <xValue>4</xValue>
        >     <yValue>5</yValue>
        >     <zValue>6</zValue>
        > </ThreeDimPoint>
        >
        > This expands 3 integers to a whopping 101 characters. Throw in namespaces
        > for good measure, and you inflate the data even more.
        >
        > Many Java folks treat XML attributes as anathema, but look how this cuts
        > down the data inflation:
        >
        > <ThreeDimPoint xValue="4" yValue="5" zValue="6"/>
        >
        > This is only 50 characters, or *only* 4 times the size of the contained data
        > (assuming 4-byte integers).
        >
        > Try zipping your 10Gb file, and see what kind of compression you get - I'll
        > bet it's close to 30:1. If so, convert the data to a real data storage
        > medium. Even a SQLite database table should do better, and you can ship it
        > around just like a file (just can't open it up like a text file).
        >
        > -- Paul
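
The element-versus-attribute comparison in the quoted message can be checked with a few lines of Python. This is only a sketch: the two strings are written without any whitespace or indentation, so the lengths it prints will not match the character counts quoted above exactly, and `read_point` is a made-up helper showing that both encodings carry the same three integers.

```python
import xml.etree.ElementTree as ET

# Element-per-field encoding, written without any whitespace.
as_elements = (
    "<ThreeDimPoint><xValue>4</xValue><yValue>5</yValue>"
    "<zValue>6</zValue></ThreeDimPoint>"
)
# Attribute encoding of the same record.
as_attributes = '<ThreeDimPoint xValue="4" yValue="5" zValue="6"/>'

FIELDS = ("xValue", "yValue", "zValue")


def read_point(xml_text):
    """Return (x, y, z) from either encoding of the record."""
    node = ET.fromstring(xml_text)
    if node.attrib:
        return tuple(int(node.get(name)) for name in FIELDS)
    return tuple(int(node.findtext(name)) for name in FIELDS)


print(len(as_elements), len(as_attributes))
print(read_point(as_elements) == read_point(as_attributes))
```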

        • gregarican

          #19
          Re: 10GB XML Blows out Memory, Suggestions?

          That's a good-sized Goldmine database. In past lives I have supported
          that app and recall that you could match the Goldmine front end against
          an SQL backend. If you can get to the underlying data utilizing SQL you
          can selectively port over sections of the database and might be able to
          attack things more methodically than parsing through a mongo XML file.
          Instead you could bulk insert portions of the Goldmine data into
          SugarCRM. Know what I mean?
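
A rough sketch of the "port it over in portions" idea, for the case where only the XML dump is available: stream the records and insert them in batches rather than all at once. Everything specific here is an assumption made for illustration. The `<contact>` record with `name` and `email` fields, the `contacts` table, and the use of an in-memory SQLite database as a stand-in for the real target database are all invented, since the thread does not show the Goldmine schema.

```python
import io
import sqlite3
import xml.etree.ElementTree as ET


def load_in_batches(source, conn, batch_size=500):
    """Stream <contact> records into SQLite, committing every batch_size rows."""
    conn.execute("CREATE TABLE IF NOT EXISTS contacts (name TEXT, email TEXT)")
    batch, total = [], 0
    for _, elem in ET.iterparse(source):  # default: "end" events only
        if elem.tag != "contact":
            continue
        batch.append((elem.findtext("name"), elem.findtext("email")))
        elem.clear()  # free the record's children once its fields are copied
        if len(batch) >= batch_size:
            conn.executemany("INSERT INTO contacts VALUES (?, ?)", batch)
            conn.commit()
            total += len(batch)
            batch = []
    if batch:  # flush the final partial batch
        conn.executemany("INSERT INTO contacts VALUES (?, ?)", batch)
        conn.commit()
        total += len(batch)
    return total


# Hypothetical sample data standing in for the real dump.
record = b"<contact><name>Ann</name><email>ann@example.com</email></contact>"
conn = sqlite3.connect(":memory:")
print(load_in_batches(io.BytesIO(b"<dump>" + record * 1200 + b"</dump>"), conn))
```

Committing per batch keeps each transaction small, so a failure partway through loses at most one batch of work.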

          axwack@gmail.com wrote:
          > The file is an XML dump from Goldmine. I have built a document parser
          > that allows for the population of data from Goldmine into SugarCRM. The
          > client's data set is 10 GB.
          >
          > Felipe Almeida Lessa wrote:
          > > On Tue, 2006-06-06 at 13:56 +0000, Paul McGuire wrote:
          > > > (just can't open it up like a text file)
          > >
          > > Who'll open a 10 GiB file anyway?
          > >
          > > --
          > > Felipe.

          • John J. Lee

            #20
            Re: 10GB XML Blows out Memory, Suggestions?

            "K.S.Sreeram" <sreeram@tachyontech.net> writes:
            [...]
            > There's just NO WAY that the 10gb xml file can be loaded into memory as
            > a tree on any normal machine, irrespective of whether we use C or
            > Python.

            Yes.

            > So the *only* way is to perform some kind of 'stream' processing
            > on the file. Perhaps using a SAX like API. So (c)ElementTree is ruled
            > out for this.

            No, that's not true. I guess you didn't read the other posts:



            > Diez B. Roggisch wrote:
            > > No what exactly makes C grok a 10Gb file where python will fail to do so?
            >
            > In most typical cases where there's any kind of significant python code,
            > it's possible to achieve a *minimum* of a 10x speedup by using C. In most
            [...]

            I don't know where you got that from. And in this particular case, of
            course, cElementTree *is* written in C, there's presumably plenty of
            "significant python code" around since, one assumes, *all* of the OP's
            code is written in Python (does that count as "any kind" of Python
            code?), and yet rewriting something in C here may not make much
            difference.


            John

            • fuzzylollipop

              #21
              Re: 10GB XML Blows out Memory, Suggestions?


              K.S.Sreeram wrote:
              > Diez B. Roggisch wrote:
              > > What the OP needs is a different approach to XML-documents that won't
              > > parse the whole file into one giant tree - but I'm pretty sure that
              > > (c)ElementTree will do the job as well as expat. And I don't recall the
              > > OP musing about performances woes, btw.
              >
              >
              > There's just NO WAY that the 10gb xml file can be loaded into memory as
              > a tree on any normal machine, irrespective of whether we use C or
              > Python. So the *only* way is to perform some kind of 'stream' processing
              > on the file. Perhaps using a SAX like API. So (c)ElementTree is ruled
              > out for this.
              >
              > Diez B. Roggisch wrote:
              > > No what exactly makes C grok a 10Gb file where python will fail to do so?
              >
              > In most typical cases where there's any kind of significant python code,
              > it's possible to achieve a *minimum* of a 10x speedup by using C. In most
              > cases, the speedup is not worth it and we just trade it for the
              > increased flexibility/power of the python language. But in this situation
              > using a bit of tight C code could make the difference between the
              > process taking just 15mins or taking a few hours!
              >
              > Of course I'm not asking him to write the entire application in C. It
              > makes sense to just write the performance critical sections in C, and
              > wrap it in Python, and write the rest of the application in Python.


              you got no idea what you are talking about, anyone knows that something
              like this is IO bound.
              CPU is the least of his worries. And for IO bound applications Python
              is just as fast as any other language.

              • fuzzylollipop

                #22
                Re: 10GB XML Blows out Memory, Suggestions?


                axwack@gmail.com wrote:
                > Paul,
                >
                > This is interesting. Unfortunately, I have no control over the XML
                > output. The file is from Goldmine. However, you have given me an
                > idea...
                >
                > Is it possible to read an XML document in compressed format?

                compressing the footprint on disk won't matter, you still have 10GB of
                data that you need to process and it can only be processed
                uncompressed.

                I would just export the data in smaller batches, there should not be
                any reason you can't export subsets and process them that way.

                • Fredrik Lundh

                  #23
                  Re: 10GB XML Blows out Memory, Suggestions?

                  fuzzylollipop wrote:

                  > you got no idea what you are talking about, anyone knows that something
                  > like this is IO bound.

                  which of course explains why some XML parsers for Python are a 100 times
                  faster than other XML parsers for Python...

                  </F>

                  • Fredrik Lundh

                    #24
                    Re: 10GB XML Blows out Memory, Suggestions?

                    fuzzylollipop wrote:

                    >> Is it possible to read an XML document in compressed format?
                    >
                    > compressing the footprint on disk won't matter, you still have 10GB of
                    > data that you need to process and it can only be processed uncompressed.

                    didn't you just claim that this was an I/O bound problem ?

                    </F>

                    • Fredrik Lundh

                      #25
                      Re: 10GB XML Blows out Memory, Suggestions?

                      axwack@gmail.com wrote:
                      > Paul,
                      >
                      > This is interesting. Unfortunately, I have no control over the XML
                      > output. The file is from Goldmine. However, you have given me an
                      > idea...
                      >
                      > Is it possible to read an XML document in compressed format?

                      sure. you can e.g. use gzip.open to create a file object that
                      decompresses on the way in.

                      import gzip
                      from xml.etree import cElementTree as ET

                      file = gzip.open("data.xml.gz")

                      for event, elem in ET.iterparse(file):
                          if elem.tag == "item":
                              elem.clear()

                      I tried compressing my 1 GB example, but all 1000-byte records in that
                      file are identical, so I got a 500x compression, which is a bit higher
                      than you can reasonably expect ;-) however, with that example, I get a
                      stable parsing time of 26 seconds, so it looks as if gzip can produce
                      data about as fast as a preloaded disk cache...

                      </F>
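
A self-contained variant of the gzip approach for a current Python 3, as a hedged sketch only: it writes a small compressed file into a temporary directory so that it can run anywhere, and the helper name `count_items_gz` and the sample payload are made up for the example. The parser pulls bytes from the gzip file object as it needs them, so the uncompressed document never has to exist on disk.

```python
import gzip
import os
import tempfile
import xml.etree.ElementTree as ET

# Build a small compressed stand-in for data.xml.gz.
path = os.path.join(tempfile.mkdtemp(), "data.xml.gz")
with gzip.open(path, "wb") as out:
    out.write(b"<items>" + b"<item>payload</item>" * 5000 + b"</items>")


def count_items_gz(gz_path, tag="item"):
    """Count `tag` elements, decompressing on the fly as the parser reads."""
    count = 0
    with gzip.open(gz_path, "rb") as stream:
        for _, elem in ET.iterparse(stream):
            if elem.tag == tag:
                count += 1
                elem.clear()  # discard the element's content once counted
    return count


print(count_items_gz(path))
```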

                      • gregarican

                        #26
                        Re: 10GB XML Blows out Memory, Suggestions?

                        Am I missing something? I don't read where the poster mentioned the
                        operation as being CPU intensive. He does mention that the entirety of
                        a 10 GB file cannot be loaded into memory. If you discount physical
                        swapfile paging and base this assumption on a "normal" PC that might
                        have maybe 1 or 2 GB of RAM, is his assumption that out of line?

                        And I don't doubt that Python is as efficient as possible for I/O
                        operations. But since it is an interpreted scripting language how could
                        it be "just as fast as any language" as you claim? C would have to be
                        faster. Machine language would have to be faster. And even other
                        interpreted languages *could* be faster, given certain conditions. A
                        generalization like the claim kind of invalidates the remainder of your
                        assertion.

                        fuzzylollipop wrote:
                        > K.S.Sreeram wrote:
                        > > Diez B. Roggisch wrote:
                        > > > What the OP needs is a different approach to XML-documents that won't
                        > > > parse the whole file into one giant tree - but I'm pretty sure that
                        > > > (c)ElementTree will do the job as well as expat. And I don't recall the
                        > > > OP musing about performances woes, btw.
                        > >
                        > >
                        > > There's just NO WAY that the 10gb xml file can be loaded into memory as
                        > > a tree on any normal machine, irrespective of whether we use C or
                        > > Python. So the *only* way is to perform some kind of 'stream' processing
                        > > on the file. Perhaps using a SAX like API. So (c)ElementTree is ruled
                        > > out for this.
                        > >
                        > > Diez B. Roggisch wrote:
                        > > > No what exactly makes C grok a 10Gb file where python will fail to do so?
                        > >
                        > > In most typical cases where there's any kind of significant python code,
                        > > it's possible to achieve a *minimum* of a 10x speedup by using C. In most
                        > > cases, the speedup is not worth it and we just trade it for the
                        > > increased flexibility/power of the python language. But in this situation
                        > > using a bit of tight C code could make the difference between the
                        > > process taking just 15mins or taking a few hours!
                        > >
                        > > Of course I'm not asking him to write the entire application in C. It
                        > > makes sense to just write the performance critical sections in C, and
                        > > wrap it in Python, and write the rest of the application in Python.
                        >
                        >
                        > you got no idea what you are talking about, anyone knows that something
                        > like this is IO bound.
                        > CPU is the least of his worries. And for IO bound applications Python
                        > is just as fast as any other language.

                        • fuzzylollipop

                          #27
                          Re: 10GB XML Blows out Memory, Suggestions?


                          Fredrik Lundh wrote:
                          > fuzzylollipop wrote:
                          >
                          > > you got no idea what you are talking about, anyone knows that something
                          > > like this is IO bound.
                          >
                          > which of course explains why some XML parsers for Python are a 100 times
                          > faster than other XML parsers for Python...
                          >

                          depends on the CODE and the SIZE of the file, in this case

                          processing 10GB of file, unless that file is heavily encrypted or
                          compressed, will be IO bound PERIOD!

                          And in the case of XML unless the PARSER is extremely inefficient, and
                          I assume, that would be an edge case, the parser is NOT the bottleneck
                          in this case.

                          The relative performance of Python XML parsers is irrelevant in
                          relationship to this being an IO bound process, even the slowest parser
                          could only process the data as fast as it can be read off the disk.

                          Anyone saying that using C instead of Python will be faster when 99% of
                          the time in this case is just waiting on the disk to feed a buffer, has
                          no idea what they are talking about.

                          I work with TeraBytes of files, and all our Python code is just as fast
                          as equivalent C code for IO bound processes.

                          • axwack@gmail.com

                            #28
                            Re: 10GB XML Blows out Memory, Suggestions?

                            Thanks guys for all your posts...

                            So I am a bit confused... Fuzzy, the code I saw looks like it
                            decompresses as a stream (i.e. per byte). Is this the case or are you
                            just compressing for file storage but the actual data set has to be
                            exploded in memory?
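
On the streaming question, a small sketch (an illustration, not code from the thread) suggests how a gzip file object behaves: each `read` hands back at most the requested number of decompressed bytes, so only one chunk needs to be held at a time. The sample data, the 64 KB chunk size, and the helper name `stream_length` are arbitrary choices made for the example.

```python
import gzip
import io

# 10 MB of highly repetitive data, compressed in memory.
raw = b"<item>payload</item>" * 500_000
compressed = gzip.compress(raw)


def stream_length(gz_bytes, chunk_size=64 * 1024):
    """Return (total decompressed bytes, largest chunk held at any one time)."""
    total = largest = 0
    with gzip.GzipFile(fileobj=io.BytesIO(gz_bytes)) as stream:
        while True:
            chunk = stream.read(chunk_size)  # decompresses only what is asked for
            if not chunk:
                break
            total += len(chunk)
            largest = max(largest, len(chunk))
    return total, largest


total, largest = stream_length(compressed)
print(total, largest, len(compressed))
```

An XML parser reading from such a file object consumes the data the same way, one buffer at a time.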

                            fuzzylollipop wrote:
                            > Fredrik Lundh wrote:
                            > > fuzzylollipop wrote:
                            > >
                            > > > you got no idea what you are talking about, anyone knows that something
                            > > > like this is IO bound.
                            > >
                            > > which of course explains why some XML parsers for Python are a 100 times
                            > > faster than other XML parsers for Python...
                            > >
                            >
                            > depends on the CODE and the SIZE of the file, in this case
                            >
                            > processing 10GB of file, unless that file is heavily encrypted or
                            > compressed, will be IO bound PERIOD!
                            >
                            > And in the case of XML unless the PARSER is extremely inefficient, and
                            > I assume, that would be an edge case, the parser is NOT the bottleneck
                            > in this case.
                            >
                            > The relative performance of Python XML parsers is irrelevant in
                            > relationship to this being an IO bound process, even the slowest parser
                            > could only process the data as fast as it can be read off the disk.
                            >
                            > Anyone saying that using C instead of Python will be faster when 99% of
                            > the time in this case is just waiting on the disk to feed a buffer, has
                            > no idea what they are talking about.
                            >
                            > I work with TeraBytes of files, and all our Python code is just as fast
                            > as equivalent C code for IO bound processes.

                            • Diez B. Roggisch

                              #29
                              Re: 10GB XML Blows out Memory, Suggestions?

                              fuzzylollipop wrote:

                              >
                              > Fredrik Lundh wrote:
                              >> fuzzylollipop wrote:
                              >>
                              >> > you got no idea what you are talking about, anyone knows that something
                              >> > like this is IO bound.
                              >>
                              >> which of course explains why some XML parsers for Python are a 100 times
                              >> faster than other XML parsers for Python...
                              >>
                              >
                              > depends on the CODE and the SIZE of the file, in this case
                              >
                              > processing 10GB of file, unless that file is heavily encrypted or
                              > compressed, will be IO bound PERIOD!

                              Why so? IO-bounds will be hit when the processing of the fetched data is
                              faster than the fetching itself. So if I decide to read 10GB at one 4 KB
                              block per second, I'm possibly a very patient fella, but no IO-bounds are
                              hit. So no PERIOD here - without talking about _what_ actually happens.

                              > Anyone saying that using C instead of Python will be faster when 99% of
                              > the time in this case is just waiting on the disk to feed a buffer, has
                              > no idea what they are talking about.

                              Which is true - but the chances for C performing whatever I want to in the
                              1% of time are a few times better than to do so in Python.

                              Mind you: I don't argue that the statements of Mr. Sreeram are true, either.
                              This discussion can only be held with respect to the actual use case (which
                              is certainly more than just parsing XML, but also processing it).

                              > I work with TeraBytes of files, and all our Python code is just as fast
                              > as equivalent C code for IO bound processes.

                              Care to share what kind of processing you perform on these files?

                              Regards,

                              Diez
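
Whether a given job is I/O bound is something that can be measured rather than argued. One rough way, sketched below under stated assumptions, is to time a pass that only pulls the bytes through against a pass that also parses them. The helper names are made up, and the sample lives in memory, so the read-only figure here says nothing about disk speed; on the real file both passes would open the file from disk, ideally with a cold cache.

```python
import io
import time
import xml.etree.ElementTree as ET


def time_read_only(data, chunk_size=64 * 1024):
    """Seconds spent just pulling the bytes through in chunks."""
    stream = io.BytesIO(data)
    start = time.perf_counter()
    while stream.read(chunk_size):
        pass
    return time.perf_counter() - start


def time_read_and_parse(data):
    """Seconds spent reading and parsing the same bytes with iterparse."""
    start = time.perf_counter()
    for _, elem in ET.iterparse(io.BytesIO(data)):
        elem.clear()
    return time.perf_counter() - start


data = b"<items>" + b"<item><a>1</a><b>2</b></item>" * 20000 + b"</items>"
read_s = time_read_only(data)
parse_s = time_read_and_parse(data)
print("read only: %.4fs, read + parse: %.4fs" % (read_s, parse_s))
```

If the two numbers are close on the real file, the disk dominates; if parsing takes several times longer than reading, the parser and the per-record processing are where the time goes.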

                              • gregarican

                                #30
                                Re: 10GB XML Blows out Memory, Suggestions?

                                Point for Fredrik. If someone doesn't recognize the inherent
                                performance differences between different XML parsers they haven't
                                experienced the pain (and eventual victory) of trying to optimize their
                                techniques for working with the albatross that XML can be :-)

                                Fredrik Lundh wrote:
                                > fuzzylollipop wrote:
                                >
                                > > depends on the CODE and the SIZE of the file, in this case
                                > > processing 10GB of file, unless that file is heavily encrypted or
                                > > compressed, will be IO bound PERIOD!
                                >
                                > so the fact that
                                >
                                > for token, node in pulldom.parse(file):
                                >     pass
                                >
                                > is 50-200% slower than
                                >
                                > for event, elem in ET.iterparse(file):
                                >     if elem.tag == "item":
                                >         elem.clear()
                                >
                                > when reading a gigabyte-sized XML file, is due to an unexpected slowdown
                                > in the I/O subsystem after importing xml.dom?
                                >
                                > > I work with TeraBytes of files, and all our Python code is just as fast
                                > > as equivalent C code for IO bound processes.
                                >
                                > so how large are the things that you're actually *processing* in your
                                > Python code? megabyte blobs or 100-1000 byte records? or even smaller
                                > things?
                                >
                                > </F>
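
The two loops in the quoted comparison can be run side by side on a current Python 3 with something like the sketch below. It assumes `xml.dom.pulldom` and `xml.etree.ElementTree` from the standard library and a small in-memory sample, and the function names are made up. The timings it prints depend on the machine and the Python version, so they should not be read as reproducing the 50-200% figure.

```python
import io
import time
import xml.etree.ElementTree as ET
from xml.dom import pulldom

data = b"<items>" + b"<item><a>1</a></item>" * 20000 + b"</items>"


def count_with_pulldom(raw):
    """Count <item> start events using the pulldom event stream."""
    count = 0
    for event, node in pulldom.parse(io.BytesIO(raw)):
        if event == pulldom.START_ELEMENT and node.tagName == "item":
            count += 1
    return count


def count_with_iterparse(raw):
    """Count <item> end events using ElementTree's iterparse."""
    count = 0
    for _, elem in ET.iterparse(io.BytesIO(raw)):
        if elem.tag == "item":
            count += 1
            elem.clear()
    return count


for func in (count_with_pulldom, count_with_iterparse):
    start = time.perf_counter()
    n = func(data)
    print("%s: %d items in %.3fs" % (func.__name__, n, time.perf_counter() - start))
```

Both functions read exactly the same bytes, so any gap between the two timings comes from the parsers, not from I/O.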
