Re: dynamic allocation file buffer

  • Steven D'Aprano

    Re: dynamic allocation file buffer

    On Tue, 09 Sep 2008 14:59:19 -0700, castironpi wrote:
    > I will try my idea again. I want to talk to people about a module I
    > want to write and I will take the time to explain it. I think it's a
    > "cool idea" that a lot of people, forgiving the slang, could benefit
    > from. What are its flaws?

    [snip long description with not-very-credible use-cases]

    You've created a solution to a problem which (probably) only affects a
    very small number of people, at least judging by your use-cases. Who has
    a 4GB XML file, and how much crack did they smoke?

    Castironpi, what do *you* use this proof-of-concept module for? Don't
    bother telling us what you think *we* should use it for. Tell us what you're
    using it for, or at least what somebody else is using it for. If this is
    just a module that you think will be cool, I don't like your chances of
    people caring. There is no shortage of "cool" software that isn't useful
    for anything, and unlike eye-candy, nobody is going to use your module
    just because they like the algorithm.

    If you don't have an existing application for the software, then explain
    what it does (not how) and give some idea of the performance ("it's alpha
    and written in Python and really slow, but I will re-write it in C and
    expect it to make a billion random accesses in a 10GB file per
    millisecond", or whatever). You might be lucky and have somebody say
    "Hey, that's just the tool I need to solve my problem!".


    --
    Steven
  • castironpi

    #2
    Re: dynamic allocation file buffer

    On Sep 9, 5:58 pm, Steven D'Aprano <st...@REMOVE-THIS-cybersource.com.au> wrote:
    > On Tue, 09 Sep 2008 14:59:19 -0700, castironpi wrote:
    > > I will try my idea again.  I want to talk to people about a module I
    > > want to write and I will take the time to explain it.  I think it's a
    > > "cool idea" that a lot of people, forgiving the slang, could benefit
    > > from.  What are its flaws?
    >
    > [snip long description with not-very-credible use-cases]

    Steven,

    > You've created a solution to a problem which (probably) only affects a
    > very small number of people, at least judging by your use-cases. Who has
    > a 4GB XML file, and how much crack did they smoke?

    I judge from the existence of 'shelve' and 'pickle' modules, and
    relational database packages, that the problem I am addressing is not
    rare. It could be the millionaire investor across the street, the
    venture capitalist down the hall, or the guy with a huge CD catalog.

    > Castironpi, what do *you* use this proof-of-concept module for?

    Honestly, nothing yet. I just wrote it. My user community and
    customer base are very small. Originally, I wanted to store
    variable-length strings in a file, where shelves and databases were overkill.
    I created it for its beauty, sorry to disappoint.

    > Don't
    > bother telling us what you think *we* should use it for. Tell us what you're
    > using it for, or at least what somebody else is using it for. If this is
    > just a module that you think will be cool, I don't like your chances of
    > people caring. There is no shortage of "cool" software that isn't useful
    > for anything, and unlike eye-candy, nobody is going to use your module
    > just because they like the algorithm.

    Unfortunately, nobody is going to care about most of the uses I have
    for it 'til I have a job. I'm goofing around with a laptop,
    remembering when my databases professor kept dropping the ball on
    VARCHARs. If you want a sound bite, think, "imagine programming
    without 'new' and 'malloc'."

    > If you don't have an existing application for the software, then explain
    > what it does (not how) and give some idea of the performance ("it's alpha
    > and written in Python and really slow, but I will re-write it in C and
    > expect it to make a billion random accesses in a 10GB file per
    > millisecond", or whatever). You might be lucky and have somebody say
    > "Hey, that's just the tool I need to solve my problem!".

    I wrote a Rope implementation just to test drive it. It exceeded the
    native immutable string type at 2 megs. It used 'struct' instead of
    'ctypes', so that number could conceivably come down. I am intending
    to leave it in pure Python, so there.

    > --
    > Steven

    Pleasure chatting as always sir.
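
For readers trying to picture "variable-length strings in a file" without a shelf or a database, here is a minimal sketch. It is not castironpi's module (its code is not shown in this thread); the 4-byte length prefix and the names `store` and `load` are assumptions made for illustration, and an `io.BytesIO` object stands in for a file opened in "r+b" mode.

```python
import io
import struct

# Assumed layout: each string is stored as a 4-byte little-endian length
# followed by its UTF-8 bytes. The offset of the length field acts as the
# "pointer" to the string.
HEADER = struct.Struct("<I")

def store(buf, text):
    """Append one variable-length string and return its offset."""
    data = text.encode("utf-8")
    buf.seek(0, io.SEEK_END)
    offset = buf.tell()
    buf.write(HEADER.pack(len(data)))
    buf.write(data)
    return offset

def load(buf, offset):
    """Read back the string whose length prefix starts at `offset`."""
    buf.seek(offset)
    (length,) = HEADER.unpack(buf.read(HEADER.size))
    return buf.read(length).decode("utf-8")

buf = io.BytesIO()  # stands in for a real file opened with "r+b"
a = store(buf, "short")
b = store(buf, "a considerably longer string")
```

This only ever appends. An alloc/free layer of the kind discussed in the thread would additionally have to track freed ranges (for example with a free list) so that space can be reused.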


    • Fredrik Lundh

      #3
      Re: dynamic allocation file buffer

      Steven D'Aprano wrote:
      > You've created a solution to a problem which (probably) only affects a
      > very small number of people, at least judging by your use-cases. Who has
      > a 4GB XML file

      Getting 4GB XML files from, say, logging processes or databases that can
      render their output as XML is not that uncommon. They're usually
      record-oriented, and are intended to be processed as streams. And given
      the right tools, doing that is no harder than doing the same to a 4GB
      text file.

      </F>
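
As an illustration of the stream-style processing described above, here is a sketch using `iterparse` from the standard library's `xml.etree.ElementTree`. The `<record>` tag, the `level` attribute and the function name `count_errors` are made up for the example; only the `iterparse` and `clear` calls are real API.

```python
import io
import xml.etree.ElementTree as ET

def count_errors(source):
    """Stream <record> elements and count those with level="error"."""
    errors = 0
    # "end" events fire once an element has been read completely.
    for event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "record":
            if elem.get("level") == "error":
                errors += 1
            # Discard the record's text, attributes and children so that
            # processed records do not pile up in memory.
            elem.clear()
    return errors

# A tiny stand-in for a multi-gigabyte log file.
sample = io.BytesIO(
    b"<log>"
    b"<record level='info'>started</record>"
    b"<record level='error'>disk full</record>"
    b"<record level='error'>retry failed</record>"
    b"</log>"
)
n_errors = count_errors(sample)
```

The same function works on an open file or a file name, which is what makes the 4GB case look much like line-by-line processing of a large text file.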


      • Steven D'Aprano

        #4
        Re: dynamic allocation file buffer

        On Wed, 10 Sep 2008 09:26:20 +0200, Fredrik Lundh wrote:
        > Steven D'Aprano wrote:
        > > You've created a solution to a problem which (probably) only affects a
        > > very small number of people, at least judging by your use-cases. Who
        > > has a 4GB XML file
        >
        > Getting 4GB XML files from, say, logging processes or databases that can
        > render their output as XML is not that uncommon. They're usually
        > record-oriented, and are intended to be processed as streams. And given
        > the right tools, doing that is no harder than doing the same to a 4GB
        > text file.

        Fair enough, that's a good point.

        But would you expect random access to a 4GB XML file? If I've understood
        what Castironpi is trying for, his primary use case was for people
        wanting exactly that.


        --
        Steven


        • Aaron "Castironpi" Brady

          #5
          Re: dynamic allocation file buffer

          On Sep 10, 5:24 am, Steven D'Aprano
          <ste...@REMOVE.THIS.cybersource.com.au> wrote:
          > On Wed, 10 Sep 2008 09:26:20 +0200, Fredrik Lundh wrote:
          > > Steven D'Aprano wrote:
          > > > You've created a solution to a problem which (probably) only affects a
          > > > very small number of people, at least judging by your use-cases. Who
          > > > has a 4GB XML file
          > >
          > > Getting 4GB XML files from, say, logging processes or databases that can
          > > render their output as XML is not that uncommon.  They're usually
          > > record-oriented, and are intended to be processed as streams.  And given
          > > the right tools, doing that is no harder than doing the same to a 4GB
          > > text file.
          >
          > Fair enough, that's a good point.
          >
          > But would you expect random access to a 4GB XML file? If I've understood
          > what Castironpi is trying for, his primary use case was for people
          > wanting exactly that.
          >
          > --
          > Steven

          Steven,

          Are you claiming that sequential storage is sufficient for small
          amounts of data, and relational db.s are necessary for large amounts?
          It's possible that there is only the fringe exception, in which case
          'alloc/free' aren't useful in the majority of cases, and will never
          win customers away from the more mature competition.

          Regardless, it is an elegant solution to the problem of storing
          variable-length strings, with hardly any practical value. Perfect for
          grad school.


          • Steven D'Aprano

            #6
            Re: dynamic allocation file buffer

            On Wed, 10 Sep 2008 11:59:35 -0700, Aaron "Castironpi" Brady wrote:
            > On Sep 10, 5:24 am, Steven D'Aprano
            > <ste...@REMOVE.THIS.cybersource.com.au> wrote:
            > > On Wed, 10 Sep 2008 09:26:20 +0200, Fredrik Lundh wrote:
            > > > Steven D'Aprano wrote:
            > > > > You've created a solution to a problem which (probably) only affects
            > > > > a very small number of people, at least judging by your use-cases.
            > > > > Who has a 4GB XML file
            > > >
            > > > Getting 4GB XML files from, say, logging processes or databases that
            > > > can render their output as XML is not that uncommon.  They're usually
            > > > record-oriented, and are intended to be processed as streams.  And
            > > > given the right tools, doing that is no harder than doing the same to
            > > > a 4GB text file.
            > >
            > > Fair enough, that's a good point.
            > >
            > > But would you expect random access to a 4GB XML file? If I've
            > > understood what Castironpi is trying for, his primary use case was for
            > > people wanting exactly that.
            > >
            > > --
            > > Steven
            >
            > Steven,
            >
            > Are you claiming that sequential storage is sufficient for small amounts
            > of data, and relational db.s are necessary for large amounts?

            I'm no longer *claiming* anything, I'm *asking* whether random access to
            a 4GB XML file is something that is credible or useful. It is my
            understanding that XML is particularly ill-suited to random access once
            the amount of data is too large to fit in RAM.

            I'm interested in what Fredrik has to say about this, as he's the author
            of ElementTree.



            --
            Steven


            • Fredrik Lundh

              #7
              Re: dynamic allocation file buffer

              Steven D'Aprano wrote:
              > I'm no longer *claiming* anything, I'm *asking* whether random access to
              > a 4GB XML file is something that is credible or useful. It is my
              > understanding that XML is particularly ill-suited to random access once
              > the amount of data is too large to fit in RAM.

              An XML file doesn't contain any indexing information, so random access
              to a large XML file is very inefficient. You can build (or precompute)
              index information and store in a separate file, of course, but that's
              hardly something that's useful in the general case.

              And as I said before, the only use case for *huge* XML files I've ever
              seen used in practice is to store large streams of record-style data;
              data that's intended to be consumed by sequential processes (and you can
              do a lot with sequential processing these days; for those interested in
              this, digging up a few review papers on "data stream processing" might
              be a good way to waste some time).

              Document-style XML usually fits into memory on modern machines;
              structures larger than that are usually split into different parts (e.g.
              using XInclude) and stored in a container file.

              Random *modifications* to an arbitrary XML file cannot be done, as long
              as you store the file in a standard file system. And if you invent your
              own format, it's no longer an XML file.

              </F>
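
To make the "build index information and store it separately" option concrete, here is a rough sketch. It relies on a strong assumption that is not true of XML in general: every `<record>` element sits on a line of its own, as much machine-generated log output does. The names `build_index` and `read_record` are hypothetical, and the index is kept as an in-memory list rather than written to a separate file.

```python
import io

def build_index(f):
    """Scan the file once and return the byte offset of each record line."""
    offsets = []
    f.seek(0)
    while True:
        pos = f.tell()
        line = f.readline()
        if not line:
            break
        if line.lstrip().startswith(b"<record"):
            offsets.append(pos)
    return offsets

def read_record(f, offsets, n):
    """Jump straight to record n using the precomputed offsets."""
    f.seek(offsets[n])
    return f.readline().rstrip(b"\n")

# BytesIO stands in for a large file opened in binary mode.
f = io.BytesIO(
    b"<log>\n"
    b"<record id='0'>a</record>\n"
    b"<record id='1'>bb</record>\n"
    b"<record id='2'>ccc</record>\n"
    b"</log>\n"
)
offsets = build_index(f)
second = read_record(f, offsets, 1)
```

Building the index still costs one full sequential pass, and the index goes stale as soon as the file is rewritten, which is consistent with the point that this is not a general-purpose answer.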


              • Paul Boddie

                #8
                Re: dynamic allocation file buffer

                On 11 Sep, 10:34, Fredrik Lundh <fred...@pythonware.com> wrote:
                > And as I said before, the only use case for *huge* XML files I've ever
                > seen used in practice is to store large streams of record-style data;

                I can imagine that the manipulation of the persistent form of large
                graph structures might be another use case, although for efficient
                navigation of such a structure, which is what you'd need to start
                applying various graph algorithms, one would need some kind of index.
                Certainly, we're straying into database territory.

                Paul


                • Aaron "Castironpi" Brady

                  #9
                  Re: dynamic allocation file buffer

                  On Sep 11, 2:40 am, Steven D'Aprano
                  <ste...@REMOVE.THIS.cybersource.com.au> wrote:
                  > On Wed, 10 Sep 2008 11:59:35 -0700, Aaron "Castironpi" Brady wrote:
                  > > On Sep 10, 5:24 am, Steven D'Aprano
                  > > <ste...@REMOVE.THIS.cybersource.com.au> wrote:
                  > > > On Wed, 10 Sep 2008 09:26:20 +0200, Fredrik Lundh wrote:
                  > > > > Steven D'Aprano wrote:
                  > > > > > You've created a solution to a problem which (probably) only affects
                  > > > > > a very small number of people, at least judging by your use-cases.
                  > > > > > Who has a 4GB XML file
                  > > > >
                  > > > > Getting 4GB XML files from, say, logging processes or databases that
                  > > > > can render their output as XML is not that uncommon.  They're usually
                  > > > > record-oriented, and are intended to be processed as streams.  And
                  > > > > given the right tools, doing that is no harder than doing the same to
                  > > > > a 4GB text file.
                  > > >
                  > > > Fair enough, that's a good point.
                  > > >
                  > > > But would you expect random access to a 4GB XML file? If I've
                  > > > understood what Castironpi is trying for, his primary use case was for
                  > > > people wanting exactly that.
                  > > >
                  > > > --
                  > > > Steven
                  > >
                  > > Steven,
                  > >
                  > > Are you claiming that sequential storage is sufficient for small amounts
                  > > of data, and relational db.s are necessary for large amounts?
                  >
                  > I'm no longer *claiming* anything, I'm *asking* whether random access to
                  > a 4GB XML file is something that is credible or useful. It is my
                  > understanding that XML is particularly ill-suited to random access once
                  > the amount of data is too large to fit in RAM.
                  >
                  > I'm interested in what Fredrik has to say about this, as he's the author
                  > of ElementTree.
                  >
                  > --
                  > Steven

                  XML is the wrong word for the example I was thinking of (as was
                  already pointed out in another thread). XML is by definition
                  sequential. The use case pertained to a generic element hierarchy;
                  think of 4GB of hierarchical data.


                  • Aaron "Castironpi" Brady

                    #10
                    Re: dynamic allocation file buffer

                    On Sep 11, 5:35 am, Paul Boddie <p...@boddie.org.uk> wrote:
                    > On 11 Sep, 10:34, Fredrik Lundh <fred...@pythonware.com> wrote:
                    > > And as I said before, the only use case for *huge* XML files I've ever
                    > > seen used in practice is to store large streams of record-style data;
                    >
                    > I can imagine that the manipulation of the persistent form of large
                    > graph structures might be another use case, although for efficient
                    > navigation of such a structure, which is what you'd need to start
                    > applying various graph algorithms, one would need some kind of index.
                    > Certainly, we're straying into database territory.
                    >
                    > Paul

                    An acquaintance suggests that defragmentation would be a useful
                    service to provide along with memory management too, which also
                    requires an index.

                    I encourage overlap between a bare-bones alloc/free module and
                    established database territory and I'm very aware of it.

                    Databases already support both concurrency and persistence, but don't
                    tell me you'd use a database for IPC. And don't tell me you've never
                    wished you had a reference to a record in a table so that you could
                    make an update just by changing one word of memory at the right
                    place. Sometimes databases are overkill where all you want is dynamic
                    allocation.
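
The "change one word of memory at the right place" idea can be sketched with the standard `mmap` and `struct` modules. This is only an illustration under assumed conventions: a file of four fixed 8-byte integer slots, with a byte offset playing the role of the "reference to a record". It is not the alloc/free module under discussion.

```python
import mmap
import os
import struct
import tempfile

# Assumed layout: the file is an array of 8-byte little-endian signed integers.
SLOT = struct.Struct("<q")

path = os.path.join(tempfile.mkdtemp(), "slots.bin")
with open(path, "wb") as f:
    f.write(b"\x00" * SLOT.size * 4)  # four zeroed slots

with open(path, "r+b") as f:
    with mmap.mmap(f.fileno(), 0) as m:
        offset = 2 * SLOT.size             # the "reference" to slot 2
        SLOT.pack_into(m, offset, 12345)   # overwrite exactly one word in place
        m.flush()

# Read the file back through ordinary file I/O to confirm the change persisted.
with open(path, "rb") as f:
    values = [v for (v,) in SLOT.iter_unpack(f.read())]
```

Because the mapping can be shared between processes, the same mechanism also covers the IPC case mentioned above, although coordinating concurrent writers is left entirely to the caller here.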


                    • Paul Boddie

                      #11
                      Re: dynamic allocation file buffer

                      On 11 Sep, 19:31, "Aaron \"Castironpi\" Brady" <castiro...@gmail.com>
                      wrote:
                      > An acquaintance suggests that defragmentation would be a useful
                      > service to provide along with memory management too, which also
                      > requires an index.

                      I presume that you mean efficient access to large amounts of data in
                      the sense that if all the data you want happens to be in the same page
                      or segment, then retrieving it is much more efficient than having to
                      seek around for all the different pieces. So the defragmentation would
                      be what they call clustering in a relational database context:

                      I've seen similar phenomena outside the relational database world,
                      notably with big Lucene indexes which wouldn't fit in memory in their
                      entirety.

                      > I encourage overlap between a bare-bones alloc/free module and
                      > established database territory and I'm very aware of it.
                      >
                      > Databases already support both concurrency and persistence, but don't
                      > tell me you'd use a database for IPC.

                      Of course, databases are widely used in scalable systems to hold
                      central state, which is why there's a lot of effort put not only
                      into scaling up database installations, but also into things like
                      caching which are supposed to save the database systems behind popular
                      Web applications from excessive load.

                      > And don't tell me you've never
                      > wished you had a reference to a record in a table so that you could
                      > make an update just by changing one word of memory at the right
                      > place. Sometimes databases are overkill where all you want is dynamic
                      > allocation.

                      I think that the challenge is to reduce an abstract operation (for
                      example, wanting to update a particular column in a particular record)
                      to its measurable effects (this word of memory/disk will change as a
                      consequence). It's easy for a human with a reasonable knowledge of,
                      say, a relational database system to anticipate such things, but to
                      actually collapse a number of layers through some kind of generic
                      optimisation process is a lot more difficult.

                      Paul


                      • Steven D'Aprano

                        #12
                        Re: dynamic allocation file buffer

                        On Thu, 11 Sep 2008 10:20:41 -0700, Aaron "Castironpi" Brady wrote:
                        > XML is the wrong word for the example I was thinking of (as was already
                        > pointed out in another thread). XML is by definition sequential.

                        I'm pretty sure you're wrong. XML can be used for serialization, but that
                        doesn't mean it is only sequential data. XML is suitable for hierarchical
                        data too. To quote Wikipedia:

                        "As long as only well-formedness is required, XML is a generic framework
                        for storing any amount of text or any data whose structure can be
                        represented as a tree. The only indispensable syntactical requirement is
                        that the document has exactly one root element (alternatively called the
                        document element)."






                        --
                        Steven


                        • Aaron "Castironpi" Brady

                          #13
                          Re: dynamic allocation file buffer

                          On Sep 11, 10:37 pm, Steven D'Aprano
                          <ste...@REMOVE.THIS.cybersource.com.au> wrote:
                          > On Thu, 11 Sep 2008 10:20:41 -0700, Aaron "Castironpi" Brady wrote:
                          > > XML is the wrong word for the example I was thinking of (as was already
                          > > pointed out in another thread).  XML is by definition sequential.
                          >
                          > I'm pretty sure you're wrong. XML can be used for serialization, but that
                          > doesn't mean it is only sequential data. XML is suitable for hierarchical
                          > data too. To quote Wikipedia:
                          >
                          > "As long as only well-formedness is required, XML is a generic framework
                          > for storing any amount of text or any data whose structure can be
                          > represented as a tree. The only indispensable syntactical requirement is
                          > that the document has exactly one root element (alternatively called the
                          > document element)."
                          >
                          > --
                          > Steven

                          That's my choice of words at work again, I'm afraid. What I mean is,
                          there is no possibility that you can correctly interpret a segment of
                          XML text without knowing certain facts about everything that precedes
                          it. Compare to the case of a fixed-length record file, of record size
                          say 20, where you know the meaning of the characters in offset ranges
                          20-40, 80-100, 500020-500040, etc.

                          To clarify the point of the use case in question, because data would
                          be allocated and located dynamically, it's possible that you could read
                          the first several words, then not need anything until, say, the 1KB
                          mark. (Unless you're somehow storing an offset into an XML string as
                          a value in the string, which would require composing it, leaving room
                          for that value, and then writing it with random access anyway.) There
                          can be gaps in a dynamically managed buffer, say the unused/free
                          bytes from offsets 200 to 220, but every byte that follows another in
                          an XML file follows it in the file's meaning too. Is this any
                          clearer?

                          Aaron
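
The fixed-length record comparison can be shown in a few lines. This is a sketch under the stated assumption of 20-byte records; the padded `recNNNNN` contents and the name `read_record` are invented for the example.

```python
import io

RECORD_SIZE = 20  # bytes per record, as in the example above

def read_record(f, n):
    """Return record n by seeking directly to byte offset n * RECORD_SIZE."""
    f.seek(n * RECORD_SIZE)
    rec = f.read(RECORD_SIZE)
    if len(rec) != RECORD_SIZE:
        raise IndexError(n)
    return rec

# Five space-padded records; BytesIO stands in for a file opened in binary mode.
f = io.BytesIO(b"".join(("rec%05d" % i).ljust(RECORD_SIZE).encode("ascii")
                        for i in range(5)))
second = read_record(f, 1)  # the bytes at offsets 20 up to 40
```

Nothing before offset 20 has to be read or understood to interpret the result, which is the property an XML file lacks.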


                          • Steven D'Aprano

                            #14
                            Re: dynamic allocation file buffer

                            On Thu, 11 Sep 2008 22:40:01 -0700, Dennis Lee Bieber wrote:
                            > On 12 Sep 2008 03:37:51 GMT, Steven D'Aprano
                            > <steven@REMOVE.THIS.cybersource.com.au> declaimed the following in
                            > comp.lang.python:
                            > > I'm pretty sure you're wrong. XML can be used for serialization, but
                            > > that doesn't mean it is only sequential data. XML is suitable for
                            > > hierarchical data too. To quote Wikipedia:
                            >
                            > There is a difference between the format of the data content, and
                            > the processing of that data... Regardless of the content, one
                            > essentially has to process the XML /file/ sequentially, and translate
                            > into an in-memory model that allows for accessing said data. To reach
                            > the nth subelement of the mth element requires reading all 1..m-1
                            > elements, followed by all 1..n-1 subelements in m. Modifying any element
                            > requires rewriting the entire file.

                            Which is why I previously said that XML was not well suited for random
                            access.

                            I think we're starting to be sucked into a vortex of obtuse and opaque
                            communication. We agree that XML can store hierarchical data, and that it
                            has to be read and written sequentially, and that whatever the merits of
                            castironpi's software, his original use-case of random access to a 4GB
                            XML file isn't workable. Yes?



                            --
                            Steven


                            • Aaron "Castironpi" Brady

                              #15
                              Re: dynamic allocation file buffer

                              On Sep 12, 1:30 am, Steven D'Aprano
                              <ste...@REMOVE.THIS.cybersource.com.au> wrote:
                              > On Thu, 11 Sep 2008 22:40:01 -0700, Dennis Lee Bieber wrote:
                              > > On 12 Sep 2008 03:37:51 GMT, Steven D'Aprano
                              > > <ste...@REMOVE.THIS.cybersource.com.au> declaimed the following in
                              > > comp.lang.python:
                              > > > I'm pretty sure you're wrong. XML can be used for serialization, but
                              > > > that doesn't mean it is only sequential data. XML is suitable for
                              > > > hierarchical data too. To quote Wikipedia:
                              > >
                              > > There is a difference between the format of the data content, and
                              > > the processing of that data... Regardless of the content, one
                              > > essentially has to process the XML /file/ sequentially, and translate
                              > > into an in-memory model that allows for accessing said data. To reach
                              > > the nth subelement of the mth element requires reading all 1..m-1
                              > > elements, followed by all 1..n-1 subelements in m. Modifying any element
                              > > requires rewriting the entire file.
                              >
                              > Which is why I previously said that XML was not well suited for random
                              > access.
                              >
                              > I think we're starting to be sucked into a vortex of obtuse and opaque
                              > communication. We agree that XML can store hierarchical data, and that it
                              > has to be read and written sequentially, and that whatever the merits of
                              > castironpi's software, his original use-case of random access to a 4GB
                              > XML file isn't workable. Yes?
                              >
                              > --
                              > Steven

                              By 'isn't workable' do you mean, "no one ever uses 4GB of XML", or "no
                              one ever uses 4GB of hierarchical data, period"?

