Re: dynamic allocation file buffer

**castironpi** · Sep 10 '08, 12:55 AM

On Sep 9, 5:58 pm, Steven D'Aprano <st...@REMOVE-THIS-cybersource.com.au> wrote:

On Tue, 09 Sep 2008 14:59:19 -0700, castironpi wrote:

I will try my idea again. I want to talk to people about a module I
want to write and I will take the time to explain it. I think it's a
"cool idea" that a lot of people, forgiving the slang, could benefit
from. What are its flaws?

>
[snip long description with not-very-credible use-cases]

Steven,

You've created a solution to a problem which (probably) only affects a
very small number of people, at least judging by your use-cases. Who has
a 4GB XML file, and how much crack did they smoke?

I judge from the existence of 'shelve' and 'pickle' modules, and
relational database packages, that the problem I am addressing is not
rare. It could be the millionaire investor across the street, the
venture capitalist down the hall, or the guy with a huge CD catalog.

Castironpi, what do *you* use this proof-of-concept module for?

Honestly, nothing yet. I just wrote it. My user community and
customer base are very small. Originally, I wanted to store variable-
length strings in a file, where shelves and databases were overkill.
I created it for its beauty, sorry to disappoint.

Don't bother telling us what you think *we* should use it for. Tell us what you're
using it for, or at least what somebody else is using it for. If this is
just a module that you think will be cool, I don't like your chances of
people caring. There is no shortage of "cool" software that isn't useful
for anything, and unlike eye-candy, nobody is going to use your module
just because they like the algorithm.

Unfortunately, nobody is going to care about most of the uses I have
for it 'til I have a job. I'm goofing around with a laptop,
remembering when my databases professor kept dropping the ball on
VARCHARs. If you want a sound bite, think, "imagine programming
without 'new' and 'malloc'."

If you don't have an existing application for the software, then explain
what it does (not how) and give some idea of the performance ("it's alpha
and written in Python and really slow, but I will re-write it in C and
expect it to make a billion random accesses in a 10GB file per
millisecond", or whatever). You might be lucky and have somebody say
"Hey, that's just the tool I need to solve my problem!".

I wrote a Rope implementation just to test drive it. It exceeded the
native immutable string type at 2 megs. It used 'struct' instead of
'ctypes', so that number could conceivably come down. I am intending
to leave it in pure Python, so there.

--
Steven

Pleasure chatting as always sir.

**Fredrik Lundh** · Sep 10 '08, 07:35 AM

Steven D'Aprano wrote:

You've created a solution to a problem which (probably) only affects a
very small number of people, at least judging by your use-cases. Who has
a 4GB XML file

Getting 4GB XML files from, say, logging processes or databases that can
render their output as XML is not that uncommon. They're usually
record-oriented, and are intended to be processed as streams. And given
the right tools, doing that is no harder than doing the same to a 4GB
text file.
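
As a rough illustration of the stream-style processing described here, the following is a minimal sketch using `xml.etree.ElementTree.iterparse`. The `record` tag name and the `count_records` helper are assumptions made up for the example, not anything taken from a real log format.

```python
import io
import xml.etree.ElementTree as ET


def count_records(source, tag="record"):
    """Count elements named `tag` while streaming through `source`.

    `tag` is a hypothetical element name chosen for this sketch.
    """
    count = 0
    # iterparse yields (event, element) pairs as the parser advances,
    # so the whole document never has to be parsed up front.
    for event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == tag:
            count += 1
            # Drop the element's text and children once it is processed.
            elem.clear()
    return count


sample = io.BytesIO(
    b"<log><record id='1'>a</record><record id='2'>b</record></log>"
)
print(count_records(sample))
```

Note that `elem.clear()` still leaves an empty element attached to the root, so a very long stream may want to detach processed children from the root as well.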

</F>

**Steven D'Aprano** · Sep 10 '08, 10:25 AM

On Wed, 10 Sep 2008 09:26:20 +0200, Fredrik Lundh wrote:

> Steven D'Aprano wrote:
>
>> You've created a solution to a problem which (probably) only affects a
>> very small number of people, at least judging by your use-cases. Who
>> has a 4GB XML file
>
> Getting 4GB XML files from, say, logging processes or databases that can
> render their output as XML is not that uncommon. They're usually
> record-oriented, and are intended to be processed as streams. And given
> the right tools, doing that is no harder than doing the same to a 4GB
> text file.

Fair enough, that's a good point.

But would you expect random access to a 4GB XML file? If I've understood
what Castironpi is trying for, his primary use case was for people
wanting exactly that.

--
Steven

**Aaron "Castironpi" Brady** · Sep 10 '08, 07:05 PM

On Sep 10, 5:24 am, Steven D'Aprano
<ste...@REMOVE.THIS.cybersource.com.au> wrote:

On Wed, 10 Sep 2008 09:26:20 +0200, Fredrik Lundh wrote:

Steven D'Aprano wrote:

>

You've created a solution to a problem which (probably) only affects a
very small number of people, at least judging by your use-cases. Who
has a 4GB XML file

>

Getting 4GB XML files from, say, logging processes or databases that can
render their output as XML is not that uncommon. They're usually
record-oriented, and are intended to be processed as streams. And given
the right tools, doing that is no harder than doing the same to a 4GB
text file.

>
Fair enough, that's a good point.
>
But would you expect random access to a 4GB XML file? If I've understood
what Castironpi is trying for, his primary use case was for people
wanting exactly that.
>
--
Steven

Steven,

Are you claiming that sequential storage is sufficient for small
amounts of data, and relational db.s are necessary for large amounts?
It's possible that there is only the fringe exception, in which case
'alloc/free' aren't useful in the majority of cases, and will never
win customers away from the more mature competition.

Regardless, it is an elegant solution to the problem of storing
variable-length strings, with hardly any practical value. Perfect for
grad school.

**Steven D'Aprano** · Sep 11 '08, 07:45 AM

On Wed, 10 Sep 2008 11:59:35 -0700, Aaron "Castironpi" Brady wrote:

> On Sep 10, 5:24 am, Steven D'Aprano
> <ste...@REMOVE.THIS.cybersource.com.au> wrote:
>
>> On Wed, 10 Sep 2008 09:26:20 +0200, Fredrik Lundh wrote:
>>
>>> Steven D'Aprano wrote:
>>>
>>>> You've created a solution to a problem which (probably) only affects
>>>> a very small number of people, at least judging by your use-cases.
>>>> Who has a 4GB XML file
>>>
>>> Getting 4GB XML files from, say, logging processes or databases that
>>> can render their output as XML is not that uncommon. They're usually
>>> record-oriented, and are intended to be processed as streams. And
>>> given the right tools, doing that is no harder than doing the same to
>>> a 4GB text file.
>>
>> Fair enough, that's a good point.
>>
>> But would you expect random access to a 4GB XML file? If I've
>> understood what Castironpi is trying for, his primary use case was for
>> people wanting exactly that.
>>
>> --
>> Steven
>
> Steven,
>
> Are you claiming that sequential storage is sufficient for small amounts
> of data, and relational db.s are necessary for large amounts?

I'm no longer *claiming* anything, I'm *asking* whether random access to
a 4GB XML file is something that is credible or useful. It is my
understanding that XML is particularly ill-suited to random access once
the amount of data is too large to fit in RAM.

I'm interested in what Fredrik has to say about this, as he's the author
of ElementTree.

--
Steven

**Fredrik Lundh** · Sep 11 '08, 08:35 AM

Steven D'Aprano wrote:

I'm no longer *claiming* anything, I'm *asking* whether random access to
a 4GB XML file is something that is credible or useful. It is my
understanding that XML is particularly ill-suited to random access once
the amount of data is too large to fit in RAM.

An XML file doesn't contain any indexing information, so random access
to a large XML file is very inefficient. You can build (or precompute)
index information and store in a separate file, of course, but that's
hardly something that's useful in the general case.
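
To make the "precomputed index" idea concrete, here is a small sketch. It assumes a record-style file in which every `<record>` element sits on its own line (an assumption for illustration only; arbitrary XML gives no such guarantee), and the helper names `build_index` and `fetch` are hypothetical. The index is kept in a list here, where a real setup would write it to a separate file.

```python
import io
import xml.etree.ElementTree as ET


def build_index(f):
    """Scan the file once and return the byte offset of each record line."""
    offsets = []
    f.seek(0)
    while True:
        pos = f.tell()
        line = f.readline()
        if not line:
            break
        # Assumption: one complete <record> element per line.
        if line.lstrip().startswith(b"<record"):
            offsets.append(pos)
    return offsets


def fetch(f, offsets, n):
    """Seek straight to record n and parse only that line."""
    f.seek(offsets[n])
    return ET.fromstring(f.readline())


data = io.BytesIO(
    b"<log>\n"
    b"<record id='1'>first</record>\n"
    b"<record id='2'>second</record>\n"
    b"</log>\n"
)
index = build_index(data)
print(fetch(data, index, 1).text)
```

The index only stays valid while the file is unchanged, which is part of why this is of limited use in the general case.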

And as I said before, the only use case for *huge* XML files I've ever
seen used in practice is to store large streams of record-style data;
data that's intended to be consumed by sequential processes (and you can
do a lot with sequential processing these days; for those interested in
this, digging up a few review papers on "data stream processing" might
be a good way to waste some time).

Document-style XML usually fits into memory on modern machines;
structures larger than that are usually split into different parts (e.g.
using XInclude) and stored in a container file.

Random *modifications* to an arbitrary XML file cannot be done, as long
as you store the file in a standard file system. And if you invent your
own format, it's no longer an XML file.

</F>

**Paul Boddie** · Sep 11 '08, 10:45 AM

On 11 Sep, 10:34, Fredrik Lundh <fred...@pythonware.com> wrote:

>
And as I said before, the only use case for *huge* XML files I've ever
seen used in practice is to store large streams of record-style data;

I can imagine that the manipulation of the persistent form of large
graph structures might be another use case, although for efficient
navigation of such a structure, which is what you'd need to start
applying various graph algorithms, one would need some kind of index.
Certainly, we're straying into database territory.

Paul

**Aaron "Castironpi" Brady** · Sep 11 '08, 05:25 PM

On Sep 11, 2:40 am, Steven D'Aprano
<ste...@REMOVE.THIS.cybersource.com.au> wrote:

On Wed, 10 Sep 2008 11:59:35 -0700, Aaron "Castironpi" Brady wrote:

On Sep 10, 5:24 am, Steven D'Aprano
<ste...@REMOVE.THIS.cybersource.com.au> wrote:

On Wed, 10 Sep 2008 09:26:20 +0200, Fredrik Lundh wrote:
Steven D'Aprano wrote:

>

You've created a solution to a problem which (probably) only affects
a very small number of people, at least judging by your use-cases.
Who has a 4GB XML file

>

Getting 4GB XML files from, say, logging processes or databases that
can render their output as XML is not that uncommon. They're usually
record-oriented, and are intended to be processed as streams. And
given the right tools, doing that is no harder than doing the same to
a 4GB text file.

>

Fair enough, that's a good point.

>

But would you expect random access to a 4GB XML file? If I've
understood what Castironpi is trying for, his primary use case was for
people wanting exactly that.

>

--
Steven

>

Steven,

>

Are you claiming that sequential storage is sufficient for small amounts
of data, and relational db.s are necessary for large amounts?

>
I'm no longer *claiming* anything, I'm *asking* whether random access to
a 4GB XML file is something that is credible or useful. It is my
understanding that XML is particularly ill-suited to random access once
the amount of data is too large to fit in RAM.
>
I'm interested in what Fredrik has to say about this, as he's the author
of ElementTree.
>
--
Steven

XML is the wrong word for the example I was thinking of (as was
already pointed out in another thread). XML is by definition
sequential. The use case pertained to a generic element hierarchy;
think of 4GB of hierarchical data.

**Aaron "Castironpi" Brady** · Sep 11 '08, 05:35 PM

On Sep 11, 5:35 am, Paul Boddie <p...@boddie.org.uk> wrote:

On 11 Sep, 10:34, Fredrik Lundh <fred...@pythonware.com> wrote:
>
>
>

And as I said before, the only use case for *huge* XML files I've ever
seen used in practice is to store large streams of record-style data;

>
I can imagine that the manipulation of the persistent form of large
graph structures might be another use case, although for efficient
navigation of such a structure, which is what you'd need to start
applying various graph algorithms, one would need some kind of index.
Certainly, we're straying into database territory.
>
Paul

An acquaintance suggests that defragmentation would be a useful
service to provide along with memory management too, which also
requires an index.

I encourage overlap between a bare-bones alloc/free module and
established database territory and I'm very aware of it.

Databases already support both concurrency and persistence, but don't
tell me you'd use a database for IPC. And don't tell me you've never
wished you had a reference to a record in a table so that you could
make an update just by changing one word of memory at the right
place. Sometimes databases are overkill where all you want is dynamic
allocation.

**Paul Boddie** · Sep 11 '08, 10:05 PM

On 11 Sep, 19:31, Aaron "Castironpi" Brady <castiro...@gmail.com>
wrote:

>
An acquaintance suggests that defragmentation would be a useful
service to provide along with memory management too, which also
requires an index.

I presume that you mean efficient access to large amounts of data in
the sense that if all the data you want happens to be in the same page
or segment, then retrieving it is much more efficient than having to
seek around for all the different pieces. So the defragmentation would
be what they call clustering in a relational database context:

CLUSTER

http://www.postgresql.org/docs/8.3/static/sql-cluster.html

I've seen similar phenomena outside the relational database world,
notably with big Lucene indexes which wouldn't fit in memory in their
entirety.

I encourage overlap between a bare-bones alloc/free module and
established database territory and I'm very aware of it.
>
Databases already support both concurrency and persistence, but don't
tell me you'd use a database for IPC.

Of course, databases are widely used in scalable systems to hold
central state, which is why there's a lot of effort put into not
only scaling up database installations, but also into things like
caching which are supposed to save the database systems behind popular
Web applications from excessive load.

And don't tell me you've never
wished you had a reference to a record in a table so that you could
make an update just by changing one word of memory at the right
place. Sometimes databases are overkill where all you want is dynamic
allocation.

I think that the challenge is to reduce an abstract operation (for
example, wanting to update a particular column in a particular record)
to its measurable effects (this word of memory/disk will change as a
consequence). It's easy for a human with a reasonable knowledge of,
say, a relational database system to anticipate such things, but to
actually collapse a number of layers through some kind of generic
optimisation process is a lot more difficult.

Paul

**Steven D'Aprano** · Sep 12 '08, 03:45 AM

On Thu, 11 Sep 2008 10:20:41 -0700, Aaron "Castironpi" Brady wrote:

XML is the wrong word for the example I was thinking of (as was already
pointed out in another thread). XML is by definition sequential.

I'm pretty sure you're wrong. XML can be used for serialization, but that
doesn't mean it is only sequential data. XML is suitable for hierarchical
data too. To quote Wikipedia:

"As long as only well-formedness is required, XML is a generic framework
for storing any amount of text or any data whose structure can be
represented as a tree. The only indispensable syntactical requirement is
that the document has exactly one root element (alternatively called the
document element)."

XML - Wikipedia

http://en.wikipedia.org/wiki/Xml

--
Steven

**Aaron "Castironpi" Brady** · Sep 12 '08, 05:35 AM

On Sep 11, 10:37 pm, Steven D'Aprano
<ste...@REMOVE.THIS.cybersource.com.au> wrote:

On Thu, 11 Sep 2008 10:20:41 -0700, Aaron "Castironpi" Brady wrote:

XML is the wrong word for the example I was thinking of (as was already
pointed out in another thread). XML is by definition sequential.

>
I'm pretty sure you're wrong. XML can be used for serialization, but that
doesn't mean it is only sequential data. XML is suitable for hierarchical
data too. To quote Wikipedia:
>
"As long as only well-formedness is required, XML is a generic framework
for storing any amount of text or any data whose structure can be
represented as a tree. The only indispensable syntactical requirement is
that the document has exactly one root element (alternatively called the
document element)."
>

XML - Wikipedia

http://en.wikipedia.org/wiki/Xml

>
--
Steven

That's my choice of words at work again, I'm afraid. What I mean is,
there is no possibility that you can correctly interpret a segment of
XML text without knowing certain facts about everything that precedes
it. Compare to the case of a fixed-length record file, of record size
say 20, where you know the meaning of the characters in offset ranges
20-40, 80-100, 500020-500040, etc.
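
The fixed-length case can be sketched in a few lines. The record size of 20 comes from the paragraph above; the `read_record` helper and the space-padded sample data are assumptions made up for the example.

```python
import io

RECORD_SIZE = 20  # the record size used in the example above


def read_record(f, n, size=RECORD_SIZE):
    """Return record number n (0-based) by seeking directly to its offset."""
    f.seek(n * size)
    return f.read(size)


# Build a small in-memory "file" of three space-padded 20-byte records.
records = [b"alpha", b"beta", b"gamma"]
buf = io.BytesIO(b"".join(r.ljust(RECORD_SIZE) for r in records))

# Record 1 occupies offsets 20-40, so no earlier bytes need to be read.
print(read_record(buf, 1).rstrip())
```

Nothing before offset 20 is inspected to make sense of record 1, which is exactly what a segment of XML text cannot offer.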

To clarify the point of the use case in question: because data would
be allocated and located dynamically, it's possible that you could read
the first several words, then not need anything until, say, the 1KB
mark. (Unless you're somehow storing an offset into an XML string as
a value in the string, which would require composing it, leaving room
for that value, and then writing it with random access anyway.) There
can be gaps in a dynamically managed buffer, say the unused/free
bytes from offsets 200 to 220, but every byte that follows another in
an XML file follows it in the file's meaning too. Is this any
clearer?
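
For readers who want something concrete, here is a toy first-fit allocator over a byte buffer that shows how free gaps can sit between live blocks. It is only an illustrative sketch, not the module under discussion: the `BufferAllocator` class and its method names are hypothetical, and a `bytearray` stands in for the file.

```python
class BufferAllocator:
    """First-fit allocator over a fixed-size byte buffer (illustrative only)."""

    def __init__(self, size):
        self.buf = bytearray(size)
        self.gaps = [(0, size)]  # free (offset, length) pairs, sorted by offset
        self.used = {}           # offset -> length of each live block

    def alloc(self, length):
        """Reserve `length` bytes in the first gap that is large enough."""
        for i, (off, avail) in enumerate(self.gaps):
            if avail >= length:
                if avail == length:
                    del self.gaps[i]
                else:
                    self.gaps[i] = (off + length, avail - length)
                self.used[off] = length
                return off
        raise MemoryError("no gap large enough")

    def free(self, off):
        """Release the block at `off` and merge it with adjacent gaps."""
        length = self.used.pop(off)
        self.gaps.append((off, length))
        self.gaps.sort()
        merged = [self.gaps[0]]
        for o, n in self.gaps[1:]:
            prev_off, prev_len = merged[-1]
            if prev_off + prev_len == o:
                merged[-1] = (prev_off, prev_len + n)
            else:
                merged.append((o, n))
        self.gaps = merged

    def write(self, off, data):
        self.buf[off:off + len(data)] = data

    def read(self, off):
        return bytes(self.buf[off:off + self.used[off]])


store = BufferAllocator(64)
a = store.alloc(5)
store.write(a, b"hello")
b = store.alloc(20)
store.write(b, b"x" * 20)
c = store.alloc(3)
store.write(c, b"end")
store.free(b)  # leaves a 20-byte gap between the two live blocks
print(store.gaps)
```

After the `free` call the bytes at offsets 5 to 25 mean nothing, yet the blocks on either side remain readable through their offsets.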

Aaron

**Steven D'Aprano** · Sep 12 '08, 06:35 AM

On Thu, 11 Sep 2008 22:40:01 -0700, Dennis Lee Bieber wrote:

On 12 Sep 2008 03:37:51 GMT, Steven D'Aprano
<steven@REMOVE.THIS.cybersource.com.au> declaimed the following in
comp.lang.python:
>
>

>I'm pretty sure you're wrong. XML can be used for serialization, but
>that doesn't mean it is only sequential data. XML is suitable for
>hierarchical data too. To quote Wikipedia:
>>

There is a difference between the format of the data content, and
the processing of that data... Regardless of the content, one
essentially has to process the XML /file/ sequentially, and translate
into an in-memory model that allows for accessing said data. To reach
the nth subelement of the mth element requires reading all 1..m-1
elements, followed by all 1..n-1 subelements in m. Modifying any element
requires rewriting the entire file.

Which is why I previously said that XML was not well suited for random
access.

I think we're starting to be sucked into a vortex of obtuse and opaque
communication. We agree that XML can store hierarchical data, and that it
has to be read and written sequentially, and that whatever the merits of
castironpi's software, his original use-case of random access to a 4GB
XML file isn't workable. Yes?

--
Steven

**Aaron "Castironpi" Brady** · Sep 12 '08, 06:55 AM

On Sep 12, 1:30 am, Steven D'Aprano
<ste...@REMOVE.THIS.cybersource.com.au> wrote:

On Thu, 11 Sep 2008 22:40:01 -0700, Dennis Lee Bieber wrote:

On 12 Sep 2008 03:37:51 GMT, Steven D'Aprano
<ste...@REMOVE.THIS.cybersource.com.au> declaimed the following in
comp.lang.python:

>

I'm pretty sure you're wrong. XML can be used for serialization, but
that doesn't mean it is only sequential data. XML is suitable for
hierarchical data too. To quote Wikipedia:

>

There is a difference between the format of the data content, and
the processing of that data... Regardless of the content, one
essentially has to process the XML /file/ sequentially, and translate
into an in-memory model that allows for accessing said data. To reach
the nth subelement of the mth element requires reading all 1..m-1
elements, followed by all 1..n-1 subelements in m. Modifying any element
requires rewriting the entire file.

>
Which is why I previously said that XML was not well suited for random
access.
>
I think we're starting to be sucked into a vortex of obtuse and opaque
communication. We agree that XML can store hierarchical data, and that it
has to be read and written sequentially, and that whatever the merits of
castironpi's software, his original use-case of random access to a 4GB
XML file isn't workable. Yes?
>
--
Steven

By 'isn't workable' do you mean, "no one ever uses 4GB of XML", or "no
one ever uses 4GB of hierarchical data, period"?
