Splitting a DOM

**Alan Kennedy** · Jul 18 '05, 08:24 AM

Re: Splitting a DOM

[Brice Vissière]
> But I'm pretty sure that there's a better way to split the DOM ?

There's *lots* of ways to solve this one. The "best" solution depends
on which criteria you choose.

The most efficient in time and memory is probably SAX, although the
problem is so simple, a simple textual solution might work well, and
would definitely be faster.
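
As a rough illustration of the textual route, here is a minimal sketch. It assumes the <STEP> elements are never nested or self-closing, and that no attribute value, comment or CDATA section contains a ">" or a STEP tag. The helper name split_steps_textually and the regular expression are made up for this example.

```python
import re

# Matches one non-nested <STEP ...>...</STEP> block.
# re.DOTALL lets "." span newlines; ".*?" stops at the first closing tag.
STEP_RE = re.compile(r"<STEP\b[^>]*>.*?</STEP>", re.DOTALL)

def split_steps_textually(text):
    """Return each <STEP> block found in `text` as a string (hypothetical helper)."""
    return STEP_RE.findall(text)

doc = '<ROOT><STEP a="b">Step 0</STEP><STEP>Step 1</STEP></ROOT>'
for i, block in enumerate(split_steps_textually(doc)):
    # A real script would write each block to its own file instead of printing.
    print("step%04d.xml <- %s" % (i, block))
```

This never parses the XML, so it cannot notice malformed input; that is the price of the speed.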

Here's a bit of SAX code adapted from another SAX example I posted
earlier today. Note that this will not work properly if you have
<STEP> elements nested inside one another. In that case, you'd have to
maintain a stack of the output files: push the outfile onto the stack
in "startElement()" and pop it off in "endElement()".

#-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
import xml.sax
from xml.sax.saxutils import escape, quoteattr

split_on_elems = ['STEP']

class splitter(xml.sax.handler.ContentHandler):

    def __init__(self):
        xml.sax.handler.ContentHandler.__init__(self)
        self.outfile = None
        self.seq_no = self.seq_no_gen()

    def seq_no_gen(self, n=0):
        while True: yield n ; n = n+1

    def startElement(self, elemname, attrs):
        # A new split element starts a new numbered output file
        if elemname in split_on_elems:
            self.outfile = open('step%04d.xml' % next(self.seq_no), 'wt')
        if self.outfile:
            attrstr = ""
            for a in attrs.keys():
                attrstr = "%s%s" % (attrstr, " %s=%s" % (a,
                    quoteattr(attrs[a])))
            self.outfile.write("<%s%s>" % (elemname, attrstr))

    def endElement(self, elemname):
        if self.outfile: self.outfile.write('</%s>' % elemname)
        if elemname in split_on_elems:
            self.outfile.close() ; self.outfile = None

    def characters(self, s):
        # escape() re-encodes &, < and > so the output stays well-formed
        if self.outfile: self.outfile.write(escape(s))

testdoc = """
<ROOT>
<STEP a="b" c="d">Step 0</STEP>
<STEP>Step 1</STEP>
<STEP>Step 2</STEP>
<STEP>Step 3</STEP>
<STEP>Step 4</STEP>
</ROOT>
"""

if __name__ == "__main__":
    parser = xml.sax.make_parser()
    PFJ = splitter()
    parser.setContentHandler(PFJ)
    parser.setFeature(xml.sax.handler.feature_namespaces, 0)
    parser.feed(testdoc)
    parser.close()  # tell the incremental parser the input is complete
#-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
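
The stack-based variant mentioned above for nested <STEP> elements could look roughly like the following sketch. It makes a few assumptions of its own: the class name NestedSplitter is invented, fragments are collected in in-memory buffers rather than files so the example is self-contained, and every open <STEP> receives all markup inside it, so an outer fragment also contains the nested ones.

```python
import io
import xml.sax
from xml.sax.saxutils import escape, quoteattr

class NestedSplitter(xml.sax.handler.ContentHandler):
    """Collects every STEP element, nested or not, as its own XML string."""

    def __init__(self, split_on=("STEP",)):
        xml.sax.handler.ContentHandler.__init__(self)
        self.split_on = split_on
        self.stack = []    # one buffer per STEP element that is currently open
        self.results = []  # finished fragments, in the order their end tags arrive

    def _write(self, text):
        # Every open STEP contains this markup, so write it to each buffer.
        for buf in self.stack:
            buf.write(text)

    def startElement(self, name, attrs):
        if name in self.split_on:
            self.stack.append(io.StringIO())  # push a new output
        attrstr = "".join(" %s=%s" % (k, quoteattr(v)) for k, v in attrs.items())
        self._write("<%s%s>" % (name, attrstr))

    def endElement(self, name):
        self._write("</%s>" % name)
        if name in self.split_on:
            self.results.append(self.stack.pop().getvalue())  # pop when closed

    def characters(self, content):
        self._write(escape(content))

handler = NestedSplitter()
xml.sax.parseString(
    b"<ROOT><STEP>a<STEP>b</STEP></STEP><STEP>c</STEP></ROOT>", handler
)
for fragment in handler.results:
    print(fragment)
```

Writing only to the buffer on top of the stack would instead leave a hole in the outer fragment where the nested <STEP> was; which behaviour you want depends on what the split files are for.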

HTH,

--
alan kennedy
------------------------------------------------------
check http headers here: http://xhaus.com/headers
email alan: http://xhaus.com/contact/alan

**Uche Ogbuji** · Jul 18 '05, 08:25 AM

Re: Splitting a DOM

brice.vissiere@costes-gestion.net (Brice Vissière) wrote in message news:<fa538331.0402120759.44f20301@posting.google.com>...
> Hello,
>
> I would like to handle an XML file structured as following
> <ROOT>
> <STEP>
> ...
> </STEP>
> <STEP>
> ...
> </STEP>
> ...
> </ROOT>
>
> From this file, I want to build an XML file for each STEP block.
>
> Currently I'm doing something like:
>
> from xml.dom.ext.reader import Sax2
> from xml.dom.ext import PrettyPrint
>
> reader = Sax2.Reader()
> my_dom = reader.fromUri('steps.xml')
> steps = my_dom.getElementsByTagName('STEP')
>
> i=0
> for step in steps:
>     tmp = file('step%s.xml' % i,'w')
>     tmp.write('<?xml version="1.0" encoding="ISO-8859-1" ?>\n')
>     PrettyPrint(step , tmp , encoding='ISO-8859-1')
>     tmp.close()
>     i+=1
>
> But I'm pretty sure that there's a better way to split the DOM ?

Here's an Anobind recipe:

--- % ---

#Boilerplate set-up

import anobind
from Ft.Xml import InputSource
from Ft.Lib import Uri

#Create an input source for the XML
isrc_factory = InputSource.DefaultFactory
#Create a URI from a filename the right way
file_uri = Uri.OsPathToUri('steps.xml', attemptAbsolute=1)
isrc = isrc_factory.fromUri(file_uri)

#Now bind from the XML given in the input source
binder = anobind.binder()
binding = binder.read_xml(isrc)

#File splitting task
import tempfile

#The direct approach
i = 0
for folder in binding.ROOT.STEP:
    #Number each output file with the loop counter
    fout = open('step%s.xml' % i, 'w')
    folder.unbind(fout)
    fout.close()
    i += 1

--- % ---

To use XPath, replace the line

    for folder in binding.ROOT.STEP:

with

    for folder in binding.xpath_query(u'ROOT/STEP'):

Anobind: http://uche.ogbuji.net/tech/4Suite/anobind/

--Uche

Uche Ogbuji's Home

http://uche.ogbuji.net

Igbo-American immigrant from Nigeria, settled near Boulder, Colorado with my wife, three sons and daughter. Restless mind in a restless body, I do a million things without getting very much truly done

**Uche Ogbuji** · Jul 18 '05, 08:25 AM

Re: Splitting a DOM

I already gave an Anobind recipe for this one, but I also wanted to
post a few notes on your chosen approach:

1) "from xml.dom.ext.reader import Sax2" means you're using 4DOM.
4DOM is very slow. If you find this is a problem, use minidom. My
Anobind recipe used cDomlette, which is *very* fast, certainly faster
than minidom, but requires installing third-party software.

2) "steps = my_dom.getElementsByTagName('STEP')". This could give
unexpected results in the case that you have nested STEP elements.
You might want to use a list comprehension such as

    steps = [ step for step in my_dom.documentElement.childNodes
              if step.nodeName == u"STEP" ]
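
To make the difference in note 2 concrete, here is a small sketch using minidom from the standard library. The sample document with a nested STEP is invented for the example, and the nodeType check is an extra guard I added so that text nodes between elements are skipped explicitly.

```python
from xml.dom import minidom

doc = minidom.parseString(
    "<ROOT><STEP>outer<STEP>inner</STEP></STEP><STEP>second</STEP></ROOT>"
)

# getElementsByTagName walks the whole tree, so the nested STEP is included.
all_steps = doc.getElementsByTagName("STEP")

# Only the element children of the root that are named STEP.
top_steps = [
    node for node in doc.documentElement.childNodes
    if node.nodeType == node.ELEMENT_NODE and node.nodeName == "STEP"
]

print(len(all_steps), len(top_steps))
for i, step in enumerate(top_steps):
    # A real script would write step.toxml() to 'step%s.xml' % i instead.
    print("step%s.xml <- %s" % (i, step.toxml()))
```

With the first list the nested element would be written out twice: once inside its parent and once as a file of its own.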

Good luck.

--Uche

