Splitting a DOM

  • Brice Vissière

    Splitting a DOM

    Hello,

    I would like to handle an XML file structured as follows:
    <ROOT>
    <STEP>
    ....
    </STEP>
    <STEP>
    ....
    </STEP>
    ....
    </ROOT>

    From this file, I want to build an XML file for each STEP block.

    Currently I'm doing something like:

    from xml.dom.ext.reader import Sax2
    from xml.dom.ext import PrettyPrint

    reader = Sax2.Reader()
    my_dom = reader.fromUri('steps.xml')
    steps = my_dom.getElementsByTagName('STEP')

    i=0
    for step in steps:
        tmp = file('step%s.xml' % i,'w')
        tmp.write('<?xml version="1.0" encoding="ISO-8859-1" ?>\n')
        PrettyPrint(step , tmp , encoding='ISO-8859-1')
        tmp.close()
        i+=1

    But I'm pretty sure that there's a better way to split the DOM ?

    Thanks for any suggestion provided.

    Brice
  • Alan Kennedy

    #2
    Re: Splitting a DOM

    [Brice Vissière]
    > But I'm pretty sure that there's a better way to split the DOM ?

    There are *lots* of ways to solve this one. The "best" solution depends
    on which criteria you choose.

    The most efficient in time and memory is probably SAX, although the
    problem is so simple that a plain textual solution might work well, and
    would definitely be faster.
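A minimal sketch of what such a textual solution could look like. It is only an illustration under loud assumptions: STEP elements are never nested, never self-closing, and the literal text "<STEP" never appears inside comments or CDATA sections. The names STEP_RE and split_steps_textually are made up for this example.

```python
import re

# One non-nested <STEP ...>...</STEP> block; DOTALL lets '.' span newlines.
# Assumption: no nested or self-closing STEP elements in the input.
STEP_RE = re.compile(r"<STEP\b[^>]*>.*?</STEP>", re.DOTALL)

def split_steps_textually(xml_text):
    """Return each STEP block as a string, in document order."""
    return STEP_RE.findall(xml_text)

sample = '<ROOT>\n<STEP a="b">Step 0</STEP>\n<STEP>Step 1</STEP>\n</ROOT>'
blocks = split_steps_textually(sample)
# blocks[0] is '<STEP a="b">Step 0</STEP>'
```

Each string in blocks could then be written to its own file. If any of the assumptions above does not hold for your data, a real parser is the safer choice.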

    Here's a bit of SAX code adapted from another SAX example I posted
    earlier today. Note that this will not work properly if you have
    <STEP> elements nested inside one another. In that case, you'd have to
    maintain a stack of the output files: push the outfile onto the stack
    in "startElement()" and pop it off in "endElement()".

    #-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
    import xml.sax
    from xml.sax.saxutils import escape, quoteattr

    split_on_elems = ['STEP']

    class splitter(xml.sax.handler.ContentHandler):

        def __init__(self):
            xml.sax.handler.ContentHandler.__init__(self)
            self.outfile = None
            self.seq_no = self.seq_no_gen()

        def seq_no_gen(self, n=0):
            # Endless counter used to number the output files
            while True:
                yield n
                n = n + 1

        def startElement(self, elemname, attrs):
            if elemname in split_on_elems:
                self.outfile = open('step%04d.xml' % next(self.seq_no), 'wt')
            if self.outfile:
                attrstr = ""
                for a in attrs.keys():
                    attrstr = "%s%s" % (attrstr, " %s=%s" % (a,
                        quoteattr(attrs[a])))
                self.outfile.write("<%s%s>" % (elemname, attrstr))

        def endElement(self, elemname):
            if self.outfile:
                self.outfile.write('</%s>' % elemname)
            if elemname in split_on_elems:
                self.outfile.close()
                self.outfile = None

        def characters(self, s):
            # Escape &, < and > so the output stays well-formed
            if self.outfile:
                self.outfile.write(escape(s))

    testdoc = """
    <ROOT>
    <STEP a="b" c="d">Step 0</STEP>
    <STEP>Step 1</STEP>
    <STEP>Step 2</STEP>
    <STEP>Step 3</STEP>
    <STEP>Step 4</STEP>
    </ROOT>
    """

    if __name__ == "__main__":
        parser = xml.sax.make_parser()
        PFJ = splitter()
        parser.setContentHandler(PFJ)
        parser.setFeature(xml.sax.handler.feature_namespaces, 0)
        parser.feed(testdoc)
        parser.close()
    #-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
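For the nested case, the stack idea mentioned above could be sketched roughly as follows. This is one possible reading, not a definitive implementation: it assumes that content inside a nested STEP should appear in every enclosing STEP's output as well as its own, and it collects output in in-memory io.StringIO buffers instead of files so the example is self-contained. The names NestedSplitter and split_nested are invented for the illustration.

```python
import io
import xml.sax
from xml.sax.handler import ContentHandler
from xml.sax.saxutils import escape, quoteattr

class NestedSplitter(ContentHandler):
    """Collect every STEP element, nested or not, into its own buffer."""

    def __init__(self, split_on=("STEP",)):
        ContentHandler.__init__(self)
        self.split_on = split_on
        self.stack = []     # buffers of the STEP elements currently open
        self.results = []   # all buffers, ordered by start tag

    def _write(self, text):
        # Assumption: nested content belongs to every open STEP.
        for buf in self.stack:
            buf.write(text)

    def startElement(self, name, attrs):
        if name in self.split_on:
            buf = io.StringIO()
            self.stack.append(buf)      # push on start tag
            self.results.append(buf)
        attrstr = "".join(" %s=%s" % (k, quoteattr(v))
                          for k, v in attrs.items())
        self._write("<%s%s>" % (name, attrstr))

    def characters(self, content):
        self._write(escape(content))

    def endElement(self, name):
        self._write("</%s>" % name)
        if name in self.split_on:
            self.stack.pop()            # pop on end tag

def split_nested(xml_text):
    handler = NestedSplitter()
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return [buf.getvalue() for buf in handler.results]

docs = split_nested(
    '<ROOT><STEP id="1">a<STEP id="2">b</STEP>c</STEP></ROOT>')
# docs[0] holds the outer STEP including the inner one;
# docs[1] holds only the inner STEP.
```

To write files instead, the io.StringIO() call could be replaced by an open() call like the one in the code above, with a close() when the buffer is popped.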

    HTH,

    --
    alan kennedy
    ------------------------------------------------------
    check http headers here: http://xhaus.com/headers
    email alan: http://xhaus.com/contact/alan


    • Uche Ogbuji

      #3
      Re: Splitting a DOM

      brice.vissiere@costes-gestion.net (Brice Vissière) wrote in message news:<fa538331.0402120759.44f20301@posting.google.com>...
      > Hello,
      >
      > I would like to handle an XML file structured as follows:
      > <ROOT>
      > <STEP>
      > ...
      > </STEP>
      > <STEP>
      > ...
      > </STEP>
      > ...
      > </ROOT>
      >
      > From this file, I want to build an XML file for each STEP block.
      >
      > Currently I'm doing something like:
      >
      > from xml.dom.ext.reader import Sax2
      > from xml.dom.ext import PrettyPrint
      >
      > reader = Sax2.Reader()
      > my_dom = reader.fromUri('steps.xml')
      > steps = my_dom.getElementsByTagName('STEP')
      >
      > i=0
      > for step in steps:
      >     tmp = file('step%s.xml' % i,'w')
      >     tmp.write('<?xml version="1.0" encoding="ISO-8859-1" ?>\n')
      >     PrettyPrint(step , tmp , encoding='ISO-8859-1')
      >     tmp.close()
      >     i+=1
      >
      > But I'm pretty sure that there's a better way to split the DOM ?

      Here's an Anobind recipe:

      --- % ---

      #Boilerplate set-up

      import anobind
      from Ft.Xml import InputSource
      from Ft.Lib import Uri

      #Create an input source for the XML
      isrc_factory = InputSource.DefaultFactory
      #Create a URI from a filename the right way
      file_uri = Uri.OsPathToUri('steps.xml', attemptAbsolute=1)
      isrc = isrc_factory.fromUri(file_uri)

      #Now bind from the XML given in the input source
      binder = anobind.binder()
      binding = binder.read_xml(isrc)

      #File splitting task

      #The direct approach
      i = 0
      for folder in binding.xbel.folder:
          #Number each output file with the running counter
          fout = open('step%s.xml' % i, 'w')
          folder.unbind(fout)
          fout.close()
          i += 1

      --- % ---

      To use XPath, replace the line

      for folder in binding.xbel.folder:

      with

      for folder in binding.xpath_query(u'xbel/folder'):

      Anobind: http://uche.ogbuji.net/tech/4Suite/anobind/

      --Uche
      Igbo-American immigrant from Nigeria, settled near Boulder, Colorado with my wife, three sons and daughter. Restless mind in a restless body, I do a million things without getting very much truly done


      • Uche Ogbuji

        #4
        Re: Splitting a DOM

        brice.vissiere@costes-gestion.net (Brice Vissière) wrote in message news:<fa538331.0402120759.44f20301@posting.google.com>...
        > Hello,
        >
        > I would like to handle an XML file structured as follows:
        > <ROOT>
        > <STEP>
        > ...
        > </STEP>
        > <STEP>
        > ...
        > </STEP>
        > ...
        > </ROOT>
        >
        > From this file, I want to build an XML file for each STEP block.
        >
        > Currently I'm doing something like:
        >
        > from xml.dom.ext.reader import Sax2
        > from xml.dom.ext import PrettyPrint
        >
        > reader = Sax2.Reader()
        > my_dom = reader.fromUri('steps.xml')
        > steps = my_dom.getElementsByTagName('STEP')
        >
        > i=0
        > for step in steps:
        >     tmp = file('step%s.xml' % i,'w')
        >     tmp.write('<?xml version="1.0" encoding="ISO-8859-1" ?>\n')
        >     PrettyPrint(step , tmp , encoding='ISO-8859-1')
        >     tmp.close()
        >     i+=1
        >
        > But I'm pretty sure that there's a better way to split the DOM ?


        Here's an Anobind recipe:

        --- % ---

        #Boilerplate set-up

        import anobind
        from Ft.Xml import InputSource
        from Ft.Lib import Uri

        #Create an input source for the XML
        isrc_factory = InputSource.DefaultFactory
        #Create a URI from a filename the right way
        file_uri = Uri.OsPathToUri('steps.xml', attemptAbsolute=1)
        isrc = isrc_factory.fromUri(file_uri)

        #Now bind from the XML given in the input source
        binder = anobind.binder()
        binding = binder.read_xml(isrc)

        #File splitting task

        #The direct approach
        i = 0
        for folder in binding.ROOT.STEP:
            #Number each output file with the running counter
            fout = open('step%s.xml' % i, 'w')
            folder.unbind(fout)
            fout.close()
            i += 1

        --- % ---

        To use XPath, replace the line

        for folder in binding.ROOT.STEP:

        with

        for folder in binding.xpath_query(u'ROOT/STEP'):

        Anobind: http://uche.ogbuji.net/tech/4Suite/anobind/

        --Uche

        • Uche Ogbuji

          #5
          Re: Splitting a DOM

          brice.vissiere@costes-gestion.net (Brice Vissière) wrote in message news:<fa538331.0402120759.44f20301@posting.google.com>...
          > Hello,
          >
          > I would like to handle an XML file structured as follows:
          > <ROOT>
          > <STEP>
          > ...
          > </STEP>
          > <STEP>
          > ...
          > </STEP>
          > ...
          > </ROOT>
          >
          > From this file, I want to build an XML file for each STEP block.
          >
          > Currently I'm doing something like:
          >
          > from xml.dom.ext.reader import Sax2
          > from xml.dom.ext import PrettyPrint
          >
          > reader = Sax2.Reader()
          > my_dom = reader.fromUri('steps.xml')
          > steps = my_dom.getElementsByTagName('STEP')
          >
          > i=0
          > for step in steps:
          >     tmp = file('step%s.xml' % i,'w')
          >     tmp.write('<?xml version="1.0" encoding="ISO-8859-1" ?>\n')
          >     PrettyPrint(step , tmp , encoding='ISO-8859-1')
          >     tmp.close()
          >     i+=1
          >
          > But I'm pretty sure that there's a better way to split the DOM ?

          I already gave an Anobind recipe for this one, but I also wanted to
          post a few notes on your chosen approach:

          1) "from xml.dom.ext.reader import Sax2" means you're using 4DOM.
          4DOM is very slow. If you find this is a problem, use minidom. My
          Anobind recipe used cDomlette, which is *very* fast, faster even
          than minidom, but requires installing third-party software.

          2) "steps = my_dom.getElementsByTagName('STEP')". This could give
          unexpected results if you have nested STEP elements, because
          getElementsByTagName returns matching descendants at every depth.
          You might want to use a list comprehension such as

          steps = [ step for step in my_dom.documentElement.childNodes if
                    step.nodeName == u"STEP" ]
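Putting the two notes together, a minimal minidom sketch might look like this. It assumes the document is small enough to parse fully into memory and that only STEP elements directly under the root are wanted. The function name split_direct_steps is made up for the example, and the results are returned as strings rather than written to files.

```python
from xml.dom import minidom

def split_direct_steps(xml_text, tag="STEP"):
    """Serialize each direct child of the root named `tag`."""
    dom = minidom.parseString(xml_text)
    steps = [node for node in dom.documentElement.childNodes
             if node.nodeType == node.ELEMENT_NODE and node.nodeName == tag]
    # toxml() on an element emits the element itself, with no XML declaration
    return [step.toxml() for step in steps]

sample = ("<ROOT><STEP>outer<STEP>inner</STEP></STEP>"
          "<STEP>second</STEP></ROOT>")
pieces = split_direct_steps(sample)
# Two pieces: the nested STEP stays inside its parent rather than
# being split out a second time.
```

By contrast, getElementsByTagName('STEP') on the same sample returns three elements, because it also picks up the inner one.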

          Good luck.

          --Uche