Parsing XML streams

  • Peter Scott

    Parsing XML streams

    I have a program that listens on an IRC channel and logs everything to
    XML on standard output. The format of the XML is pretty
    straightforward, looking like this:

    <channel name='#sandbox'>
    <message user='PeterScott'>Hello, my bot</message>
    <message user='PeterScott'>This is a message</message>
    <nickchange>
    <oldnick>PeterScott</oldnick>
    <newnick>PeterSc</newnick>
    </nickchange>
    </channel>

    I'm writing another program that should parse that sort of XML on its
    stdin, printing out a more user-friendly representation. For this, I
    need to parse the XML as it comes in, not all at once.

    I wrote a parser using xml.sax, and it works well---provided that I
    read in the whole document. However, I want to be able to just read
    the XML piece by piece, calling event handlers whenever something
    happens and waiting for more to happen.

    Is there some way to do this with the standard python xml parsers?
    Will I need to use PyXML? Or what?

    Thanks,
    -Peter
  • Jeremy Bowers

    #2
    Re: Parsing XML streams

    On Thu, 11 Sep 2003 16:30:18 -0700, Peter Scott wrote:
    > Is there some way to do this with the standard python xml parsers?
    > Will I need to use PyXML? Or what?

    xml.parsers.expat can parse things in pieces. It shouldn't be *too* much
    work to convert over.
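
For reference, here is a rough sketch of what piecewise parsing with xml.parsers.expat can look like. The handler functions and the `events` list are made up for illustration; only `ParserCreate`, the three handler attributes and `Parse(data, isfinal)` come from the module itself. It is written for Python 3.

```python
import xml.parsers.expat

events = []  # callbacks are collected here so their order is easy to inspect


def start_element(name, attrs):
    events.append(("start", name, dict(attrs)))


def end_element(name):
    events.append(("end", name))


def char_data(data):
    # expat may deliver one text node in several pieces; skip pure whitespace
    if data.strip():
        events.append(("text", data))


parser = xml.parsers.expat.ParserCreate()
parser.StartElementHandler = start_element
parser.EndElementHandler = end_element
parser.CharacterDataHandler = char_data

# Feed arbitrary pieces. The second argument tells expat whether this is
# the final piece of the document.
parser.Parse("<channel name='#sandbox'><mess", False)
parser.Parse("age user='PeterScott'>Hello, my bot</message>", False)
parser.Parse("</channel>", True)
```

Note that the `<message>` tag is split across two `Parse()` calls; expat buffers the incomplete token until the rest arrives.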

    • Alan Kennedy

      #3
      Re: Parsing XML streams

      Peter Scott wrote:
      > I'm writing another program that should parse that sort of XML on its
      > stdin, printing out a more user-friendly representation. For this, I
      > need to parse the XML as it comes in, not all at once.

      Peter,

      Check out the IncrementalParser class in the library module

      Lib/xml/sax/xmlreader.py

      This extension of the standard XMLReader class acts just like a SAX
      parser, in that it delivers SAX2 events to your ContentHandler as it
      processes the tokens from the source XML document.

      But rather than the parser itself controlling when and how it gets its
      input, you control that through the use of the .feed() method. So you
      can "drip feed" the parser with input if you wish.
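
As a minimal sketch of that idea (Python 3; the `EventRecorder` class and the `feed_in_chunks` helper are names invented for this example, not part of the library), the caller decides when each piece of input reaches the parser:

```python
import xml.sax
from xml.sax.handler import ContentHandler


class EventRecorder(ContentHandler):
    """Hypothetical handler that only records the events it receives."""

    def __init__(self):
        super().__init__()
        self.events = []

    def startElement(self, name, attrs):
        self.events.append(("start", name))

    def endElement(self, name):
        self.events.append(("end", name))


def feed_in_chunks(chunks):
    """Push each chunk into an incremental SAX parser and return the events."""
    recorder = EventRecorder()
    parser = xml.sax.make_parser()  # the default expat reader supports feed()
    parser.setContentHandler(recorder)
    for chunk in chunks:
        parser.feed(chunk)  # we control when the parser sees input
    parser.close()          # tell the parser there is no more input
    return recorder.events
```

Calling `feed_in_chunks(["<log><mess", "age user='x'>hi</mess", "age></log>"])` returns the start and end events for `log` and `message` in document order. Any iterable of strings will do as the chunk source, so passing `sys.stdin` should feed the parser line by line.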

      Not all XML parsers support an IncrementalParser interface. In order
      for an XML parser to support incremental parsing, it must have been
      coded specifically to do so. Fortunately, the expat wrapper supplied
      with the base distribution of python does support incremental parsing.

      Which I think should solve your problem quite nicely. When you start
      up your process for the first time, feed() the IncrementalParser the
      opening tag of a document element (every XML document must have one
      and only one document element). Then simply feed the output of your
      logging stream directly to the IncrementalParser, as and when you
      receive it.

      You should not have any problems with XML tokens being split over two
      different .feed() calls either. For example, this should work just
      fine

      import xml.sax
      ip = xml.sax.make_parser()  # default expat reader; supports feed()
      ip.feed('<docu')
      ip.feed('ment')
      ip.feed('/>')

      When your logging stream is closing, simply feed a close tag for your
      document element to your IncrementalParser, and everything will clean
      up nicely.

      Here is some sample code:

      #-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
      import xml.sax
      from xml.sax.handler import ContentHandler

      logentry = """
      <channel name='#sandbox'>
      <message user='PeterScott'>Hello, my bot</message>
      <message user='PeterScott'>This is a message</message>
      <nickchange>
      <oldnick>PeterScott</oldnick>
      <newnick>PeterSc</newnick>
      </nickchange>
      </channel>
      """

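      # make_parser() expects a list of parser module names
      incr_parser = xml.sax.make_parser(['xml.sax.expatreader'])
      incr_parser.setContentHandler(ContentHandler())
      incr_parser.feed('<mylogstream>')
      incr_parser.feed(logentry)
      incr_parser.feed('</mylogstream>')
      incr_parser.close()  # signal end of input so the parser can finish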
      #-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
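
The plain ContentHandler() used above silently discards every event. To get the more user-friendly output Peter asked for, a subclass along these lines might be a starting point. This is only a sketch in Python 3: the `LogPrinter` name, the `emit` parameter and the output wording are assumptions made for illustration.

```python
import xml.sax
from xml.sax.handler import ContentHandler


class LogPrinter(ContentHandler):
    """Hypothetical handler: turns <message> and <nickchange> into text lines."""

    def __init__(self, emit=print):
        super().__init__()
        self.emit = emit      # callable that receives each formatted line
        self.text = []        # character data of the element being read
        self.user = None
        self.oldnick = None

    def startElement(self, name, attrs):
        self.text = []
        if name == "channel":
            self.emit("Joined " + attrs.get("name", "?"))
        elif name == "message":
            self.user = attrs.get("user", "?")

    def characters(self, content):
        # May be called several times for one text node, so accumulate.
        self.text.append(content)

    def endElement(self, name):
        value = "".join(self.text).strip()
        self.text = []
        if name == "message":
            self.emit("<%s> %s" % (self.user, value))
        elif name == "oldnick":
            self.oldnick = value
        elif name == "newnick":
            self.emit("%s is now known as %s" % (self.oldnick, value))


lines = []
parser = xml.sax.make_parser()
parser.setContentHandler(LogPrinter(emit=lines.append))
parser.feed("<channel name='#sandbox'>")
parser.feed("<message user='PeterScott'>Hello, my bot</message>")
parser.feed("<nickchange><oldnick>PeterScott</oldnick>")
parser.feed("<newnick>PeterSc</newnick></nickchange>")
parser.feed("</channel>")
parser.close()
```

Here the lines are collected in a list so they can be inspected; leaving `emit` at its default prints each line as soon as the corresponding end tag has been parsed.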

      regards,

      --
      alan kennedy
      -----------------------------------------------------
      check http headers here: http://xhaus.com/headers
      email alan: http://xhaus.com/mailto/alan
