Handling XML embedded within XML

  • Avowkind

    I have a log file that contains a dump of an XML message:

    .... rubbish
    ///asd laksj aslf
    <nif_DEBUG time="Fri, 16 May 2008 13:40:17, 330">
    <?xml version="1.0" encoding="UTF-8"?>
    <ns>
    <PDQ Lang="fr-FR" ID="XM;1928">co ntent</PDQ>
    </ns>
    </nif_DEBUG>
    ... more junk
    .... then more xml

    This example is, of course, simplified.

    I want to write a streaming filter that throws out all the junk
    and returns each complete XML message as a string. Ideally I would
    also like to select only the messages I am interested in.

    For example, the output from the above would be:
    <?xml version="1.0" encoding="UTF-8"?>
    <ns>
    <PDQ Lang="fr-FR" ID="XM;1928">co ntent</PDQ>
    </ns>

    Two problems:
    1. Clearing away junk that is nothing like XML.
    2. Handling the <?xml declaration that lies inside the other XML
    tags.

    The first I can handle relatively simply by reading through the string
    until I get what looks like a valid XML tag. I can then pass the rest
    on to an XML parser like xml.sax. However, the parser then raises:
    XMLSyntaxError: XML declaration allowed only at the start of the
    document
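
The declaration error appears because the parser is handed the wrapper element and the inner document together, so the `<?xml ...?>` declaration is no longer at the start. A minimal sketch of one workaround, assuming every message sits between a `<nif_DEBUG ...>` tag and `</nif_DEBUG>` as in the sample above: cut the inner document out with a regular expression and parse it on its own. `WRAPPER` and `extract_messages` are hypothetical names, and the standard-library `xml.etree.ElementTree` stands in here for whichever parser is actually in use.

```python
import re
import xml.etree.ElementTree as ET

# The sample from the post, reproduced so the sketch is self-contained.
LOG = """.... rubbish
///asd laksj aslf
<nif_DEBUG time="Fri, 16 May 2008 13:40:17, 330">
<?xml version="1.0" encoding="UTF-8"?>
<ns>
<PDQ Lang="fr-FR" ID="XM;1928">co ntent</PDQ>
</ns>
</nif_DEBUG>
... more junk
"""

# Assumption: messages are always wrapped in <nif_DEBUG ...> ... </nif_DEBUG>.
# The \s* on either side trims whitespace so the declaration starts the match.
WRAPPER = re.compile(r"<nif_DEBUG\b[^>]*>\s*(.*?)\s*</nif_DEBUG>", re.DOTALL)


def extract_messages(text):
    """Return the text enclosed by each nif_DEBUG wrapper, junk discarded."""
    return [match.group(1) for match in WRAPPER.finditer(text)]


messages = extract_messages(LOG)

# Each extracted message is a complete document, so its declaration is
# now at the start and a strict parser accepts it.
root = ET.fromstring(messages[0].encode("utf-8"))
pdq = root.find("PDQ")
print(root.tag, pdq.get("Lang"), pdq.text)
```

This works on a whole string only; it does not address input that arrives in pieces.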

    I would like a more forgiving parser that reports bad XML through a
    callback, so that I can just tell it to carry on.
    Bear in mind also that I probably will not have the end of the stream
    when I start processing.
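
Since the end of the stream is not available up front, the filter has to work on chunks. A rough sketch of such a generator follows, under the same assumption that each message starts at `<?xml` and ends just before `</nif_DEBUG>`. The names `iter_xml_messages` and `wanted` are invented for illustration; `wanted` is an optional predicate used to select messages of interest. This is plain string scanning, not a forgiving parser.

```python
def iter_xml_messages(chunks, wanted=None):
    """Yield each complete embedded XML message from an iterable of str chunks."""
    start_mark = "<?xml"
    end_mark = "</nif_DEBUG>"  # assumption: the wrapper seen in the sample log
    buf = ""
    for chunk in chunks:
        buf += chunk
        while True:
            start = buf.find(start_mark)
            if start == -1:
                # Only junk so far. Keep a short tail in case the start
                # marker is split across two chunks.
                buf = buf[-(len(start_mark) - 1):]
                break
            end = buf.find(end_mark, start)
            if end == -1:
                # Message has begun but is incomplete: drop the junk
                # before it and wait for more input.
                buf = buf[start:]
                break
            message = buf[start:end].strip()
            buf = buf[end + len(end_mark):]
            if wanted is None or wanted(message):
                yield message


# The sample from the post, fed in 7-character chunks to mimic a stream.
LOG = """.... rubbish
///asd laksj aslf
<nif_DEBUG time="Fri, 16 May 2008 13:40:17, 330">
<?xml version="1.0" encoding="UTF-8"?>
<ns>
<PDQ Lang="fr-FR" ID="XM;1928">co ntent</PDQ>
</ns>
</nif_DEBUG>
... more junk
"""
chunks = [LOG[i:i + 7] for i in range(0, len(LOG), 7)]

all_messages = list(iter_xml_messages(chunks))
french_only = list(iter_xml_messages(chunks, wanted=lambda m: 'Lang="fr-FR"' in m))
german_only = list(iter_xml_messages(chunks, wanted=lambda m: 'Lang="de-DE"' in m))
```

Each yielded string can then be handed to a strict parser on its own. The buffer grows only while a message is open, so junk between messages is discarded as it arrives.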

    All suggestions and pointers welcome
    Andrew

