lxml question

  • Uwe Schmitt

    lxml question

    Hi,

    I have to parse some text which pretends to be XML. lxml does not want
    to parse it, because it lacks a root element.
    I think that this situation is not unusual, so: is there a way to
    force lxml to parse it?

    My workaround is wrapping the text with "<root>...</root>" before
    feeding it to lxml's parser.

    Greetings, Uwe
  • Mark Thomas

    #2
    Re: lxml question

    On Sep 26, 11:19 am, Uwe Schmitt <rocksportroc...@googlemail.com>
    wrote:
    > I have to parse some text which pretends to be XML. lxml does not want
    > to parse it, because it lacks a root element.
    > I think that this situation is not unusual, so: is there a way to
    > force lxml to parse it?
    By "pretends to be XML" you mean XML-like but not really XML?
    > My workaround is wrapping the text with "<root>...</root>" before
    > feeding it to lxml's parser.
    That's actually not a bad solution, if you know that the document is
    otherwise well-formed. Another thing you can do is use libxml2's
    "recover" mode, which accommodates non-well-formed XML.

    from lxml import etree

    parser = etree.XMLParser(recover=True)
    tree = etree.XML(your_xml_string, parser)

    You'll still need to use your wrapper root element, because recover
    mode will ignore everything after the first root closes (and it won't
    throw an error).

    -- Mark.
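
The two suggestions above (a wrapper root plus recover mode) can be combined. The following is only a minimal sketch, assuming the input is a plain str without an XML declaration; parse_rootless and root_tag are hypothetical names chosen for illustration, not part of lxml.

```python
from lxml import etree


def parse_rootless(text, root_tag="root"):
    """Wrap rootless XML-like text in a synthetic root and parse it.

    parse_rootless and root_tag are illustrative names, not lxml API.
    Assumes `text` is a str with no XML declaration of its own.
    """
    # recover=True asks libxml2 to keep going past well-formedness errors.
    parser = etree.XMLParser(recover=True)
    wrapped = "<{0}>{1}</{0}>".format(root_tag, text)
    return etree.fromstring(wrapped, parser)


root = parse_rootless("<a>1</a><b>2</b>")
print([child.tag for child in root])  # ['a', 'b']
```

Iterating over the returned element yields the original top-level elements, so the synthetic root can be ignored once parsing is done.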


    • alex23

      #3
      Re: lxml question

      On Sep 27, 1:19 am, Uwe Schmitt <rocksportroc...@googlemail.com>
      wrote:
      > I have to parse some text which pretends to be XML. lxml does not want
      > to parse it, because it lacks a root element.
      Another option is BeautifulSoup, which handles badly formed XML really
      well:



      • Stefan Behnel

        #4
        Re: lxml question

        Uwe Schmitt wrote:
        > I have to parse some text which pretends to be XML. lxml does not want
        > to parse it, because it lacks a root element.
        > I think that this situation is not unusual, so: is there a way to
        > force lxml to parse it?
        >
        > My workaround is wrapping the text with "<root>...</root>" before
        > feeding it to lxml's parser.
        Yes, you can do that. To avoid creating an intermediate string, you can use
        the feed parser and do something like this:

        from lxml import etree

        parser = etree.XMLParser()
        parser.feed("<root>")
        parser.feed(your_xml_tag_sequence_data)
        parser.feed("</root>")
        root = parser.close()

        Stefan
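
The feed-parser idea above extends naturally to data that arrives in several pieces. This is a rough sketch under a few assumptions: parse_fragments and root_tag are hypothetical names, the chunks are str values without an XML declaration, and the fallback import relies on the standard library's xml.etree.ElementTree.XMLParser offering the same feed()/close() methods when lxml is not installed.

```python
try:
    from lxml import etree
except ImportError:
    # Assumption: the stdlib parser is an acceptable stand-in here,
    # since it also provides feed() and close().
    import xml.etree.ElementTree as etree


def parse_fragments(chunks, root_tag="root"):
    """Feed rootless XML chunks to the parser inside a synthetic root.

    parse_fragments and root_tag are illustrative names only.
    """
    parser = etree.XMLParser()
    parser.feed("<%s>" % root_tag)
    for chunk in chunks:
        # Each chunk goes straight to the parser; no joined copy is built.
        parser.feed(chunk)
    parser.feed("</%s>" % root_tag)
    return parser.close()  # returns the synthetic root element


root = parse_fragments(["<item>1</item>", "<item>2</item>"])
print([child.text for child in root])  # ['1', '2']
```

Because each chunk is fed as it comes, the chunks could equally be lines read from a file or blocks read from a socket.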
