encoding in lxml

  • jasiu85

    encoding in lxml

    Hey,

    I have a problem with character encoding in LXML. Here's how it goes:

    I read an HTML document from a third-party site. It is supposed to be
    in UTF-8, but unfortunately from time to time it's not. I parse the
    document like this:

    html_doc = HTML(string_with_document)

    Then I retrieve some info from the document with XPath:

    xpath_nodes = html_doc.xpath('/html/body/something')

    Now I'm guaranteed that the xpath_nodes list contains only one
    element. So I read its content:

    xpath_nodes[0].text

    And I get an exception here. The exception is coming from the text
    property of an Element object. The problem is that the text contains a
    non-UTF-8 character. LXML seems to be using strict decoding and I can't
    find a way to make it ignore the error. Is there anything I can do to
    retrieve the text without getting an exception?

    Regards,

    Mike
  • pjacobi.de@googlemail.com

    #2
    Re: encoding in lxml

    Hi Mike,
    > I read an HTML document from a third-party site. It is supposed to be
    > in UTF-8, but unfortunately from time to time it's not.
    There will be a host of more lightweight solutions, but you can opt
    to sanitize incoming HTML with HTML Tidy (Python binding available).

    It will replace invalid UTF-8 bytes with U+FFFD. It will not
    guess a better encoding to use.

    If you are sure you don't have HTML sloppiness to correct but only the
    occasional wrong byte, even decoding (with fallback) and encoding using
    the standard codec package will do.
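
A minimal sketch of that decode-and-re-encode idea, using only Python's built-in codecs. The function name `sanitize_utf8` and the sample bytes are made up for illustration; the `errors="replace"` handler substitutes U+FFFD for each undecodable sequence, which is similar in effect to what is described above for HTML Tidy.

```python
def sanitize_utf8(raw: bytes) -> bytes:
    """Return valid UTF-8 bytes, replacing undecodable sequences with U+FFFD.

    Hypothetical helper for illustration; not part of lxml.
    """
    # Decode leniently: invalid byte sequences become U+FFFD instead of raising.
    text = raw.decode("utf-8", errors="replace")
    # Re-encode so the result is guaranteed to be well-formed UTF-8.
    return text.encode("utf-8")


# Sample input: 0xE9 is Latin-1 "e acute", which is not valid UTF-8 on its own.
broken = b"<p>caf\xe9 ok</p>"
clean = sanitize_utf8(broken)
```

The cleaned bytes (or the decoded string) can then be handed to the parser in place of the raw download. Note that this only masks the bad bytes; it does not recover the character that was originally intended.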

    Regards,
    Peter


    • Stefan Behnel

      #3
      Re: encoding in lxml

      jasiu85 wrote:
      > I have a problem with character encoding in LXML. Here's how it goes:
      >
      > I read an HTML document from a third-party site. It is supposed to be
      > in UTF-8, but unfortunately from time to time it's not.
      You can instantiate your own HTML parser and pass encoding="utf-8". That way,
      when it's not UTF-8, you will get an exception at parse time, which allows you
      to reparse the document with another encoding (say, ISO-8859-1) to get the
      correct content.
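
The same try-strict-then-fall-back idea can be sketched at the byte-decoding level, before the document ever reaches the parser. This is a swapped-in variant that does not use the parser's `encoding` argument; the helper name `decode_with_fallback` and the candidate list are assumptions for illustration.

```python
def decode_with_fallback(raw, encodings=("utf-8", "iso-8859-1")):
    """Decode raw bytes with the first candidate encoding that succeeds.

    Hypothetical helper: returns (text, encoding_used).
    """
    for enc in encodings:
        try:
            # Strict decoding: raises UnicodeDecodeError on any invalid byte.
            return raw.decode(enc), enc
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate encodings could decode the input")


# Valid UTF-8 is accepted by the first candidate.
text_utf8, used_utf8 = decode_with_fallback(b"caf\xc3\xa9")
# A lone 0xE9 is invalid UTF-8, so the ISO-8859-1 fallback is used.
text_latin, used_latin = decode_with_fallback(b"caf\xe9")
```

Because ISO-8859-1 assigns a character to every byte value, putting it last means the loop always returns something; whether that something is the right text still depends on the guess being correct.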

      Stefan
