Python HTML parser chokes on UTF-8 input

  • Johannes Bauer

    Python HTML parser chokes on UTF-8 input

    Hello group,

    I'm trying to use a class derived from htmllib.HTMLParser to parse a website
    which I fetched via
    httplib.HTTPConnection().request().getresponse().read(). Now the problem
    is: As soon as I pass the htmllib.HTMLParser UTF-8 code, it chokes. The
    code is something like this:

    prs = self.parserclass(formatter.NullFormatter())
    prs.init()
    prs.feed(website)
    self.__result = prs.get()
    prs.close()

    Now when I take "website" directly from the parser, everything is fine.
    However I want to do some modifications before I parse it, namely UTF-8
    modifications in the style:

    website = website.replace(u"föö", u"bär")

    Therefore, after fetching the web site content, I have to convert it to
    UTF-8 first, modify it and convert it back:

    website = website.decode("latin1")
    website = website.replace(u"föö", u"bär")
    website = website.encode("latin1")

    This is incredibly ugly IMHO, as I would really like the parser to just
    accept UTF-8 input. However, when I omit the re-encoding to latin1:

    File "CachedWebParser.py", line 13, in __init__
      self.__process(website)
    File "CachedWebParser.py", line 55, in __process
      prs.feed(website)
    File "/usr/lib64/python2.5/sgmllib.py", line 99, in feed
      self.goahead(0)
    File "/usr/lib64/python2.5/sgmllib.py", line 133, in goahead
      k = self.parse_starttag(i)
    File "/usr/lib64/python2.5/sgmllib.py", line 285, in parse_starttag
      self._convert_ref, attrvalue)
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xfc in position 0: ordinal not in range(128)

    Annoying, IMHO, that the built-in HTML parser cannot cope with UTF-8
    input, which should (again, IMHO) be the absolute standard for such a
    new language.

    Can I do something about it?

    Regards,
    Johannes

    --
    "My countersuit against you will then allege deliberate mendacity,
    slander of God, the Bible and me, and deliberate blasphemy."
    -- Prophet and visionary Hans Joss aka HJP in de.sci.physik
    <48d8bf1d$0$7510$5402220f@news.sunrise.ch>
  • Terry Reedy

    #2
    Re: Python HTML parser chokes on UTF-8 input

    Johannes Bauer wrote:
    > Hello group,
    >
    > I'm trying to use a class derived from htmllib.HTMLParser to parse a website
    > which I fetched via
    > httplib.HTTPConnection().request().getresponse().read(). Now the problem
    > is: As soon as I pass the htmllib.HTMLParser UTF-8 code, it chokes. The
    > code is something like this:
    I believe you are confusing unicode with unicode encoded into bytes with
    the UTF-8 encoding. You are having a problem feeding a unicode string, not
    'UTF-8 code', which in Python can only mean a UTF-8 encoded byte string.
    >
    > prs = self.parserclass(formatter.NullFormatter())
    > prs.init()
    > prs.feed(website)
    > self.__result = prs.get()
    > prs.close()
    >
    > Now when I take "website" directly from the parser, everything is fine.
    > However I want to do some modifications before I parse it, namely UTF-8
    > modifications in the style:
    >
    > website = website.replace(u"föö", u"bär")
    >
    > Therefore, after fetching the web site content, I have to convert it to
    > UTF-8 first, modify it and convert it back:
    >
    > website = website.decode("latin1") # produces unicode
    > website = website.replace(u"föö", u"bär") # remains unicode
    > website = website.encode("latin1") # produces byte string in the latin-1 encoding
    >
    > This is incredibly ugly IMHO, as I would really like the parser to just
    > accept UTF-8 input.
    To me, code that works is prettier than code that does not.

    In 3.0, text strings are unicode, and I believe that is what the parser
    now accepts.
    > However, when I omit the re-encoding to latin1:
    >
    > File "CachedWebParser.py", line 13, in __init__
    >   self.__process(website)
    > File "CachedWebParser.py", line 55, in __process
    >   prs.feed(website)
    > File "/usr/lib64/python2.5/sgmllib.py", line 99, in feed
    >   self.goahead(0)
    > File "/usr/lib64/python2.5/sgmllib.py", line 133, in goahead
    >   k = self.parse_starttag(i)
    > File "/usr/lib64/python2.5/sgmllib.py", line 285, in parse_starttag
    >   self._convert_ref, attrvalue)
    > UnicodeDecodeError: 'ascii' codec can't decode byte 0xfc in position 0: ordinal not in range(128)
    When you do not bother to specify some other encoding in an encoding
    operation, sgmllib or something deeper in Python falls back to the default
    encoding (ASCII in Python 2.x), which cannot decode the byte 0xfc. Stop
    being annoyed and tell the interpreter what you want. It is not a
    mind-reader.
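
For reference, the error message in the traceback can be reproduced in isolation. This is only a minimal sketch, assuming Python 3, where the ASCII decode that Python 2 performs implicitly when byte strings and unicode strings are mixed has to be written out by hand. The variable names are illustrative and not taken from the original code.

```python
# 0xfc is the Latin-1 encoding of 'ü', the byte named in the traceback.
raw = b"\xfc"

try:
    # Stand-in for the implicit default-encoding decode done by Python 2.
    raw.decode("ascii")
    error_message = None
except UnicodeDecodeError as exc:
    error_message = str(exc)

# Naming the actual encoding of the data makes the decode succeed.
text = raw.decode("latin-1")

print(error_message)
print(text)
```

The point of the sketch is that the failure comes from the choice of codec, not from the parser logic itself.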
    > Annoying, IMHO, that the internal html Parser cannot cope with UTF-8
    > input - which should (again, IMHO) be the absolute standard for such a
    > new language.
    The first version of Python came out in 1989, I believe, years before
    unicode. One of the features of the new 3.0 version is that it uses
    unicode as the standard for text.
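
To illustrate what this looks like in Python 3, here is a minimal sketch. It assumes the standard-library `html.parser.HTMLParser` (the `htmllib` module used in the question does not exist in Python 3); `TextCollector` and the sample page are hypothetical and only serve the example.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Hypothetical helper that gathers the text nodes of a document."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # data arrives as str (unicode text), never as bytes
        self.chunks.append(data)


# Bytes as they might arrive from the network, here encoded as UTF-8.
raw = "<p>föö</p>".encode("utf-8")

# Decode once at the boundary, then hand text to the parser.
page = raw.decode("utf-8")

parser = TextCollector()
parser.feed(page)
parser.close()

collected = "".join(parser.chunks)
print(collected)
```

The parser is fed `str` throughout, so no encoding step is needed between modifying the text and parsing it.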

    Terry Jan Reedy


    • Johannes Bauer

      #3
      Re: Python HTML parser chokes on UTF-8 input

      Terry Reedy wrote:
      > Johannes Bauer wrote:
      >> Hello group,
      >>
      >> I'm trying to use a class derived from htmllib.HTMLParser to parse a website
      >> which I fetched via
      >> httplib.HTTPConnection().request().getresponse().read(). Now the problem
      >> is: As soon as I pass the htmllib.HTMLParser UTF-8 code, it chokes. The
      >> code is something like this:
      >
      > I believe you are confusing unicode with unicode encoded into bytes with
      > the UTF-8 encoding. You are having a problem feeding a unicode string, not
      > 'UTF-8 code', which in Python can only mean a UTF-8 encoded byte string.
      I also believe I am. Could you please elaborate further?

      Do I understand correctly when saying that type 'str' has no associated
      default encoding, but type 'unicode' does? Does this mean that really
      the only way of coping with that stuff is doing what I've been doing?
      >> This is incredibly ugly IMHO, as I would really like the parser to just
      >> accept UTF-8 input.
      >
      > To me, code that works is prettier than code that does not.
      >
      > In 3.0, text strings are unicode, and I believe that is what the parser
      > now accepts.
      Well, yes, I suppose working code is nicer than non-working code.
      However I am sure you will agree that explicit encoding conversions are
      cumbersome and error-prone.
      >> UnicodeDecodeError: 'ascii' codec can't decode byte 0xfc in position 0:
      >> ordinal not in range(128)
      >
      > When you do not bother to specify some other encoding in an encoding
      > operation, sgmllib or something deeper in Python tries the default
      > encoding, which does not work. Stop being annoyed and tell the
      > interpreter what you want. It is not a mind-reader.
      How do I tell the interpreter to parse the strings I pass to it as
      unicode? The way I did or is there some better way?
      >> Annoying, IMHO, that the internal html Parser cannot cope with UTF-8
      >> input - which should (again, IMHO) be the absolute standard for such a
      >> new language.
      >
      > The first version of Python came out in 1989, I believe, years before
      > unicode. One of the features of the new 3.0 version is that it uses
      > unicode as the standard for text.
      Hmmm. I suppose you're right there. Python 3.0 really sounds quite nice.
      Do you know approximately when it will be ready?

      Regards,
      Johannes


      • Terry Reedy

        #4
        Re: Python HTML parser chokes on UTF-8 input

        Johannes Bauer wrote:
        > Terry Reedy wrote:
        >> Johannes Bauer wrote:
        >>> Hello group,
        >>>
        >>> I'm trying to use a class derived from htmllib.HTMLParser to parse a website
        >>> which I fetched via
        >>> httplib.HTTPConnection().request().getresponse().read(). Now the problem
        >>> is: As soon as I pass the htmllib.HTMLParser UTF-8 code, it chokes. The
        >>> code is something like this:
        >> I believe you are confusing unicode with unicode encoded into bytes with
        >> the UTF-8 encoding. You are having a problem feeding a unicode string, not
        >> 'UTF-8 code', which in Python can only mean a UTF-8 encoded byte string.
        >
        > I also believe I am. Could you please elaborate further?
        I am a unicode neophyte. My source of info is the first 3 or so
        chapters of the unicode specification.

        I recommend that, or other sites, for further questions. It took me more
        than one reading of the same topics in different texts to pretty well
        'get it'.
        > Do I understand correctly when saying that type 'str' has no associated
        > default encoding, but type 'unicode' does?
        I am not sure what you mean. Unicode strings in Python are internally
        stored in UCS-2 or UCS-4 format.
        > Does this mean that really
        > the only way of coping with that stuff is doing what I've been doing?
        Having two text types in 2.x was necessary as a transition strategy but
        has also been something of a mess. You did it one way. Jerry gave you
        an alternative that I could not have explained. Your choice. Or use 3.0.

        ...
        > Hmmm. I suppose you're right there. Python 3.0 really sounds quite nice.
        > Do you know approximately when it will be ready?
        For my current purposes, it is ready enough. Developers *really* hope
        to get 3.0 final out by mid-December. The schedule was pushed back
        because a) the outside world has not completely and cleanly switched to
        unicode text and b) some people who just started with the release
        candidate have found import bugs that earlier testers did not. It still
        needs more testing from more different users (hint, hint).

        Terry Jan Reedy


        • Marc 'BlackJack' Rintsch

          #5
          Re: Python HTML parser chokes on UTF-8 input

          On Fri, 10 Oct 2008 00:13:36 +0200, Johannes Bauer wrote:
          > Terry Reedy wrote:
          >> I believe you are confusing unicode with unicode encoded into bytes
          >> with the UTF-8 encoding. You are having a problem feeding a unicode
          >> string, not 'UTF-8 code', which in Python can only mean a UTF-8
          >> encoded byte string.
          >
          > I also believe I am. Could you please elaborate further?
          >
          > Do I understand correctly when saying that type 'str' has no associated
          > default encoding, but type 'unicode' does?
          `str` doesn't know an encoding. The content could be any byte data
          anyway. And `unicode` doesn't know an encoding either, it is unicode
          characters. How they are represented internally is not the business of
          the programmer. If you want to operate on unicode characters you have to
          decode a byte string (`str`) with the appropriate encoding. If you want
          to feed `unicode` to something that expects bytes and not unicode
          characters you have to encode again.
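
The decode, modify, encode cycle described above can be wrapped once so it does not clutter the calling code. This is a minimal sketch in Python 3 syntax (where `bytes` plays the role of the Python 2 `str` and `str` the role of `unicode`); `replace_in_bytes` is a hypothetical helper, not part of any library.

```python
def replace_in_bytes(data, old, new, encoding):
    """Hypothetical helper: decode bytes, edit them as text, encode again."""
    text = data.decode(encoding)      # bytes -> unicode characters
    text = text.replace(old, new)     # operate on characters, not bytes
    return text.encode(encoding)      # unicode characters -> bytes


# A page that arrived as Latin-1 encoded bytes.
latin1_page = "<b>föö</b>".encode("latin-1")

result = replace_in_bytes(latin1_page, "föö", "bär", "latin-1")
print(result)
```

The caller still has to know the encoding of the data; the helper only keeps the two conversions in one place.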
          >>> This is incredibly ugly IMHO, as I would really like the parser to
          >>> just accept UTF-8 input.
          It accepts UTF-8 input but not `unicode` objects.
          > However I am sure you will agree that explicit encoding conversions are
          > cumbersome and error-prone.
          But implicit conversions are impossible because the interpreter doesn't
          know which encoding to use and refuses to guess. Implicit and guessed
          conversions are error prone too.
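
A small sketch of why a guessed encoding is risky, written in Python 3 syntax with illustrative variable names: the same bytes decode without any error under the wrong codec, they just produce the wrong characters.

```python
# 'föö' encoded as UTF-8: each 'ö' becomes the two bytes 0xc3 0xb6.
utf8_bytes = "föö".encode("utf-8")

right = utf8_bytes.decode("utf-8")    # the intended three characters
wrong = utf8_bytes.decode("latin-1")  # no exception, but five characters of mojibake

print(repr(right), len(right))
print(repr(wrong), len(wrong))
```

Latin-1 maps every possible byte to some character, so a wrong guess of Latin-1 can never be detected by catching an exception.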

          Ciao,
          Marc 'BlackJack' Rintsch


          • John Nagle

            #6
            Re: Python HTML parser chokes on UTF-8 input

            Johannes Bauer wrote:
            > Hello group,
            >
            > I'm trying to use a class derived from htmllib.HTMLParser to parse a website
            > which I fetched via
            > httplib.HTTPConnection().request().getresponse().read(). Now the problem
            > is: As soon as I pass the htmllib.HTMLParser UTF-8 code, it chokes. The
            > code is something like this:
            Try BeautifulSoup. It actually understands how to detect the encoding
            of an HTML file (there are three different ways that information can be
            expressed), and will shift modes accordingly.

            This is an ugly problem. Sometimes, it's necessary to parse part of
            the file, discover that the rest of the file has a non-ASCII encoding,
            and restart the parse from the beginning. BeautifulSoup has the
            machinery for that.
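
To give a rough idea of what such detection involves, here is a minimal standard-library sketch in Python 3. It is not BeautifulSoup's algorithm. It assumes the three sources in question are a byte order mark, the charset parameter of the HTTP Content-Type header, and a `<meta>` declaration inside the document; `sniff_encoding`, its parameters, and the Latin-1 fallback are all hypothetical choices made for the example.

```python
import codecs
import re

# Matches both <meta charset="..."> and the http-equiv form with "charset=..."
META_CHARSET = re.compile(rb'<meta[^>]+charset=["\']?([\w-]+)', re.IGNORECASE)
HEADER_CHARSET = re.compile(r'charset=["\']?([\w-]+)', re.IGNORECASE)


def sniff_encoding(body, content_type=None, default="latin-1"):
    """Hypothetical helper: guess the encoding of an HTML page held as bytes."""
    # 1. A UTF-8 byte order mark at the very start of the body.
    if body.startswith(codecs.BOM_UTF8):
        return "utf-8"
    # 2. The charset parameter of the HTTP Content-Type header, if supplied.
    if content_type:
        match = HEADER_CHARSET.search(content_type)
        if match:
            return match.group(1).lower()
    # 3. A <meta> declaration, searched for in the raw bytes near the top.
    match = META_CHARSET.search(body[:1024])
    if match:
        return match.group(1).decode("ascii").lower()
    # Nothing declared: fall back to an assumed default.
    return default


page = (b'<html><head><meta http-equiv="Content-Type" '
        b'content="text/html; charset=utf-8"></head>'
        b'<body>f\xc3\xb6\xc3\xb6</body></html>')

encoding = sniff_encoding(page)
text = page.decode(encoding)
print(encoding)
```

Scanning the raw bytes for the `<meta>` tag before decoding is a simplified stand-in for the restart described above: the declaration is read first, and the real parse only begins once the encoding is known.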

            John Nagle
