Hello group,
I'm trying to use a class derived from htmllib.HTMLParser to parse a website
which I fetched via
httplib.HTTPConnection().request().getresponse().read(). Now the problem
is: As soon as I pass the htmllib.HTMLParser UTF-8 code, it chokes. The
code is something like this:
prs = self.parserclass(formatter.NullFormatter())
prs.init()
prs.feed(website)
self.__result = prs.get()
prs.close()
Now when I take "website" directly from the parser, everything is fine.
However I want to do some modifications before I parse it, namely UTF-8
modifications in the style:
website = website.replace(u"föö", u"bär")
Therefore, after fetching the web site content, I have to convert it to
UTF-8 first, modify it and convert it back:
website = website.decode("latin1")
website = website.replace(u"föö", u"bär")
website = website.encode("latin1")
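The round trip above can be sketched as a small helper. This is only an illustration: the name `replace_in_page` is made up, and it assumes the fetched bytes really are Latin-1, which the snippet above implies but does not prove.

```python
def replace_in_page(raw, old, new, encoding="latin1"):
    """Decode raw bytes, replace text, and encode back to bytes.

    Hypothetical helper; `encoding` is assumed to match the page.
    """
    text = raw.decode(encoding)      # bytes -> unicode text
    text = text.replace(old, new)    # work on characters, not bytes
    return text.encode(encoding)     # unicode text -> bytes for the parser


# Build the sample page from escapes so the example does not depend on
# the encoding of this source file.
page = u"<p>f\xf6\xf6</p>".encode("latin1")
result = replace_in_page(page, u"f\xf6\xf6", u"b\xe4r")
```

The parser still receives a byte string in the original encoding, which is the case reported to work above.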
This is incredibly ugly IMHO, as I would really like the parser to just
accept UTF-8 input. However, when I omit the re-encoding to latin1, I get:
  File "CachedWebParser.py", line 13, in __init__
    self.__process(website)
  File "CachedWebParser.py", line 55, in __process
    prs.feed(website)
  File "/usr/lib64/python2.5/sgmllib.py", line 99, in feed
    self.goahead(0)
  File "/usr/lib64/python2.5/sgmllib.py", line 133, in goahead
    k = self.parse_starttag(i)
  File "/usr/lib64/python2.5/sgmllib.py", line 285, in parse_starttag
    self._convert_ref, attrvalue)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xfc in position 0: ordinal not in range(128)
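The byte named in the error can be looked at in isolation. A small sketch (the helper name `can_decode` is invented for illustration): 0xfc is not valid ASCII, but in Latin-1 it is the letter "ü". This suggests, without proving it, that somewhere inside the parser a byte string and a unicode string are being mixed, which makes Python 2 fall back to the ASCII codec.

```python
def can_decode(raw, encoding):
    """Return True if the raw bytes decode cleanly with the given codec."""
    try:
        raw.decode(encoding)
    except UnicodeDecodeError:
        return False
    return True


raw = u"\xfc".encode("latin1")          # the single byte 0xfc
ascii_ok = can_decode(raw, "ascii")     # 0xfc is outside range(128)
latin1_ok = can_decode(raw, "latin1")   # Latin-1 maps every byte to a character
```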
Annoying, IMHO, that the built-in HTML parser cannot cope with UTF-8
input - which should (again, IMHO) be the absolute standard for such a
new language.
Can I do something about it?
Regards,
Johannes
--
"Meine Gegenklage gegen dich lautet dann auf bewusste Verlogenheit,
verlästerung von Gott, Bibel und mir und bewusster Blasphemie."
-- Prophet und Visionär Hans Joss aka HJP in de.sci.physik
<48d8bf1d$0$7510$5402220f@news.sunrise.ch>