Hello fellow pythonists,
I'm a relatively new python developer, and I try to adjust my
understanding about "how things works" to python, but I have hit a
block, that I cannot understand.
I needed to output unicode datas back from a web service, and could
not get back unicode/multibyte text before applying an hack that I
don't understand (thank you google)
I have realized an wxPython simple application, that takes the input
of a user, send it to a web service, and get back translations in
several languages.
The service itself is fully UTF-8.
The "source" string is first encoded to "latin1" after a passage into
unicode.normali ze(), as urllib.quote() cannot work on unicode
After that, an urllib request is sent with this encoded string to the
web service
First problem, how to determine the encoding of the return ?
If I inspect a request from firefox, I see that the server return
header specify UTF-8
But if I use this code:
I end up with an UnicodeDecodeEr ror. I tried various line.decode(),
line.normalize and such, but could not make this error disapear.
I, until now, avoided that problem as the service always seems to
return 1 line, but I am wondering.
Second problem, if I try an
into the loop, I too get the same error. I though that unicode() would
force python to consider the given text as unicode, not to try to
convert it to unicode.
Here again, trying several normalize/decode combination did not helped
at all.
Then, looking for help through google, I have found this post:
and I gave it a try. What I did, though, was not to override
sys.stdout, but to declare a new writer stream as a property of my
main class:
But what is strange, is that since I did that, even without using this
self.out writer, the unicode translation are working as I was
expecting them to. Except on the for loop, where a concatenation still
triggers the UnicodeDecodeEr ro exception.
I know the "explicit is better than implicit" python motto, and I
really like it.
But here, I don't understand what is going on.
Does the fact that defining that writer object does a initialization
of the standard sys.stdout object ?
Does it is related to an internal usage of it, maybe in urllib ?
I tried to find more on the subject, but felt short.
Can someone explain to me what is happening ?
The full script source can be found at http://www.webalis.com/translator/translator.pyw
I'm a relatively new python developer, and I try to adjust my
understanding about "how things works" to python, but I have hit a
block, that I cannot understand.
I needed to output unicode datas back from a web service, and could
not get back unicode/multibyte text before applying an hack that I
don't understand (thank you google)
I have realized an wxPython simple application, that takes the input
of a user, send it to a web service, and get back translations in
several languages.
The service itself is fully UTF-8.
The "source" string is first encoded to "latin1" after a passage into
unicode.normali ze(), as urllib.quote() cannot work on unicode
>>srcText=unico dedata.normaliz e('NFKD',srcTex t).encode('lati n1','ignore')
web service
>>con=urllib2.R equest(self.url , headers={'User-Agent':'Mozilla/5.0 (X11; U; Linux i686) Gecko/20071127 Firefox/2.0.0.11'}, origin_req_host ='http://translate.googl e.com')
>>req=urllib2.u rlopen(con)
If I inspect a request from firefox, I see that the server return
header specify UTF-8
But if I use this code:
>>ret=U''
>>for line in req:
> ret=ret+string. replace(line.st rip(),'\n',chr( 10))
>>for line in req:
> ret=ret+string. replace(line.st rip(),'\n',chr( 10))
line.normalize and such, but could not make this error disapear.
I, until now, avoided that problem as the service always seems to
return 1 line, but I am wondering.
Second problem, if I try an
>>print line
force python to consider the given text as unicode, not to try to
convert it to unicode.
Here again, trying several normalize/decode combination did not helped
at all.
Then, looking for help through google, I have found this post:
and I gave it a try. What I did, though, was not to override
sys.stdout, but to declare a new writer stream as a property of my
main class:
>>self.out=OutS treamEncoder(sy s.stdout, 'utf-8')
self.out writer, the unicode translation are working as I was
expecting them to. Except on the for loop, where a concatenation still
triggers the UnicodeDecodeEr ro exception.
I know the "explicit is better than implicit" python motto, and I
really like it.
But here, I don't understand what is going on.
Does the fact that defining that writer object does a initialization
of the standard sys.stdout object ?
Does it is related to an internal usage of it, maybe in urllib ?
I tried to find more on the subject, but felt short.
Can someone explain to me what is happening ?
The full script source can be found at http://www.webalis.com/translator/translator.pyw
Comment