utf8 encoding problem

  • Wichert Akkerman

    utf8 encoding problem

    I'm struggling with what should be a trivial problem but I can't seem to
    come up with a proper solution: I am working on a CGI that takes utf-8
    input from a browser. The input is nicely encoded so you get something
    like this:

    firstname=t%C3%A9s

    where %C3%A9 is a single character in utf-8 encoding. Passing this
    through urllib.unquote does not help:

    >>> urllib.unquote(u't%C3%A9st')
    u't%C3%A9st'

    The problem turned out to be that urllib.unquote() processes its
    input character by character, which breaks when it calls chr() on
    an escaped byte: the result is neither valid ASCII (it is outside
    the legal range) nor valid Unicode (it's only half a utf-8
    character), and as a result it fails:

    >>> chr(195) + u""
    Traceback (most recent call last):
      File "<stdin>", line 1, in ?
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)
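
The failure above can be reproduced in isolation. The following is only a small sketch: `can_decode` is a hypothetical helper, not part of urllib. It checks that the lone byte 0xC3 is neither ASCII nor a complete UTF-8 sequence, while the pair 0xC3 0xA9 decodes fine.

```python
def can_decode(data, codec):
    """Return True if the byte string decodes cleanly with the given codec."""
    try:
        data.decode(codec)
        return True
    except UnicodeDecodeError:
        return False

lead_byte = b'\xc3'       # first half of the two-byte UTF-8 sequence
full_char = b'\xc3\xa9'   # the complete sequence for U+00E9

print(can_decode(lead_byte, 'ascii'))   # 0xC3 is above 127
print(can_decode(lead_byte, 'utf-8'))   # truncated multi-byte sequence
print(can_decode(full_char, 'utf-8'))   # both bytes together form one character
```

So any approach that converts to Unicode one byte at a time will trip over multi-byte characters; the bytes have to be collected first and decoded together.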


    I can't seem to find a working method to do this conversion correctly.
    Can someone point me in the right direction? (and please cc me on
    replies since I'm not currently subscribed to this list/newsgroup).

    Wichert.

    --
    Wichert Akkerman <wichert@wiggy.net> It is simple to make things.
    http://www.wiggy.net/ It is hard to make things simple.


  • Erik Max Francis

    #2
    Re: utf8 encoding problem

    Wichert Akkerman wrote:

    > I'm struggling with what should be a trivial problem but I can't seem
    > to come up with a proper solution: I am working on a CGI that takes
    > utf-8 input from a browser. The input is nicely encoded so you get
    > something like this:
    >
    > firstname=t%C3%A9s
    >
    > where %C3%A9 is a single character in utf-8 encoding. Passing this
    > through urllib.unquote does not help:
    >
    > >>> urllib.unquote(u't%C3%A9st')
    > u't%C3%A9st'

    Unquote it as a normal string, then convert it to Unicode.

    >>> import urllib
    >>> x = 't%C3%A9s'
    >>> y = urllib.unquote(x)
    >>> y
    't\xc3\xa9s'
    >>> z = unicode(y, 'utf-8')
    >>> z
    u't\xe9s'
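
The same two steps (unquote to raw bytes, then decode as UTF-8) can be wrapped in one function. This is a rough sketch rather than a definitive recipe: `decode_form_value` is a made-up name, the input is assumed to contain only ASCII plus %XX escapes, and the import fallback is there so it runs on Python 3 (where `urllib.parse.unquote_to_bytes` does the byte-level unquoting) as well as on Python 2.

```python
try:
    from urllib.parse import unquote_to_bytes          # Python 3
except ImportError:
    from urllib import unquote as unquote_to_bytes     # Python 2: str in, str out

def decode_form_value(quoted):
    """Percent-decode a form value to raw bytes, then decode them as UTF-8."""
    # Work on a byte string so that each %XX escape becomes exactly one byte
    # (assumes the quoted value itself is pure ASCII).
    if not isinstance(quoted, bytes):
        quoted = quoted.encode('ascii')
    raw = unquote_to_bytes(quoted)
    # Decode only after all escapes are resolved, so multi-byte
    # characters such as %C3%A9 are seen as a whole.
    return raw.decode('utf-8')

print(repr(decode_form_value('t%C3%A9s')))
```

Passing a unicode string, as in the original question, also works here because the value is encoded to a byte string before unquoting.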

    --
    __ Erik Max Francis && max@alcyone.com && http://www.alcyone.com/max/
    / \ San Jose, CA, USA && 37 20 N 121 53 W && &tSftDotIotE
    \__/ I do not promise to consider race or religion in my appointments.
    I promise only that I will not consider them. -- John F. Kennedy
