unicode wrap unicode object?

**Fredrik Lundh** · Apr 8 '06, 06:35 AM

Re: unicode wrap unicode object?

"ygao" <ygao2004@gmail.com> wrote:
> >>> import sys
> >>> sys.setdefaultencoding("utf-8")

hmm. what kind of bootleg python is that ?

>>> import sys
>>> sys.setdefaultencoding("utf-8")
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
AttributeError: 'module' object has no attribute 'setdefaultencoding'

(you're not supposed to change the default encoding. don't
do that; it'll only cause problems in the long run).

> >>> s='\xe9\xab\x98' #this utf-8 string
> >>> ss=U'\xe9\xab\x98'
> >>> s
> '\xe9\xab\x98'
> >>> ss
> u'\xe9\xab\x98'
> >>>
> how do I get ss from s?
> Is there a way to do this?

you have UTF-8 *bytes* in a Unicode text string? sounds like
someone's made a mistake earlier on...

anyway, iso-8859-1 is, in practice, a null transform, that simply
converts unicode characters to bytes:

>>> s = ss.encode("iso-8859-1")
>>> s
'\xe9\xab\x98'
>>> s.decode("utf-8")
u'\u9ad8'
>>> import unicodedata
>>> unicodedata.name(s.decode("utf-8"))
'CJK UNIFIED IDEOGRAPH-9AD8'

but it's probably better to fix the code that puts UTF-8 data in your
Unicode strings (look for bogus iso-8859-1 conversions)
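For readers on Python 3, the same repair can be sketched as below. This is only an illustration, not part of the original session: the thread uses Python 2, and the variable names `mojibake`, `raw` and `text` are made up here.

```python
# Python 3 sketch: a text string that wrongly holds UTF-8 byte values,
# one byte value per code point (the situation described in the thread).
mojibake = "\xe9\xab\x98"

# ISO-8859-1 maps code points 0-255 to the identical byte values,
# so encoding with it recovers the original bytes unchanged.
raw = mojibake.encode("iso-8859-1")

# Now decode those bytes with the encoding they were really written in.
text = raw.decode("utf-8")

print(repr(raw))
print(ascii(text))
```

The round trip only works while every code point in the string is below 256; anything else means the data was damaged in some other way.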

</F>

**ygao** · Apr 8 '06, 08:35 AM

Re: unicode wrap unicode object?

Sorry for my poor English.
I got a solution from others.
I must use UTF-8 for Chinese.

>>> import sys
>>> reload(sys)
>>> sys.setdefaultencoding("utf-8")
>>> s='\xe9\xab\x98' #this utf-8 string
>>> ss=U'\xe9\xab\x98'
>>> ss1=ss.encode('unicode_escape').decode('string_escape')
>>> s1=s.decode('unicode_escape')
>>> s1==ss
True
>>> ss1==s
True

**Fredrik Lundh** · Apr 8 '06, 09:05 AM

Re: unicode wrap unicode object?

"ygao" wrote:
> I must use utf-8 for chinese.

yeah, but you shouldn't store it in a *Unicode* string. Unicode strings
are designed to hold things that you've already decoded (that is, your
chinese text), not the raw UTF-8 bytes.

if you store the UTF-8 in an ordinary 8-bit string instead, you can use
the unicode constructor to convert things properly:

b = "... some utf-8 data ..."

# turn it into a unicode string
u = unicode(b, "utf-8")

# ... do something with it ...

# turn it back into a utf-8 string
s = u.encode("utf-8")

# or use some other encoding
s = u.encode("big5")

e.g.

>>> b = '\xe9\xab\x98'
>>> u = unicode(b, "utf-8")
>>> u.encode("utf-8")
'\xe9\xab\x98'
>>> u.encode("big5")
'\xb0\xaa'
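As a rough Python 3 counterpart of the session above (an assumption about how a present-day reader would write it, not something from the thread), `bytes.decode` takes the place of the `unicode` constructor. The names `utf8_again` and `big5` are illustrative only.

```python
# Python 3 sketch: keep encoded data in bytes, decoded text in str.
b = b"\xe9\xab\x98"             # UTF-8 bytes, e.g. read from a file or socket

# decode once at the boundary; this replaces unicode(b, "utf-8")
u = b.decode("utf-8")

# ... work with the text ...

# encode again only when the data leaves the program
utf8_again = u.encode("utf-8")
big5 = u.encode("big5")

print(ascii(u), utf8_again, big5)
```

Decoding on input and encoding on output keeps the two kinds of data apart, which is the separation recommended above.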

</F>

**ygao** · Apr 8 '06, 09:05 AM

Re: unicode wrap unicode object?

thanks for your advice.

**Martin v. Löwis** · Apr 8 '06, 10:05 AM

Re: unicode wrap unicode object?

ygao wrote:
> I must use utf-8 for chinese.

Sure. But please don't do that:

> >>> import sys
> >>> reload(sys)
> >>> sys.setdefaultencoding("utf-8")

As Fredrik says, you should really avoid changing the
default encoding.

> >>> s='\xe9\xab\x98' #this utf-8 string
> >>> ss=U'\xe9\xab\x98'
> >>> ss1=ss.encode('unicode_escape').decode('string_escape')
> >>> s1=s.decode('unicode_escape')
> >>> s1==ss
> True
> >>> ss1==s
> True

Ok. But how about that:

py> s='\xe9\xab\x98'
py> ss=u'\u9ad8'
py> s1=s.decode('utf-8')
py> s1==ss
True

Here, ss is a single character, which uses 3 bytes in UTF-8.
In your example, ss has three characters, which are not Chinese,
but European.
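The difference between the two strings can be made visible with `unicodedata`. The following is a small Python 3 sketch (the thread itself uses Python 2); the helper `describe` and the variable names are hypothetical, introduced only for this illustration.

```python
import unicodedata

correct = "\u9ad8"              # one character, the decoded Chinese text
mojibake = "\xe9\xab\x98"       # three characters holding UTF-8 byte values

def describe(s):
    # Unicode name of each character; the fallback string covers
    # characters that have no name, such as control codes.
    return [unicodedata.name(ch, "<unnamed>") for ch in s]

correct_names = describe(correct)
mojibake_names = describe(mojibake)

print(len(correct), len(correct.encode("utf-8")), correct_names)
print(len(mojibake), mojibake_names)
```

The first string has length 1 and encodes to 3 bytes in UTF-8; the second has length 3, and none of its characters is a CJK ideograph.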

Regards,
Martin
