compare unicode to non-unicode strings

**John Machin** · Aug 31 '08, 01:26 PM

Re: compare unicode to non-unicode strings

On Aug 31, 11:04 pm, Asterix <aste...@lagaule.org> wrote:

> how could I test that those 2 strings are the same:
>
> 'séd' (repr is 's\\xc3\\xa9d')

No, the repr is 's\xc3\xa9d'.

> u'séd' (repr is u's\\xe9d')

No, the repr is u's\xe9d'.

To answer your question:

**John Machin** · Aug 31 '08, 01:35 PM

Re: compare unicode to non-unicode strings

On Aug 31, 11:04 pm, Asterix <aste...@lagaule.org> wrote:

> how could I test that those 2 strings are the same:
>
> 'séd' (repr is 's\\xc3\\xa9d')
>
> u'séd' (repr is u's\\xe9d')

[note: your reprs are wrong; change the \\ to \]

You need to decode the non-unicode string and compare the result with
the unicode string. You need to know the encoding used for the non-
unicode string. In the example that you gave, it's about 99.99% likely
that it's UTF-8.

>>> 's\xc3\xa9d'.decode('utf8')

u's\xe9d'

>>> u's\xe9d'.encode('utf8')

's\xc3\xa9d'

>>>
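For readers on Python 3, where the non-unicode string is a `bytes` object and every `str` is unicode, the same decode-then-compare step might look like the sketch below. The helper name `same_text` and its UTF-8 default are assumptions made for illustration; they are not part of the thread.

```python
def same_text(raw, text, encoding="utf-8"):
    """Return True if the bytes `raw`, decoded with `encoding`, equal `text`.

    `same_text` is a hypothetical helper. The caller still has to know
    (or guess) the encoding of `raw`, as discussed above.
    """
    return raw.decode(encoding) == text


raw = b"s\xc3\xa9d"   # UTF-8 bytes, the Python 3 spelling of 's\xc3\xa9d'
text = "s\xe9d"       # unicode string, the Python 3 spelling of u's\xe9d'

print(same_text(raw, text))          # True: decode first, then compare
print(raw == text)                   # False: bytes and str never compare equal
print(text.encode("utf-8") == raw)   # True: encoding the unicode side works too
```

Decoding the byte string is usually the safer direction, since the comparison then happens on text rather than on one particular byte representation of it.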

HTH,
John

**Fredrik Lundh** · Aug 31 '08, 01:55 PM

Re: compare unicode to non-unicode strings

Asterix wrote:

> how could I test that those 2 strings are the same:
>
> 'séd' (repr is 's\\xc3\\xa9d')
>
> u'séd' (repr is u's\\xe9d')

determine what encoding the former string is using (looks like UTF-8),
and convert it to Unicode before doing the comparison.

>>> b = 's\xc3\xa9d'
>>> u = u's\xe9d'
>>> b

's\xc3\xa9d'

>>> u

u's\xe9d'

>>> unicode(b, "utf-8")

u's\xe9d'

>>> unicode(b, "utf-8") == u

True
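Why the encoding has to be determined first can be seen by decoding the same bytes with a wrong guess. This is a Python 3 sketch (the thread itself uses Python 2), and Latin-1 is picked here only as an example of a plausible wrong guess.

```python
raw = b"s\xc3\xa9d"       # the UTF-8 encoded bytes from the question
expected = "s\xe9d"       # the unicode string to compare against

right = raw.decode("utf-8")      # correct guess: two bytes become one character
wrong = raw.decode("latin-1")    # wrong guess: every byte becomes one character

print(right == expected)         # True
print(wrong == expected)         # False
print(len(right), len(wrong))    # 3 4

# The opposite mistake, treating Latin-1 bytes as UTF-8, fails outright
# with an exception instead of silently producing the wrong text.
try:
    b"s\xe9d".decode("utf-8")
except UnicodeDecodeError as exc:
    print("not valid UTF-8:", exc.reason)
```

A wrong single-byte guess never raises, so a mismatch in the comparison may be the only visible symptom.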

</F>

**Méta-MCI (MVP)** · Aug 31 '08, 07:05 PM

Re: compare unicode to non-unicode strings

By Toutatis!
If you had put the question to Ordralphabétix, or to one of the French
newsgroups devoted to Python, instead of redoing "La Grande Traversée", the
answer might have come faster.

@-salutations
--
Michel Claveau

**Matt Nordhoff** · Aug 31 '08, 07:35 PM

Re: compare unicode to non-unicode strings

Asterix wrote:

u'\xe9'

u'e\u0301'

>>> u'\xe9' == u'e\u0301'

False

The first form is "composed", just being U+00E9 (LATIN SMALL LETTER E
WITH ACUTE). The second form is "decomposed", being made up of U+0065
(LATIN SMALL LETTER E) and U+0301 (COMBINING ACUTE ACCENT).

Even though they represent the same thing to a human, they don't compare
as equal. But if you normalize them to the same form, they will.
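A minimal sketch of that normalization step, written for Python 3 (in Python 2 the same `unicodedata.normalize` call takes `u'...'` literals). The variable names are illustrative only.

```python
import unicodedata

composed = "\xe9"          # U+00E9 LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"     # U+0065 followed by U+0301 COMBINING ACUTE ACCENT

print(composed == decomposed)    # False: different code point sequences

# NFC composes where possible, NFD decomposes; either works for comparing,
# as long as both strings are brought to the same form.
nfc = unicodedata.normalize("NFC", decomposed)
nfd = unicodedata.normalize("NFD", composed)

print(nfc == composed)           # True
print(nfd == decomposed)         # True
```

Normalizing both sides to one form before comparing avoids having to know which form each input arrived in.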

For more information, look at the unicodedata module's documentation:
<http://docs.python.org/lib/module-unicodedata.html>
--
