compare unicode to non-unicode strings

  • Asterix

    compare unicode to non-unicode strings

    how could I test that those 2 strings are the same:

    'séd' (repr is 's\\xc3\\xa9d')

    u'séd' (repr is u's\\xe9d')
  • John Machin

    #2
    Re: compare unicode to non-unicode strings

    On Aug 31, 11:04 pm, Asterix <aste...@lagaule.org> wrote:
    > how could I test that those 2 strings are the same:
    >
    > 'séd' (repr is 's\\xc3\\xa9d')
    No, the repr is 's\xc3\xa9d'.
    >
    > u'séd' (repr is u's\\xe9d')
    No, the repr is u's\xe9d'.

    To answer your question:




    • John Machin

      #3
      Re: compare unicode to non-unicode strings

      On Aug 31, 11:04 pm, Asterix <aste...@lagaule.org> wrote:
      > how could I test that those 2 strings are the same:
      >
      > 'séd' (repr is 's\\xc3\\xa9d')
      >
      > u'séd' (repr is u's\\xe9d')
      [note: your reprs are wrong; change the \\ to \]

      You need to decode the non-unicode string and compare the result with
      the unicode string. You need to know the encoding used for the non-
      unicode string. In the example that you gave, it's about 99.99% likely
      that it's UTF-8.
      >>> 's\xc3\xa9d'.decode('utf8')
      u's\xe9d'
      >>> u's\xe9d'.encode('utf8')
      's\xc3\xa9d'
      >>>
      HTH,
      John
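
The decode-then-compare advice above can be sketched as a small helper. This is a minimal sketch written for Python 3, where `bytes` plays the role of the thread's non-unicode `str` and `str` plays the role of `unicode`. The name `same_text` is hypothetical, and UTF-8 is only an assumed default; the real encoding has to be known.

```python
# Python 3 sketch: bytes stands in for the thread's non-unicode string,
# str stands in for the unicode string.

def same_text(raw, text, encoding="utf-8"):
    """Decode raw bytes with the given encoding and compare with text.

    same_text is a hypothetical helper name; UTF-8 is an assumed default.
    """
    return raw.decode(encoding) == text


raw = b"s\xc3\xa9d"   # UTF-8 bytes for 'séd'
text = "s\xe9d"       # the same word as a text string

print(same_text(raw, text))          # True
print(raw == text.encode("utf-8"))   # True: encoding the text works too
```

Decoding the bytes is usually the safer direction, since it leaves both sides as text for any later processing.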


      • Fredrik Lundh

        #4
        Re: compare unicode to non-unicode strings

        Asterix wrote:
        > how could I test that those 2 strings are the same:
        >
        > 'séd' (repr is 's\\xc3\\xa9d')
        >
        > u'séd' (repr is u's\\xe9d')
        Determine what encoding the former string is using (it looks like UTF-8),
        and convert it to Unicode before doing the comparison.
        >>> b = 's\xc3\xa9d'
        >>> u = u's\xe9d'
        >>> b
        's\xc3\xa9d'
        >>> u
        u's\xe9d'
        >>> unicode(b, "utf-8")
        u's\xe9d'
        >>> unicode(b, "utf-8") == u
        True

        </F>
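
Why the encoding has to be determined first can be illustrated with a short sketch, again assuming Python 3 (`bytes` for the byte string, `str` for the Unicode string). Decoding with the wrong codec may succeed silently and give a different string, or it may raise an error. The variable names here are illustrative only.

```python
raw = b"s\xc3\xa9d"   # UTF-8 bytes for 'séd'
text = "s\xe9d"       # the Unicode string to compare against

# Wrong guess: Latin-1 accepts every byte, so no error is raised,
# but each of the two UTF-8 bytes becomes its own character.
wrong = raw.decode("latin-1")
print(wrong == text)            # False
print(len(wrong), len(text))    # 4 3

# Right guess: UTF-8 gives back the original three characters.
print(raw.decode("utf-8") == text)   # True

# The opposite mistake: Latin-1 bytes are not valid UTF-8 here,
# so decoding raises UnicodeDecodeError instead of returning garbage.
latin1_bytes = text.encode("latin-1")   # b's\xe9d'
try:
    latin1_bytes.decode("utf-8")
    failed = False
except UnicodeDecodeError:
    failed = True
print(failed)   # True
```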


        • Méta-MCI (MVP)

          #5
          Re: compare unicode to non-unicode strings

          By Toutatis!
          If you had asked Ordralphabétix, or posted on one of the French newsgroups
          devoted to Python, instead of redoing "La grande Traversée", the answer
          might have come sooner.

          @-salutations
          --
          Michel Claveau



          • Matt Nordhoff

            #6
            Re: compare unicode to non-unicode strings

            Asterix wrote:
            > how could I test that those 2 strings are the same:
            >
            > 'séd' (repr is 's\\xc3\\xa9d')
            >
            > u'séd' (repr is u's\\xe9d')
            You may also want to look at unicodedata.normalize(). For example, é can
            be represented multiple ways:
            >>> import unicodedata
            >>> unicodedata.normalize('NFC', u'é')
            u'\xe9'
            >>> unicodedata.normalize('NFD', u'é')
            u'e\u0301'
            >>> u'\xe9' == u'e\u0301'
            False

            The first form is "composed", just being U+00E9 (LATIN SMALL LETTER E
            WITH ACUTE). The second form is "decomposed", being made up of U+0065
            (LATIN SMALL LETTER E) and U+0301 (COMBINING ACUTE ACCENT).

            Even though they represent the same thing to a human, they don't compare
            as equal. But if you normalize them to the same form, they will.
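
The normalize-then-compare step described above could look like the following sketch, written for Python 3. The helper name `same_after_nfc` is hypothetical, and NFC is an assumed choice; any form works as long as both sides use the same one.

```python
import unicodedata


def same_after_nfc(a, b):
    """Compare two text strings after normalizing both to NFC.

    same_after_nfc is a hypothetical helper name; NFC is an assumed choice.
    """
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


composed = "\xe9"         # U+00E9
decomposed = "e\u0301"    # U+0065 followed by U+0301

print(composed == decomposed)                 # False
print(same_after_nfc(composed, decomposed))   # True
```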

            For more information, look at the unicodedata module's documentation:
            <http://docs.python.org/lib/module-unicodedata.html>
            --
