unicode wrap unicode object?

  • ygao

    unicode wrap unicode object?

    >>> import sys
    >>> sys.setdefaultencoding("utf-8")
    >>> s='\xe9\xab\x98' #this utf-8 string
    >>> ss=U'\xe9\xab\x98'
    >>> s
    '\xe9\xab\x98'
    >>> ss
    u'\xe9\xab\x98'
    >>>
    How do I get ss from s?
    Is there a way to do this?
    Thanks!

  • Fredrik Lundh

    #2
    Re: unicode wrap unicode object?

    "ygao" <ygao2004@gmail.com> wrote:

    > >>> import sys
    > >>> sys.setdefaultencoding("utf-8")

    hmm. what kind of bootleg python is that?

    >>> import sys
    >>> sys.setdefaultencoding("utf-8")
    Traceback (most recent call last):
      File "<stdin>", line 1, in ?
    AttributeError: 'module' object has no attribute 'setdefaultencoding'

    (you're not supposed to change the default encoding. don't
    do that; it'll only cause problems in the long run).

    > >>> s='\xe9\xab\x98' #this utf-8 string
    > >>> ss=U'\xe9\xab\x98'
    > >>> s
    > '\xe9\xab\x98'
    > >>> ss
    > u'\xe9\xab\x98'
    > >>>
    > How do I get ss from s?
    > Is there a way to do this?

    you have UTF-8 *bytes* in a Unicode text string? sounds like
    someone's made a mistake earlier on...

    anyway, iso-8859-1 is, in practice, a null transform that simply
    converts unicode characters (code points 0-255) to the bytes with
    the same values:

    >>> s = ss.encode("iso-8859-1")
    >>> s
    '\xe9\xab\x98'
    >>> s.decode("utf-8")
    u'\u9ad8'
    >>> import unicodedata
    >>> unicodedata.name(s.decode("utf-8"))
    'CJK UNIFIED IDEOGRAPH-9AD8'

    but it's probably better to fix the code that puts UTF-8 data in your
    Unicode strings (look for bogus iso-8859-1 conversions)
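
For readers on Python 3, the repair described above can be sketched roughly as follows. This is only an illustration, assuming the text was wrongly decoded as ISO-8859-1 somewhere upstream; the helper name `repair_mojibake` is hypothetical and does not come from the thread.

```python
import unicodedata


def repair_mojibake(text):
    """Undo a bogus ISO-8859-1 decode of UTF-8 data (hypothetical helper)."""
    # ISO-8859-1 maps code points 0-255 one-to-one onto bytes 0-255,
    # so encoding with it recovers the original byte sequence unchanged.
    raw = text.encode("iso-8859-1")
    # Decode those bytes with the encoding they were really written in.
    return raw.decode("utf-8")


ss = "\xe9\xab\x98"              # three code points holding UTF-8 byte values
fixed = repair_mojibake(ss)
print(ascii(fixed))              # '\u9ad8'
print(unicodedata.name(fixed))   # CJK UNIFIED IDEOGRAPH-9AD8
```

Note that the helper raises `UnicodeEncodeError` if the text contains anything above U+00FF, and `UnicodeDecodeError` if the recovered bytes are not valid UTF-8, which is a useful sign that the input was not damaged in this particular way.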

    </F>




    • ygao

      #3
      Re: unicode wrap unicode object?

      Sorry for my poor English.
      I got a solution from someone else.
      I must use UTF-8 for Chinese.

      >>> import sys
      >>> reload(sys)
      >>> sys.setdefaultencoding("utf-8")
      >>> s='\xe9\xab\x98' #this utf-8 string
      >>> ss=U'\xe9\xab\x98'
      >>> ss1=ss.encode('unicode_escape').decode('string_escape')
      >>> s1=s.decode('unicode_escape')
      >>> s1==ss
      True
      >>> ss1==s
      True
      >>>


        • Fredrik Lundh

          #5
          Re: unicode wrap unicode object?

          "ygao" wrote:

          > I must use UTF-8 for Chinese.

          yeah, but you shouldn't store it in a *Unicode* string. Unicode strings
          are designed to hold things that you've already decoded (that is, your
          chinese text), not the raw UTF-8 bytes.

          if you store the UTF-8 in an ordinary 8-bit string instead, you can use
          the unicode constructor to convert things properly:

          b = "... some utf-8 data ..."

          # turn it into a unicode string
          u = unicode(b, "utf-8")

          # ... do something with it ...

          # turn it back into a utf-8 string
          s = u.encode("utf-8")

          # or use some other encoding
          s = u.encode("big5")

          e.g.

          >>> b = '\xe9\xab\x98'
          >>> u = unicode(b, "utf-8")
          >>> u.encode("utf-8")
          '\xe9\xab\x98'
          >>> u.encode("big5")
          '\xb0\xaa'
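
The same round trip can be sketched for Python 3, where the "ordinary 8-bit string" is a `bytes` object and `unicode(b, "utf-8")` becomes `b.decode("utf-8")`. This is a minimal illustration rather than code from the thread; the names `as_utf8` and `as_big5` are chosen here for clarity.

```python
# UTF-8 data stays in a bytes object (Python 3's 8-bit string type).
b = b"\xe9\xab\x98"

# Decode once at the boundary to get real text.
u = b.decode("utf-8")            # same role as unicode(b, "utf-8") in Python 2

# Encode again only when bytes are needed for output.
as_utf8 = u.encode("utf-8")
as_big5 = u.encode("big5")

print(len(b), len(u))            # 3 1
print(as_utf8 == b)              # True
print(as_big5)                   # b'\xb0\xaa'
```

Keeping bytes and text in separate types this way means a mix-up fails loudly instead of silently producing a three-character string.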

          </F>




          • ygao

            #6
            Re: unicode wrap unicode object?

            thanks for your advice.


            • Martin v. Löwis

              #7
              Re: unicode wrap unicode object?

              ygao wrote:
              > I must use UTF-8 for Chinese.

              Sure. But please don't do that:

              > >>> import sys
              > >>> reload(sys)
              > >>> sys.setdefaultencoding("utf-8")

              As Fredrik says, you should really avoid changing the
              default encoding.

              > >>> s='\xe9\xab\x98' #this utf-8 string
              > >>> ss=U'\xe9\xab\x98'
              > >>> ss1=ss.encode('unicode_escape').decode('string_escape')
              > >>> s1=s.decode('unicode_escape')
              > >>> s1==ss
              > True
              > >>> ss1==s
              > True

              Ok. But how about that:

              py> s='\xe9\xab\x98'
              py> ss=u'\u9ad8'
              py> s1=s.decode('utf-8')
              py> s1==ss
              True

              Here, ss is a single character, which uses 3 bytes in UTF-8.
              In your example, ss has three characters, which are not Chinese,
              but European.
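
To make the difference concrete, here is a small Python 3 sketch that lists the characters in each string. It is an illustration only; `describe`, `decoded`, and `undecoded` are hypothetical names, not anything used in the thread.

```python
import unicodedata

decoded = b"\xe9\xab\x98".decode("utf-8")   # properly decoded: one character
undecoded = "\xe9\xab\x98"                  # byte values stored as code points


def describe(text):
    """Return (code point, Unicode name) pairs for each character."""
    # Control characters have no name, so supply a placeholder default.
    return [("U+%04X" % ord(c), unicodedata.name(c, "<no name>")) for c in text]


print(describe(decoded))
print(describe(undecoded))
```

The first list has a single CJK ideograph. The second has three entries: U+00E9 and U+00AB are Latin-1 characters, while U+0098 is a C1 control code with no name, so none of them is Chinese.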

              Regards,
              Martin
