python-unicode doesn't support >65535 symbols?

  • gabor

    python-unicode doesn't support >65535 symbols?

    hi,

    today i made some tests...

    i tested some unicode symbols, that are above the 16bit limit
    (gothic:http://www.unicode.org/charts/PDF/U10330.pdf)
    ..

    i played around with iconv and so on,
    so at the end i created an utf8 encoded text file,
    with the text "Marrakesh",
    where the second 'a' was replaced with
    GOTHIC_LETTER_AHSA (unicode-value:0x10330).

    (i simply wrote the text file "Marrakesh", used iconv to convert it to
    utf32big-endian, and replaced the character in hexedit, then converted
    with iconv back to utf8).

    now i started python:

    >>> data = open("utf8.txt").read()
    >>> data
    'Marr\xf0\x90\x8c\xb0kesh'
    >>> text = data.decode("utf8")
    >>> text
    u'Marr\U00010330kesh'

    so far it seemed ok.
    then i did:

    >>> len(text)
    10

    this is wrong. the length should be 9.
    and why?

    >>> text[0]
    u'M'
    >>> text[1]
    u'a'
    >>> text[2]
    u'r'
    >>> text[3]
    u'r'
    >>> text[4]
    u'\ud800'
    >>> text[5]
    u'\udf30'
    >>> text[6]
    u'k'
    >>>

    so text[4] (which should be \U00010330),
    was split to 2 16bit values (text[4] and text[5]).

    i don't understand.
    if the representation of 'text' is correct, why is the length wrong?

    btw. i understand that it's a very exotic character, but i tried for
    example kwrite and gedit, and none of them was able to display the
    symbol, but both successfully identified it as ONE unknown symbol.

    thanks,
    gabor




  • Michael Hudson

    #2
    Re: python-unicode doesn't support >65535 symbols?

    gabor <gabor@z10n.net> writes:

    > i played around with iconv and so on,
    > so at the end i created an utf8 encoded text file,
    > with the text "Marrakesh",
    > where the second 'a' was replaced with
    > GOTHIC_LETTER_AHSA (unicode-value:0x10330).
    >
    > (i simply wrote the text file "Marrakesh", used iconv to convert it to
    > utf32big-endian, and replaced the character in hexedit, then converted
    > with iconv back to utf8).
    >
    > now i started python:
    >
    > >>> data = open("utf8.txt").read()
    > >>> data
    > 'Marr\xf0\x90\x8c\xb0kesh'
    > >>> text = data.decode("utf8")
    > >>> text
    > u'Marr\U00010330kesh'
    >
    > so far it seemed ok.
    > then i did:
    >
    > >>> len(text)
    > 10
    >
    > this is wrong. the length should be 9.

    I suspect you have a "narrow unicode" build of Python. You can make
    yourself a "wide unicode" build easily enough.
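
    One way to tell which kind of build is running is to look at
    sys.maxunicode. The following is only a minimal sketch; the helper name
    is_wide_build is made up for illustration. Note the assumption that on
    Python 3.3 and later the narrow/wide distinction no longer exists, so the
    check always reports a wide build there.

```python
import sys


def is_wide_build():
    """Return True if this interpreter stores one code point per string item.

    sys.maxunicode is 0xFFFF on a narrow (UCS-2) build and 0x10FFFF on a
    wide (UCS-4) build. Python 3.3 and later always report 0x10FFFF.
    """
    return sys.maxunicode > 0xFFFF


print("wide build" if is_wide_build() else "narrow build")
```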

    > and why?
    >
    > >>> text[0]
    > u'M'
    > >>> text[1]
    > u'a'
    > >>> text[2]
    > u'r'
    > >>> text[3]
    > u'r'
    > >>> text[4]
    > u'\ud800'
    > >>> text[5]
    > u'\udf30'
    > >>> text[6]
    > u'k'
    > >>>
    >
    > so text[4] (which should be \U00010330),
    > was split to 2 16bit values (text[4] and text[5]).
    >
    > i don't understand.
    > if the representation of 'text' is correct, why is the length wrong?

    I expect that this has to do with surrogates or some other unicode
    thing that's beyond my understanding...

    Cheers,
    mwh

    --
    It's actually a corruption of "starling". They used to be carried.
    Since they weighed a full pound (hence the name), they had to be
    carried by two starlings in tandem, with a line between them.
    -- Alan J Rosenthal explains "Pounds Sterling" on asr


    • Andrew Clover

      #3
      Re: python-unicode doesn't support >65535 symbols?

      gabor <gabor@z10n.net> wrote:

      > so text[4] (which should be \U00010330),
      > was split to 2 16bit values (text[4] and text[5]).

      The default encoding for native Unicode strings in Python is UTF-16, which
      cannot hold the extended planes beyond 0xFFFF in a single character. Instead,
      it uses two 'surrogate' characters. Bit of a nasty hack, but that's what
      Unicode does and there's nothing that can be done about it now.

      Python can be compiled to use UCS-4 for native Unicode strings if you prefer.
      Then every conceptual 'character' from the Unicode repertoire will be one
      item long. It'll eat more memory too of course.
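
      The split into two surrogates is plain arithmetic, and it can be
      sketched as below. The function name to_surrogate_pair is invented
      for illustration; the constants are the standard UTF-16 ones. Applied
      to 0x10330 it gives the same two values seen in the original session.

```python
def to_surrogate_pair(code_point):
    """Split a supplementary-plane code point into two UTF-16 code units."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point is not in a supplementary plane")
    offset = code_point - 0x10000      # leaves a 20-bit value
    high = 0xD800 + (offset >> 10)     # top 10 bits go in the high surrogate
    low = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits go in the low surrogate
    return high, low


high, low = to_surrogate_pair(0x10330)
print(hex(high), hex(low))  # 0xd800 0xdf30
```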

      > if the representation of 'text' is correct, why is the length wrong?

      The representation of 'text' you are seeing is just the nicely-readable
      version output by Python 2.2+. Despite the \U sequence, it is actually still
      stored internally as two UTF-16 surrogates. You'll see this if you enter
      '\U00012345' into Python 2.0 or 2.1, which don't use the \U form to output
      strings.

      --
      Andrew Clover
      mailto:and@doxdesk.com


      • Rainer Deyke

        #4
        Re: python-unicode doesn't support >65535 symbols?

        Andrew Clover wrote:
        > gabor <gabor@z10n.net> wrote:
        >
        >> so text[4] (which should be \U00010330),
        >> was split to 2 16bit values (text[4] and text[5]).
        >
        > The default encoding for native Unicode strings in Python is UTF-16,
        > which cannot hold the extended planes beyond 0xFFFF in a single
        > character.

        That's not quite right. UTF-16 encodes unicode characters as either single
        16 bit values or pairs of 16 bit values. However, one character is still
        one character.

        Python makes the mistake of exposing the internal representation instead of
        the logical value of unicode objects. This means that, aside from space
        optimization, unicode objects have no advantage over UTF-8 encoded plain
        strings for storing unicode text.
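
        The difference between counting code units and counting characters
        can be sketched as follows. This assumes Python 3 and well-formed
        UTF-16; the helper names utf16_units and count_code_points are
        hypothetical, not part of any library.

```python
import struct


def utf16_units(text):
    """Return the UTF-16 code units of text as a list of integers."""
    data = text.encode("utf-16-be")
    return list(struct.unpack(">%dH" % (len(data) // 2), data))


def count_code_points(units):
    """Count characters in well-formed UTF-16 by skipping low surrogates.

    Each code point contributes exactly one unit that is not a low
    surrogate: either a BMP unit or the high half of a pair.
    """
    return sum(1 for unit in units if not 0xDC00 <= unit <= 0xDFFF)


units = utf16_units("Marr\U00010330kesh")
print(len(units), count_code_points(units))  # 10 9
```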


        --
        Rainer Deyke - rainerd@eldwood.com - http://eldwood.com



        • Martin v. Löwis

          #5
          Re: python-unicode doesn't support >65535 symbols?

          "Rainer Deyke" <rainerd@eldwood.com> writes:

          > Python makes the mistake of exposing the internal representation instead of
          > the logical value of unicode objects. This means that, aside from space
          > optimization, unicode objects have no advantage over UTF-8 encoded plain
          > strings for storing unicode text.

          That is not true. First, it is not "Python", but a specific Python
          configuration - in "wide Unicode" builds, it uses UCS-4 internally.

          In either build, len() and indexing address code units, not
          characters: that is true.

          However, it is not true that there is no advantage over UTF-8 encoded
          byte strings. Instead, there are several advantages:
          - In a UCS-4 build, Unicode characters and code units are in a 1:1
          relationship
          - In a UCS-2 build, Unicode characters and code units are in a 1:1
          relationship as long as the application only ever processes BMP
          characters.
          - In either case, a Unicode object has inherent information about the
          character set, which a UTF-8 byte string does not have. IOW, you know
          what a Unicode object is, but you don't know (inherently) whether a
          byte string is UTF-8.
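
          The last point can be illustrated with the bytes from the original
          post. This is only a sketch, assuming Python 3 (where decoded
          strings are indexed by code point); latin-1 is picked merely as an
          example of another encoding that accepts the same bytes.

```python
# The raw bytes read from the file in the original post.
data = b"Marr\xf0\x90\x8c\xb0kesh"

# Nothing in the byte string says which encoding it uses, so both of
# these decodes succeed, but they give different text.
as_utf8 = data.decode("utf-8")
as_latin1 = data.decode("latin-1")

print(len(data), len(as_utf8), len(as_latin1))  # 12 9 12
print(as_utf8 == as_latin1)                     # False
```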

          Regards,
          Martin
