hi,
today i made some tests...
i tested some unicode symbols that are above the 16-bit limit
(gothic: http://www.unicode.org/charts/PDF/U10330.pdf).
i played around with iconv and so on,
so at the end i created a utf8-encoded text file
with the text "Marrakesh",
where the second 'a' was replaced with
GOTHIC LETTER AHSA (unicode value: 0x10330).
(i simply wrote the text file "Marrakesh", used iconv to convert it to
utf32 big-endian, replaced the character in hexedit, then converted
it back to utf8 with iconv).
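for anyone who wants to reproduce the file without iconv and hexedit, here is a minimal sketch of how the same bytes could be built directly in python. it is written in python 3 syntax (my session below is python 2, where the literal would need a u"" prefix), and the helper name write_sample is just something i made up for illustration:

```python
# Sketch, Python 3 syntax: build "Marrakesh" with the second 'a'
# replaced by U+10330 and encode it as UTF-8.
text = "Marr\U00010330kesh"
data = text.encode("utf-8")

# U+10330 lies above U+FFFF, so UTF-8 needs four bytes for it.
print(data)


def write_sample(path):
    """Hypothetical helper: write the encoded bytes to a file."""
    with open(path, "wb") as f:
        f.write(data)
```

calling write_sample("utf8.txt") should give a file equivalent to the one i made by hand.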
now i started python:
>>> data = open("utf8.txt").read()
>>> data
'Marr\xf0\x90\x8c\xb0kesh'
>>> text = data.decode("utf8")
>>> text
u'Marr\U00010330kesh'
so far it seemed ok.
then i did:
>>> len(text)
10
this is wrong. the length should be 9.
and why?
>>> text[0]
u'M'
>>> text[1]
u'a'
>>> text[2]
u'r'
>>> text[3]
u'r'
>>> text[4]
u'\ud800'
>>> text[5]
u'\udf30'
>>> text[6]
u'k'
so text[4] (which should be \U00010330)
was split into two 16-bit values (text[4] and text[5]).
i don't understand.
if the representation of 'text' is correct, why is the length wrong?
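for what it's worth, the two values u'\ud800' and u'\udf30' look like a utf-16 surrogate pair for 0x10330. here is a minimal sketch of that arithmetic, again in python 3 syntax; the function name utf16_surrogates is my own, not a library call. i am assuming (not sure) that my python stores unicode strings as 16-bit units, which would explain a length of 10:

```python
def utf16_surrogates(code_point):
    """Split a code point above U+FFFF into its UTF-16 (high, low) pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above U+FFFF have surrogate pairs")
    offset = code_point - 0x10000      # 20-bit value
    high = 0xD800 + (offset >> 10)     # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits
    return high, low


high, low = utf16_surrogates(0x10330)
print(hex(high), hex(low))  # 0xd800 0xdf30

text = "Marr\U00010330kesh"
code_points = len(text)                           # Python 3 counts code points
utf16_units = len(text.encode("utf-16-be")) // 2  # number of 16-bit units
print(code_points, utf16_units)  # 9 10
```

so counted in 16-bit units the string has 10 elements, while counted in code points it has the 9 i expected.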
btw. i understand that it's a very exotic character, but i tried for
example kwrite and gedit, and neither of them was able to display the
symbol, but both successfully identified it as ONE unknown symbol.
thanks,
gabor