Unicode troubles

  • Rodrigo Benenson

    Unicode troubles

    Hi!
    I'm finishing a multiplatform collaborative realtime text editor (something
    like SubEthaEdit, but multiplatform and open source) developed using
    Python+Twisted as a plugin for Leo.

    Of course, as the software runs on different platforms in different places,
    text encoding compatibility is an issue.
    So the obvious choice was the Tk encoding for the client GUI, Unicode for
    system internals and UTF-8 for web outputs.
    But I'm having serious trouble using Tk with Unicode internals.

    The system, being a text editor, uses string lengths and positions in the
    text widget as parameters of most of its critical algorithms.
    Unfortunately, I recently discovered that some encodings do not provide
    an equivalence between
    num_of_chars/length_of_string/position_in_text_widget. As a result, each time
    someone presses a non-ASCII key, the references are lost and the other clients
    receive a soup of letters.

    I had read on the internet that Unicode was supposed to keep the relation
    num_of_char/string_length (and thus the relation
    string_length/num_of_char/position_in_text_widget). But this relation does
    not hold on all my machines.

    Sometimes I get len(u"eló") = 3 (the good result) and other times
    len(u"eló") = 4 (wrong result). This seems independent of the OS.

    Could someone explain this issue to me? How am I supposed to manage this
    problem? Do I have to compile Python with special parameters to get Unicode
    chars of one length unit each?

    Thanks.
    Rodrigo Benenson.
  • Michael Radziej

    #2
    Re: Unicode troubles

    Rodrigo Benenson wrote:
    > Sometimes I get len(u"eló") = 3 (the good result) and other times
    > len(u"eló") = 4 (wrong result). This seems independent of the OS.

    There are different ways to express "special" characters.
    E.g. you can represent "ó" as a single precomposed character,
    or as "o" followed by a combining accent.
    What you want is a "canonical form".
    Take a look at unicodedata.normalize (new in Python 2.3).



    Hope this helps,

    Michael Radziej
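
A minimal sketch of the normalization idea suggested above, written for current Python 3 rather than the Python 2.3 of the thread. It assumes the length mismatch comes from "ó" arriving in decomposed form, which the thread does not confirm; `to_nfc` is a hypothetical helper name, not part of any library.

```python
import unicodedata

# Two spellings of the same visible text "eló".
composed = "el\u00f3"      # 'ó' as one precomposed code point (U+00F3)
decomposed = "elo\u0301"   # 'o' followed by COMBINING ACUTE ACCENT (U+0301)


def to_nfc(text):
    """Hypothetical helper: return text in canonical composed form (NFC)."""
    return unicodedata.normalize("NFC", text)


print(len(composed), len(decomposed))   # 3 4
print(composed == decomposed)           # False
print(len(to_nfc(decomposed)))          # 3
print(to_nfc(decomposed) == composed)   # True
```

Normalizing every string at the boundary (keyboard input, network messages) before computing lengths or widget positions would keep all clients counting the same units. NFC does not guarantee one code point per visible character for every script, since some combinations have no precomposed form, so this only narrows the problem.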
