Replacement in unicodestrings?

  • KvS

    Replacement in unicodestrings?

    Dear all,

    could somebody please put an end to the Unicode misery I'm in,
    man... The situation is that I have a Tkinter program that lets the
    user enter data in some Entry widgets, and this data needs to be
    transformed to an encoding compatible with an .rtf file. In fact I only
    need to handle some of the usual symbols like ë etc.

    Here's the function that I am using:

    def pythonUnicodeToRTFAscii(self, s):
        if isinstance(s, str):
            return s
        s_str = repr(s.encode('UTF-8'))
        replDic = {'\xc3\xa0': "\\'e0", '\xc3\xa4': "\\'e4", '\xc3\xa1': "\\'e1",
                   '\xc3\xa8': "\\'e8", '\xc3\xab': "\\'eb", '\xc3\xa9': "\\'e9",
                   '\xc3\xb2': "\\'f2", '\xc3\xb6': "\\'f6", '\xc3\xb3': "\\'f3",
                   '\xe2\x82\xac': "\\'80"}
        for k in replDic.keys():
            if repr(k) in s_str:
                s_str = s_str.replace(repr(k), replDic[k])
        return s_str

    So replDic represents the mapping from one encoding to the other. Now,
    if I enter e.g. 'Arjën' in the Entry, then s_str in the above function
    becomes 'Arj\xc3\xabn' and since replDic contains the key \xc3\xab I
    would expect the replacement in the final lines of the function to
    kick in. This however doesn't happen, there's no match.

    However interactive:
    >>> '\xc3\xab' in 'Arj\xc3\xabn'
    True

    I just don't get it, what's the difference? Is the above anyhow the
    best way to attack such a problem?

    Thanks & best wishes, Kees
  • "Martin v. Löwis"

    #2
    Re: Replacement in unicodestrings?

    > s_str=repr(s.encode('UTF-8'))

    It would be easier to encode this in cp1252 here, as this is apparently
    the encoding that you want to use in the RTF file, too. You could then
    loop over the string, replacing all bytes >= 128 with \\'%.2x
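A minimal sketch of the cp1252 approach described above, written for Python 3 (where `str` is Unicode text and iterating over `bytes` yields integers). The function name `to_rtf_ascii` is hypothetical, and the sketch does not escape RTF's own special characters (backslash and braces).

```python
def to_rtf_ascii(text):
    """Return text with every non-ASCII character written as an RTF \\'xx escape.

    Assumption: the RTF file uses cp1252, so the escape is the cp1252 byte value.
    Characters that cp1252 cannot represent raise UnicodeEncodeError.
    """
    pieces = []
    for byte in text.encode("cp1252"):
        if byte >= 128:
            # Non-ASCII byte: emit backslash, apostrophe, two hex digits.
            pieces.append("\\'%.2x" % byte)
        else:
            pieces.append(chr(byte))
    return "".join(pieces)


print(to_rtf_ascii("Arj\u00ebn"))  # Arj\'ebn
print(to_rtf_ascii("\u20ac"))      # \'80
```

Because the escape is derived from the byte value, no lookup table is needed, and the results agree with the entries in the original `replDic` (for example `\'eb` for ë and `\'80` for the euro sign).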

    As yet another alternative, you could create a Unicode error handler
    (call it 'rtf'), and then do

    return s.encode('ascii', errors='rtf')
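One way such an error handler might look, sketched for Python 3 with the standard `codecs.register_error` function. The handler name `rtf_errors` is hypothetical, and the choice of cp1252 byte values for the escapes is an assumption carried over from the suggestion above.

```python
import codecs


def rtf_errors(exc):
    """Codec error handler: replace unencodable characters with RTF \\'xx escapes."""
    if not isinstance(exc, UnicodeEncodeError):
        raise exc
    bad = exc.object[exc.start:exc.end]
    # Assumption: the RTF file uses cp1252, so escape the cp1252 byte values.
    # A character outside cp1252 raises UnicodeEncodeError from this call.
    escaped = "".join("\\'%.2x" % b for b in bad.encode("cp1252"))
    # Return the replacement text and the position at which encoding resumes.
    return escaped, exc.end


codecs.register_error("rtf", rtf_errors)

result = "Arj\u00ebn".encode("ascii", errors="rtf")
print(result.decode("ascii"))  # Arj\'ebn
```

Once registered, the handler is available by name to any `encode` call in the process, so the conversion function itself reduces to a single line.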

    > replDic={'\xc3\xa0':"\\'e0",'\xc3\xa4':"\\'e4",'\xc3\xa1':"\\'e1",
    >          '\xc3\xa8':"\\'e8",'\xc3\xab':"\\'eb",'\xc3\xa9':"\\'e9",
    >          '\xc3\xb2':"\\'f2",'\xc3\xb6':"\\'f6",'\xc3\xb3':"\\'f3",
    >          '\xe2\x82\xac':"\\'80"}
    > for k in replDic.keys():
    >     if repr(k) in s_str:
    >         s_str=s_str.replace(repr(k),replDic[k])
    > return s_str
    >
    > However interactive:
    >
    > >>> '\xc3\xab' in 'Arj\xc3\xabn'
    > True
    >
    > I just don't get it, what's the difference?

    It's the repr():

    py> '\xc3\xab' in 'Arj\xc3\xabn'
    True
    py> repr('\xc3\xab') in repr('Arj\xc3\xabn')
    False
    py> repr('\xc3\xab')
    "'\\xc3\\xab'"
    py> repr('Arj\xc3\xabn')
    "'Arj\\xc3\\xabn'"

    repr('\xc3\xab') starts with an apostrophe, which doesn't
    appear before the \\xc3 in repr('Arj\xc3\xabn').
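If the dictionary approach is kept, the explanation above suggests simply dropping the repr() calls and matching the encoded byte sequences directly. A minimal Python 3 sketch follows; the names `REPL` and `replace_utf8_sequences` are hypothetical, and the table holds only two of the original entries for brevity.

```python
# Illustrative subset of the original replDic, keyed by UTF-8 byte sequences.
REPL = {
    b"\xc3\xab": b"\\'eb",      # ë
    b"\xe2\x82\xac": b"\\'80",  # euro sign
}


def replace_utf8_sequences(text):
    """Replace known UTF-8 sequences with RTF escapes, without using repr()."""
    data = text.encode("utf-8")
    for utf8_seq, rtf_escape in REPL.items():
        data = data.replace(utf8_seq, rtf_escape)
    # Decoding as ASCII fails if an unmapped non-ASCII character remains.
    return data.decode("ascii")


print(replace_utf8_sequences("Arj\u00ebn"))  # Arj\'ebn
```

Comparing the raw byte strings avoids the extra quote characters and doubled backslashes that repr() adds, which is what prevented the match.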

    HTH,
    Martin
