utf-8 encoding issue

  • Marc Petitmermet

    utf-8 encoding issue

    The line below looks up the name "öttinger" (with the German umlaut) of
    an author using the mysql console:

    mysql> select author from records where author like '%Öttinger%';

    This successfully finds all entries in the records database where
    "öttinger" is the author or the co-author.

    In a web form, the user enters "öttinger" and wants to search with this
    search string. My idea is now to convert the search string (which also
    could be e.g. some cyrillic text) into unicode and then to utf-8:

    unicode(search_string).encode('utf-8')

    This gives me the utf-8 encoded version of the string but not yet in the
    correct representation. How can I get the correct one (is this the hex
    version? I don't know the correct terminology.)?

    In short: how do I e.g. convert a string containing a "ö" into a string
    containing a "%Ö"?

    Regards,
    Marc
  • Fredrik Lundh

    #2
    Re: utf-8 encoding issue

    Marc Petitmermet wrote:
    > In a web form, the user enters "öttinger" and wants to search with this
    > search string. My idea is now to convert the search string (which also
    > could be e.g. some cyrillic text) into unicode and then to utf-8:
    >
    > unicode(search_string).encode('utf-8')
    >
    > This gives me the utf-8 encoded version of the string but not yet in the
    > correct representation. How can I get the correct one (is this the hex
    > version? I don't know the correct terminology.)?
    >
    > In short: how do I e.g. convert a string containing a "ö" into a string
    > containing a "%Ö"?

    that's not UTF-8, that's HTML/XML-style charrefs.
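
To make the distinction concrete, here is a minimal sketch contrasting the UTF-8 bytes of "Ö" with its numeric character reference. It assumes Python 3 (where `str` is already a Unicode string) rather than the Python 2 used elsewhere in this thread, and the variable names are illustrative only.

```python
# Assumption: Python 3, where str is a Unicode string.
ch = "\u00d6"  # "Ö", LATIN CAPITAL LETTER O WITH DIAERESIS

utf8_bytes = ch.encode("utf-8")  # byte-level encoding of the character
charref = "&#%d;" % ord(ch)      # HTML/XML numeric character reference

print(utf8_bytes)  # b'\xc3\x96'
print(charref)     # &#214;
```

The first value is what an `encode('utf-8')` call produces; the second is the markup-level form that appears to be wanted for the query.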

    if mysql translates the charrefs to unicode characters, you can simply
    use:

    s = u.encode("ascii", "xmlcharrefreplace")

    where "u" is a unicode string.
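
As a rough illustration of that call, the sketch below wraps it in a helper. It assumes Python 3, where `encode()` returns bytes that need decoding back to `str`; the name `to_charrefs` is hypothetical and not part of any library.

```python
def to_charrefs(text):
    """Replace every non-ASCII character with a decimal charref."""
    # Assumption: Python 3. encode() returns bytes, so decode back to str.
    return text.encode("ascii", "xmlcharrefreplace").decode("ascii")


print(to_charrefs("\u00d6ttinger"))  # &#214;ttinger
print(to_charrefs("plain ascii"))    # plain ascii
```

ASCII input passes through unchanged, so the same helper should also cope with mixed or Cyrillic search strings.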

    if you've stored charrefs as is in the database, you're in for some
    serious trouble. assuming that all charrefs are hexadecimal charrefs,
    you can use something like:

    import re

    def fixup(m): return "&#" + hex(int(m.group(1)))[1:]
    s = re.sub(r"&#(\d+)", fixup, u.encode("ascii", "xmlcharrefreplace"))

    to map all non-ASCII characters to charrefs, and then translate all
    charrefs to hexadecimal charrefs.
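
The same two-step idea can be sketched for Python 3 as follows. This is an assumption-laden variant, not the code from the post: `to_hex_charrefs` is a hypothetical name, and the pattern here also consumes the trailing semicolon and writes it back.

```python
import re


def to_hex_charrefs(text):
    """Map non-ASCII characters to hexadecimal charrefs such as &#xd6;."""
    # Step 1: non-ASCII characters become decimal charrefs (&#214;).
    decimal = text.encode("ascii", "xmlcharrefreplace").decode("ascii")
    # Step 2: rewrite each decimal charref in hexadecimal form (&#xd6;).
    return re.sub(r"&#(\d+);", lambda m: "&#x%x;" % int(m.group(1)), decimal)


print(to_hex_charrefs("\u00d6ttinger"))  # &#xd6;ttinger
```

Note that any decimal charref already present in the input is rewritten to hexadecimal as well, which may or may not be desired.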

    decoding the charrefs *before* you add the strings to the database
    is a better idea, though.
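
One possible way to do that decoding, assuming Python 3.4 or later, is the standard library's `html.unescape`. The wrapper name `decode_charrefs` below is hypothetical.

```python
import html


def decode_charrefs(text):
    """Turn charrefs back into real characters before storing the string."""
    # html.unescape handles decimal and hexadecimal charrefs, and also
    # named entities such as &amp;, which may be more than is wanted.
    return html.unescape(text)


print(decode_charrefs("&#214;ttinger"))
print(decode_charrefs("&#xD6;ttinger"))
```

Both calls should yield the same Unicode string, which can then be encoded once (for example as UTF-8) on its way into the database.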

    </F>



