utf-8 encoding issue

**Fredrik Lundh** · Jul 18 '05, 02:44 AM

Re: utf-8 encoding issue

Marc Petitmermet wrote:
> In a web form, the user enters "öttinger" and wants to search with this
> search string. My idea is now to convert the search string (which also
> could be e.g. some cyrillic text) into unicode and then to utf-8:
>
> unicode(search_string).encode('utf-8')
>
> This gives me the utf-8 encoded version of the string but not yet in the
> correct representation. How can I get the correct one (is this the hex
> version? I don't know the correct terminology.)?
>
> In short: how do I e.g. convert a string containing a "ö" into a string
> containing a "%Ö"?

that's not UTF-8, that's HTML/XML-style charrefs.

if mysql translates the charrefs to unicode characters, you can simply
use:

s = u.encode("ascii", "xmlcharrefreplace")

where "u" is a unicode string.
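
for readers on Python 3, where every str is already unicode and encode() returns bytes, a minimal sketch of the same call could look like this (the sample value is just the word from the question; the extra decode step is an addition so that the result is text again):

```python
# Python 3 sketch: str is already Unicode, and encode() returns bytes.
u = "öttinger"

# Non-ASCII characters are replaced by decimal character references;
# decoding the pure-ASCII bytes gives back an ordinary str.
s = u.encode("ascii", "xmlcharrefreplace").decode("ascii")

print(s)  # &#246;ttinger
```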

if you've stored charrefs as is in the database, you're in for some
serious trouble. assuming that all charrefs are hexadecimal charrefs,
you can use something like:

import re
def fixup(m): return "&#" + hex(int(m.group(1)))[1:]
s = re.sub(r"&#(\d+)", fixup, u.encode("ascii", "xmlcharrefreplace"))

to map all non-ASCII characters to charrefs, and then translate all
charrefs to hexadecimal charrefs.
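
the two steps described above (characters to decimal charrefs, then decimal to hexadecimal) can be sketched for Python 3 roughly as follows. the function name to_hex_charrefs is made up for this illustration, and, like the snippet above, it also rewrites any decimal charrefs that were already present in the input:

```python
import re

def to_hex_charrefs(text):
    """Replace non-ASCII characters with hexadecimal charrefs (hypothetical helper)."""
    # Step 1: non-ASCII characters -> decimal charrefs such as "&#246;".
    decimal = text.encode("ascii", "xmlcharrefreplace").decode("ascii")
    # Step 2: rewrite every decimal charref as a hexadecimal one.
    return re.sub(r"&#(\d+);", lambda m: "&#x%x;" % int(m.group(1)), decimal)

print(to_hex_charrefs("öttinger"))  # &#xf6;ttinger
```

plain ASCII input passes through unchanged, so the helper can be applied to every search string before it is compared with the stored values.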

decoding the charrefs *before* you add the strings to the database
is a better idea, though.
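
as a rough illustration of that alternative: in Python 3 the standard library's html.unescape turns both decimal and hexadecimal charrefs back into characters, so incoming text could be decoded before it is stored. the variable names are placeholders, and note that html.unescape also decodes named entities such as &amp;amp;, which may or may not be what you want:

```python
import html

# Text as it might arrive from a form, with charrefs in both notations.
incoming = "&#246;ttinger and &#xf6;ttinger"

# Decode the charrefs to real characters before writing to the database.
decoded = html.unescape(incoming)

print(decoded)  # öttinger and öttinger
```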

</F>
