Replace accented chars with unaccented ones

**Nicolas Bouillon** · Jul 18 '05, 09:27 AM

Re: Replace accented chars with unaccented ones

Thank you both for your answers. They both work very well.

At first I thought it didn't work, because I made the mistake of
forgetting the "u" prefix on the string: u"é". Because my file was already
UTF-8 encoded (# -*- coding: UTF-8 -*-), I thought the "u" was not necessary...
I was wrong.

Bye.

**Jeff Epler** · Jul 18 '05, 09:27 AM

Re: Replace accented chars with unaccented ones

You have two options. First, convert the string to Unicode and use code
like the following:

replacements = [(u'\xe9', 'e'), ...]
def remove_accents(u):
    for a, b in replacements:
        u = u.replace(a, b)
    return u

>>> remove_accents(u'\xe9')
u'e'

Second, if you are using a single-byte encoding (iso8859-1, for
instance), then work with byte strings:

import string
replacement_map = string.maketrans('\xe9...', 'e...')
def remove_accents(s):
    return s.translate(replacement_map)

>>> remove_accents('\xe9')
'e'

If you want to have strings like u'é' in your programs, you have to
include a line at the top of the source file that tells Python the
encoding, like the following line does:
# -*- coding: utf-8 -*-
(except you have to name the encoding your editor uses, if it's not
utf-8) See http://python.org/peps/pep-0263.html

Once you've done that, you can write
replacements = [(u'é', 'e'), ...]
instead of using the \xXX escape for it.

Jeff
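For readers on Python 3, where every str literal is already Unicode and the u prefix is optional, the two options above might look roughly like the sketch below. The REPLACEMENTS list, the TABLE name, and both function names are illustrative assumptions, not the poster's code; str.maketrans stands in for the Python 2 string.maketrans.

```python
# Python 3 sketch; the replacement pairs are a small illustrative sample.
REPLACEMENTS = [("\xe9", "e"), ("\xe8", "e"), ("\xe0", "a")]

def remove_accents_replace(text):
    """Option 1: one str.replace pass per replacement pair."""
    for accented, plain in REPLACEMENTS:
        text = text.replace(accented, plain)
    return text

# Option 2: build a translation table once, then translate in a single pass.
# Both arguments to str.maketrans must have the same length.
TABLE = str.maketrans("\xe9\xe8\xe0", "eea")

def remove_accents_translate(text):
    return text.translate(TABLE)
```

Option 1 rescans the text once per pair, while option 2 walks the text once regardless of how many characters the table covers.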

**Josiah Carlson** · Jul 18 '05, 09:27 AM

Re: Replace accented chars with unaccented ones

Jeff Epler wrote:

> You have two options. First, convert the string to Unicode and use code
> like the following:
>
> replacements = [(u'\xe9', 'e'), ...]
> def remove_accents(u):
>     for a, b in replacements:
>         u = u.replace(a, b)
>     return u
>
> >>> remove_accents(u'\xe9')
> u'e'
>
> Second, if you are using a single-byte encoding (iso8859-1, for
> instance), then work with byte strings:
>
> import string
> replacement_map = string.maketrans('\xe9...', 'e...')
> def remove_accents(s):
>     return s.translate(replacement_map)
>
> >>> remove_accents('\xe9')
> 'e'
>
> If you want to have strings like u'é' in your programs, you have to
> include a line at the top of the source file that tells Python the
> encoding, like the following line does:
> # -*- coding: utf-8 -*-
> (except you have to name the encoding your editor uses, if it's not
> utf-8) See http://python.org/peps/pep-0263.html
>
> Once you've done that, you can write
> replacements = [(u'é', 'e'), ...]
> instead of using the \xXX escape for it.

Translating the replacements pairs into a dictionary would result in a
significant speedup for large numbers of replacements.

mapping = dict(replacement_pairs)

def multi_replace(inp, mapping=mapping):
    return u''.join([mapping.get(i, i) for i in inp])

One pass through the file gives an O(len(inp)) algorithm, much better
(running-time wise) than the string.replace method that runs in
O(len(inp) * len(replacement_pairs)) time as given.

- Josiah
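A minimal Python 3 sketch of the dictionary approach described above, assuming a hypothetical replacement_pairs list in which every key is a single character (the per-character lookup cannot match longer keys):

```python
# Python 3 sketch; replacement_pairs is a hypothetical sample list.
replacement_pairs = [("\xe9", "e"), ("\xe8", "e"), ("\xe0", "a")]
mapping = dict(replacement_pairs)

def multi_replace(inp, mapping=mapping):
    # One dictionary lookup per character; unmapped characters pass through.
    return "".join(mapping.get(ch, ch) for ch in inp)

print(multi_replace("d\xe9j\xe0 vu"))
```

The work is proportional to the length of the input, independent of how many pairs the dictionary holds.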

**Fuzzyman** · Jul 18 '05, 09:27 AM

Re: Replace accented chars with unaccented ones

Nicolas Bouillon <bouil@bouil.org.invalid> wrote in message news:<EWx5c.30346$zm5.12052@nntpserver.swip.net>...
> Thank you both for your answers. They both work very well.
>
> At first I thought it didn't work, because I made the mistake of
> forgetting the "u" prefix on the string: u"é". Because my file was already
> UTF-8 encoded (# -*- coding: UTF-8 -*-), I thought the "u" was not necessary...
> I was wrong.
>
> Bye.

The 'utils1' package includes a file called charmap, which contains a
function to map text to ASCII. I believe it originally comes from a
'python snippet' on SourceForge.

http://www.voidspace.org.uk/atlantibots/pythonutils.html

Regards,

Fuzzy

**Michael Hudson** · Jul 18 '05, 09:27 AM

Re: Replace accented chars with unaccented ones

Jeff Epler <jepler@unpythonic.net> writes:

> You have two options. First, convert the string to Unicode and use code
> like the following:
>
> replacements = [(u'\xe9', 'e'), ...]
> def remove_accents(u):
>     for a, b in replacements:
>         u = u.replace(a, b)
>     return u

There must be some more high powered way of doing this... something
like:

import unicodedata

def remove_accent1(c):
    # Keep only the first code point of the character's decomposition.
    return unicodedata.normalize('NFD', c)[0]
def remove_accents(s):
    return u''.join(map(remove_accent1, s))

?

Cheers,
mwh

--
We've had a lot of problems going from glibc 2.0 to glibc 2.1.
People claim binary compatibility. Except for functions they
don't like. -- Peter Van Eynde, comp.lang.lisp
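A Python 3 variant of the normalization idea, sketched under the assumption that dropping combining marks is what is wanted. Instead of taking the first code point of each character's decomposition as the post does, this version decomposes the whole string once and filters with unicodedata.combining, which is a swapped-in technique rather than the poster's exact method.

```python
import unicodedata

def remove_accents(text):
    # NFD splits U+00E9 into "e" followed by U+0301 (combining acute accent).
    decomposed = unicodedata.normalize("NFD", text)
    # combining() returns 0 for base characters and non-zero for combining marks.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

print(remove_accents("d\xe9j\xe0"))
```

Characters that have no canonical decomposition, such as U+00F8 (o with stroke) or U+00DF (sharp s), come through unchanged, so a lookup table is still needed for those.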

**Jeff Epler** · Jul 18 '05, 09:28 AM

Re: Replace accented chars with unaccented ones

On Mon, Mar 15, 2004 at 06:19:00PM -0800, Josiah Carlson wrote:
> Translating the replacements pairs into a dictionary would result in a
> significant speedup for large numbers of replacements.
>
> mapping = dict(replacement_pairs)
>
> def multi_replace(inp, mapping=mapping):
>     return u''.join([mapping.get(i, i) for i in inp])
>
> One pass through the file gives an O(len(inp)) algorithm, much better
> (running-time wise) than the string.replace method that runs in
> O(len(inp) * len(replacement_pairs)) time as given.

Thanks for posting this. My other code was pretty hopeless, but for
some reason .get(i, i) didn't come to mind as a solution.

Jeff

**Jeff Epler** · Jul 18 '05, 09:28 AM

Re: Replace accented chars with unaccented ones

On Tue, Mar 16, 2004 at 08:26:08AM +0100, Nicolas Bouillon wrote:
> Thank you both for your answers. They both work very well.
>
> At first I thought it didn't work, because I made the mistake of
> forgetting the "u" prefix on the string: u"é". Because my file was already
> UTF-8 encoded (# -*- coding: UTF-8 -*-), I thought the "u" was not necessary...
> I was wrong.

When there are non-unicode string literals in a file, they are simply
byte sequences. Take this program, for instance:

# -*- coding: utf-8 -*-
s = "é"
print len(s), repr(s)

$ python bytestr.py
2 '\xc3\xa9'

Jeff
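The same distinction can be seen in Python 3, where the roles are reversed: a plain literal is text, and the byte sequence only appears after an explicit encode. A small sketch (the variable names are illustrative):

```python
text = "\xe9"                 # one code point, U+00E9
data = text.encode("utf-8")   # the UTF-8 byte sequence for that code point

print(len(text), ascii(text))  # 1 '\xe9'
print(len(data), repr(data))   # 2 b'\xc3\xa9'
```

The two-byte result matches the '\xc3\xa9' shown in the Python 2 session above.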

**Noah** · Jul 18 '05, 09:28 AM

Re: Replace accented chars with unaccented ones

Nicolas Bouillon <bouil@bouil.org.invalid> wrote in message news:<Tar5c.30313$zm5.12006@nntpserver.swip.net>...
> Hi
>
> I would like to replace accented chars (like "é", "è" or "à") with
> non-accented ones ("é" -> "e", "è" -> "e", "à" -> "a").
>
> I have tried the string.replace method, but it seems to dislike non-ASCII chars...

The following is the code that I use. This looks like what you are asking for.

In case this gets corrupted you can also find it here:

http://sourceforge.net/snippet/detail.php?type=snippet&id=101229

This has some improvements to readability and speed, but it is basically
the same:

http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/251871

Yours,
Noah

#!/usr/bin/env python
"""
UNICODE Hammer -- The Stupid American

I needed something that would take a UNICODE string and
smack it into ASCII. This function doesn't just strip out the characters.
It tries to convert Latin-1 characters into ASCII equivalents where possible.

We get customer mailing address data from Europe, but most of our systems
cannot handle the Latin-1 characters. All I needed was to prepare addresses
for a few different shipping systems that we use.
None of these systems support anything but ASCII.
After getting headaches trying to deal with this problem using Python's
built-in UNICODE support I gave up and decided to write something that
would solve the problem the American way -- with brute force.
I convert all european accented letters to their unaccented equivalents.
I realize this isn't perfect, but for my purposes the packages get delivered.

Noah Spurrier noah@noah.org
License free and public domain
"""

def latin1_to_ascii(unicrap):
    """This replaces UNICODE Latin-1 characters with
    something equivalent in 7-bit ASCII. All characters in the standard
    7-bit ASCII range are preserved. In the 8th bit range all the Latin-1
    accented letters are stripped of their accents. Most symbol characters
    are converted to something meaningful. Anything not converted is deleted.
    """
    xlate = {0xc0:'A', 0xc1:'A', 0xc2:'A', 0xc3:'A', 0xc4:'A', 0xc5:'A',
        0xc6:'Ae', 0xc7:'C',
        0xc8:'E', 0xc9:'E', 0xca:'E', 0xcb:'E',
        0xcc:'I', 0xcd:'I', 0xce:'I', 0xcf:'I',
        0xd0:'Th', 0xd1:'N',
        0xd2:'O', 0xd3:'O', 0xd4:'O', 0xd5:'O', 0xd6:'O', 0xd8:'O',
        0xd9:'U', 0xda:'U', 0xdb:'U', 0xdc:'U',
        0xdd:'Y', 0xde:'th', 0xdf:'ss',
        0xe0:'a', 0xe1:'a', 0xe2:'a', 0xe3:'a', 0xe4:'a', 0xe5:'a',
        0xe6:'ae', 0xe7:'c',
        0xe8:'e', 0xe9:'e', 0xea:'e', 0xeb:'e',
        0xec:'i', 0xed:'i', 0xee:'i', 0xef:'i',
        0xf0:'th', 0xf1:'n',
        0xf2:'o', 0xf3:'o', 0xf4:'o', 0xf5:'o', 0xf6:'o', 0xf8:'o',
        0xf9:'u', 0xfa:'u', 0xfb:'u', 0xfc:'u',
        0xfd:'y', 0xfe:'th', 0xff:'y',
        0xa1:'!', 0xa2:'{cent}', 0xa3:'{pound}', 0xa4:'{currency}',
        0xa5:'{yen}', 0xa6:'|', 0xa7:'{section}', 0xa8:'{umlaut}',
        0xa9:'{C}', 0xaa:'{^a}', 0xab:'<<', 0xac:'{not}',
        0xad:'-', 0xae:'{R}', 0xaf:'_', 0xb0:'{degrees}',
        0xb1:'{+/-}', 0xb2:'{^2}', 0xb3:'{^3}', 0xb4:"'",
        0xb5:'{micro}', 0xb6:'{paragraph}', 0xb7:'*', 0xb8:'{cedilla}',
        0xb9:'{^1}', 0xba:'{^o}', 0xbb:'>>',
        0xbc:'{1/4}', 0xbd:'{1/2}', 0xbe:'{3/4}', 0xbf:'?',
        0xd7:'*', 0xf7:'/'
        }

    r = ''
    for i in unicrap:
        if xlate.has_key(ord(i)):
            r += xlate[ord(i)]
        elif ord(i) >= 0x80:
            pass
        else:
            r += i
    return r

# This gives an example of how to use latin1_to_ascii().
# This creates a string with all the characters in the latin-1 character set
# then it converts the string to plain 7-bit ASCII.
if __name__ == '__main__':
    s = unicode('','latin-1')
    for c in range(32,256):
        if c != 0x7f:
            s = s + unicode(chr(c), 'latin-1')
    print 'INPUT:'
    print s.encode('latin-1')
    print
    print 'OUTPUT:'
    print latin1_to_ascii(s)

**Martin v. Löwis** · Jul 18 '05, 09:28 AM

Re: Replace accented chars with unaccented ones

Josiah Carlson wrote:
> Translating the replacements pairs into a dictionary would result in a
> significant speedup for large numbers of replacements.
>
> mapping = dict(replacement_pairs)
>
> def multi_replace(inp, mapping=mapping):
>     return u''.join([mapping.get(i, i) for i in inp])

Using the .translate() method on unicode strings should be
even more performant:

# prepare mapping table to match .translate interface
table = {}
for k, v in replacement_pairs:
    table[ord(k)] = v

def multi_replace(inp):
    return inp.translate(table)

Regards,
Martin
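In Python 3 the same table shape works with str.translate, which accepts a dict keyed by code point whose values may be strings of any length (or None to delete the character). A short sketch with a hypothetical replacement_pairs sample:

```python
# Python 3 sketch; replacement_pairs is a hypothetical sample list.
replacement_pairs = [("\xe9", "e"), ("\xdf", "ss"), ("\xe6", "ae")]

# str.translate looks characters up by their integer code point.
table = {ord(k): v for k, v in replacement_pairs}

def multi_replace(inp):
    return inp.translate(table)

print(multi_replace("stra\xdfe"))
```

Because values can be longer than one character, expansions such as sharp s to "ss" need no special handling.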

**Josiah Carlson** · Jul 18 '05, 09:29 AM

Re: Replace accented chars with unaccented ones

> r += xlate[ord(i)]
> r += i

Perhaps I'm going to have to create a signature and drop information
about this in every post to c.l.py, but repeated string additions are
slow as hell for any reasonably long string. It is much
faster to place characters into a list and ''.join() them.

>>> import time
>>> def test_s(l):
...     t = time.time()
...     for i in xrange(100):
...         a = ''
...         for j in xrange(l):
...             a += '0'
...     return time.time()-t
...
>>> def test_l(l):
...     t = time.time()
...     for i in xrange(100):
...         a = ''.join(['0' for j in xrange(l)])
...     return time.time()-t
...
>>> i = 128
>>> while i < 4097:
...     print test_s(i), test_l(i)
...     i *= 2
...
0.0150001049042 0.0309998989105
0.0469999313354 0.047000169754
0.140999794006 0.109000205994
0.343999862671 0.203000068665
0.905999898911 0.40700006485
2.56200003624 0.828000068665

At 256 characters long, it looks about even. Anything longer and
''.join(lst) is significantly faster.

When we do something like the below, the overhead of creating short
lists is significant, but it is still faster when l is greater than
roughly 2048:
a = []
for i in xrange(l):
    a += ['0']

- Josiah
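To repeat this kind of measurement on a current interpreter, a sketch using the standard timeit module might look like the following. The function names and the chosen sizes are assumptions for illustration, and no timings are shown because they depend on the interpreter and machine; later CPython releases optimize some in-place string concatenation, so the gap may not match the 2004 figures above.

```python
import timeit

def concat(n):
    # Build the string by repeated addition.
    s = ""
    for _ in range(n):
        s += "0"
    return s

def join_list(n):
    # Build a list of pieces and join it once at the end.
    return "".join(["0" for _ in range(n)])

if __name__ == "__main__":
    for n in (128, 1024, 4096):
        t_concat = timeit.timeit(lambda: concat(n), number=100)
        t_join = timeit.timeit(lambda: join_list(n), number=100)
        print(n, t_concat, t_join)
```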

**Josiah Carlson** · Jul 18 '05, 09:29 AM

Re: Replace accented chars with unaccented ones

> Using the .translate() method on unicode strings should be
> even more performant:
>
> # prepare mapping table to match .translate interface
> table = {}
> for k, v in replacement_pairs:
>     table[ord(k)] = v
>
> def multi_replace(inp):
>     return inp.translate(table)

Even better *smile*.

- Josiah

**Noah** · Jul 18 '05, 09:30 AM

Re: Replace accented chars with unaccented ones

Josiah Carlson <jcarlson@nospam.uci.edu> wrote in message news:<c37ugc$llq$1@news.service.uci.edu>...
> > r += xlate[ord(i)]
> > r += i
>
> Perhaps I'm going to have to create a signature and drop information
> about this in every post to c.l.py, but repeated string additions are
> slow as hell for any reasonably long string. It is much
> faster to place characters into a list and ''.join() them.

True. Is this better?

... body of latin1_to_ascii() ...
r = []
for i in unicrap:
    if xlate.has_key(ord(i)):
        r.append(xlate[ord(i)])
    elif ord(i) >= 0x80:
        pass
    else:
        r.append(i)
return ''.join(r)

Yours,
Noah

**Josiah Carlson** · Jul 18 '05, 09:31 AM

Re: Replace accented chars with unaccented ones

Noah wrote:

> Josiah Carlson <jcarlson@nospam.uci.edu> wrote in message news:<c37ugc$llq$1@news.service.uci.edu>...
>
> > > r += xlate[ord(i)]
> > > r += i
> >
> > Perhaps I'm going to have to create a signature and drop information
> > about this in every post to c.l.py, but repeated string additions are
> > slow as hell for any reasonably long string. It is much
> > faster to place characters into a list and ''.join() them.
>
> True. Is this better?
>
> ... body of latin1_to_ascii() ...
> r = []
> for i in unicrap:
>     if xlate.has_key(ord(i)):
>         r.append(xlate[ord(i)])
>     elif ord(i) >= 0x80:
>         pass
>     else:
>         r.append(i)
> return ''.join(r)

I'd use:
''.join([xlate.get(ord(i), i) for i in unicrap
         if ord(i) in xlate or ord(i) < 0x80])

Using r.append(), in general, while being faster than string addition,
is significantly slower than using list comprehensions.

- Josiah
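Put together as a complete function, the one-liner above could be sketched in Python 3 as follows. The xlate dict here is a deliberately tiny, hypothetical subset of the table posted earlier in the thread, just enough to show the filter-and-map behavior.

```python
# Hypothetical subset of the xlate table from the earlier post.
xlate = {0xe9: "e", 0xe8: "e", 0xdf: "ss", 0xa9: "{C}"}

def latin1_to_ascii(text):
    # Keep a character if it has a translation or is already 7-bit ASCII;
    # anything else at or above 0x80 is silently dropped.
    return "".join(
        xlate.get(ord(ch), ch)
        for ch in text
        if ord(ch) in xlate or ord(ch) < 0x80
    )

print(latin1_to_ascii("caf\xe9 \xa9"))
```

The condition in the comprehension plays the role of the elif/pass branch in the loop version: characters that fail it never reach the join.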

**AdSR** · Jul 18 '05, 09:32 AM

Re: Replace accented chars with unaccented ones

Nicolas Bouillon <bouil@bouil.org.invalid> wrote:
> Hi
>
> I would like to replace accented chars (like "é", "è" or "à") with
> non-accented ones ("é" -> "e", "è" -> "e", "à" -> "a").
>
> I have tried the string.replace method, but it seems to dislike non-ASCII chars...
>
> Can you help me please?
> Thanks.

You could try experimenting with the 'unicodedata' module:

>>> import unicodedata
>>> [unicodedata.name(x) for x in u'123 abc @#$ \u00ff']
['DIGIT ONE', 'DIGIT TWO', 'DIGIT THREE', 'SPACE',
'LATIN SMALL LETTER A', 'LATIN SMALL LETTER B', 'LATIN SMALL LETTER C',
'SPACE', 'COMMERCIAL AT', 'NUMBER SIGN', 'DOLLAR SIGN', 'SPACE',
'LATIN SMALL LETTER Y WITH DIAERESIS']
>>> unicodedata.lookup('latin capital letter a with grave')
u'\xc0'

You could strip the ' WITH...' part of the name when applicable and
convert the shortened names back to characters with unicodedata.lookup().
You would only need to process characters with ord >= 160.

HTH,

AdSR
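One way the name-stripping idea might be sketched in Python 3 is shown below. The helper name strip_accent_by_name is hypothetical, and the fallbacks (return the character unchanged when it has no name, no " WITH " part, or no matching base name) are assumptions about the desired behavior, not something the post specifies.

```python
import unicodedata

def strip_accent_by_name(ch):
    if ord(ch) < 160:
        return ch  # ASCII and control range: nothing to strip
    try:
        name = unicodedata.name(ch)
    except ValueError:
        return ch  # character has no name in the database
    base, sep, _ = name.partition(" WITH ")
    if not sep:
        return ch  # name has no " WITH ..." qualifier
    try:
        return unicodedata.lookup(base)
    except KeyError:
        return ch  # stripped name does not identify a character

def remove_accents(text):
    return "".join(strip_accent_by_name(ch) for ch in text)

print(remove_accents("\xc0 la carte"))
```

Unlike the normalization approach, this also maps letters such as U+00F8 (LATIN SMALL LETTER O WITH STROKE) to their base letter, since it works from the name rather than the decomposition.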
