Replace accented chars with unaccented ones

  • Nicolas Bouillon

    Replace accented chars with unaccented ones

    Hi

    I would like to replace accented chars (like "é", "è" or "à") with
    unaccented ones ("é" -> "e", "è" -> "e", "à" -> "a").

    I have tried the string.replace method, but it seems to dislike non-ASCII chars...

    Can you help me, please?
    Thanks.
  • Nicolas Bouillon

    #2
    Re: Replace accented chars with unaccented ones

    Thank you both for your answers. They both work very well.

    At first I believed they didn't work, because the mistake I had made was
    forgetting the "u" prefix on the string: u"é". Because my file was already
    UTF-8 encoded (# -*- coding: UTF-8 -*-), I thought the "u" was not
    necessary... I was wrong.

    Bye.


    • Jeff Epler

      #3
      Re: Replace accented chars with unaccented ones

      You have two options. First, convert the string to Unicode and use code
      like the following:

      replacements = [(u'\xe9', 'e'), ...]
      def remove_accents(u):
          for a, b in replacements:
              u = u.replace(a, b)
          return u

      >>> remove_accents(u'\xe9')
      u'e'

      Second, if you are using a single-byte encoding (iso8859-1, for
      instance), then work with byte strings:

      import string
      replacement_map = string.maketrans('\xe9...', 'e...')
      def remove_accents(s):
          return s.translate(replacement_map)

      >>> remove_accents('\xe9')
      'e'

      If you want to have strings like u'é' in your programs, you have to
      include a line at the top of the source file that tells Python the
      encoding, like the following line does:
      # -*- coding: utf-8 -*-
      (except you have to name the encoding your editor uses, if it's not
      utf-8). See http://python.org/peps/pep-0263.html

      Once you've done that, you can write
      replacements = [(u'é', 'e'), ...]
      instead of using the \xXX escape for it.

      Jeff


      • Josiah Carlson

        #4
        Re: Replace accented chars with unaccented ones

        Jeff Epler wrote:

        > You have two options. First, convert the string to Unicode and use code
        > like the following:
        >
        > replacements = [(u'\xe9', 'e'), ...]
        > def remove_accents(u):
        >     for a, b in replacements:
        >         u = u.replace(a, b)
        >     return u
        >
        > >>> remove_accents(u'\xe9')
        > u'e'
        >
        > Second, if you are using a single-byte encoding (iso8859-1, for
        > instance), then work with byte strings:
        > replacement_map = string.maketrans('\xe9...', 'e...')
        > def remove_accents(s):
        >     return s.translate(replacement_map)
        >
        > >>> remove_accents('\xe9')
        > 'e'
        >
        > If you want to have strings like u'é' in your programs, you have to
        > include a line at the top of the source file that tells Python the
        > encoding, like the following line does:
        > # -*- coding: utf-8 -*-
        > (except you have to name the encoding your editor uses, if it's not
        > utf-8). See http://python.org/peps/pep-0263.html
        >
        > Once you've done that, you can write
        > replacements = [(u'é', 'e'), ...]
        > instead of using the \xXX escape for it.

        Translating the replacement pairs into a dictionary would result in a
        significant speedup for large numbers of replacements.

        mapping = dict(replacement_pairs)

        def multi_replace(inp, mapping=mapping):
            return u''.join([mapping.get(i, i) for i in inp])

        One pass through the file gives an O(len(inp)) algorithm, much better
        (running-time wise) than the string.replace method that runs in
        O(len(inp) * len(replacement_pairs)) time as given.
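
For readers on Python 3, the single-pass dictionary lookup described above might be sketched as follows. This is only an illustration: the three pairs in `replacement_pairs` are assumed sample data (the thread never lists the full table), and it assumes the input uses precomposed characters.

```python
# Python 3 sketch of the dictionary-lookup approach.
# The pairs below are illustrative sample data, not a complete table.
replacement_pairs = [
    ("\u00e9", "e"),  # e with acute accent
    ("\u00e8", "e"),  # e with grave accent
    ("\u00e0", "a"),  # a with grave accent
]
mapping = dict(replacement_pairs)


def multi_replace(inp, mapping=mapping):
    """Look each character up once; characters not in the mapping pass through."""
    return "".join(mapping.get(ch, ch) for ch in inp)


print(multi_replace("d\u00e9j\u00e0 vu"))  # prints: deja vu
```

Each character costs one dictionary lookup, so the running time grows with the length of the input rather than with the number of pairs.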

        - Josiah


        • Fuzzyman

          #5
          Re: Replace accented chars with unaccented ones

          Nicolas Bouillon <bouil@bouil.org.invalid> wrote in message news:<EWx5c.30346$zm5.12052@nntpserver.swip.net>...
          > Thank you both for your answers. They both work very well.
          >
          > At first I believed they didn't work, because the mistake I had made was
          > forgetting the "u" prefix on the string: u"é". Because my file was already
          > UTF-8 encoded (# -*- coding: UTF-8 -*-), I thought the "u" was not
          > necessary... I was wrong.
          >
          > Bye.

          The 'utils1' package includes a file called charmap, which contains a
          function to map to ASCII. It originally comes from a 'Python
          snippet' on SourceForge, I believe.



          Regards,


          Fuzzy


          • Michael Hudson

            #6
            Re: Replace accented chars with unaccented ones

            Jeff Epler <jepler@unpythonic.net> writes:

            > You have two options. First, convert the string to Unicode and use code
            > like the following:
            >
            > replacements = [(u'\xe9', 'e'), ...]
            > def remove_accents(u):
            >     for a, b in replacements:
            >         u = u.replace(a, b)
            >     return u

            There must be some more high-powered way of doing this... something
            like:

            import unicodedata

            def remove_accent1(c):
                return unicodedata.normalize('NFD', c)[0]
            def remove_accents(s):
                return u''.join(map(remove_accent1, s))

            ?
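
A Python 3 variant of this normalization idea is sketched below. Instead of keeping only the first code point of each decomposed character, it decomposes the whole string and drops every combining mark using `unicodedata.combining`. This is a swapped-in technique offered as an assumption about what would work well here, not the code posted in the thread.

```python
import unicodedata


def remove_accents(s):
    """Decompose to NFD, then drop combining marks (accents, diaereses, ...)."""
    decomposed = unicodedata.normalize("NFD", s)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(remove_accents("\u00e9\u00e8\u00e0"))  # prints: eea
```

Letters that have no canonical decomposition, such as "ø" or "ß", pass through unchanged, so a small explicit table is probably still needed for those.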

            Cheers,
            mwh

            --
            We've had a lot of problems going from glibc 2.0 to glibc 2.1.
            People claim binary compatibility. Except for functions they
            don't like. -- Peter Van Eynde, comp.lang.lisp


            • Jeff Epler

              #7
              Re: Replace accented chars with unaccented ones

              On Mon, Mar 15, 2004 at 06:19:00PM -0800, Josiah Carlson wrote:
              > Translating the replacement pairs into a dictionary would result in a
              > significant speedup for large numbers of replacements.
              >
              > mapping = dict(replacement_pairs)
              >
              > def multi_replace(inp, mapping=mapping):
              >     return u''.join([mapping.get(i, i) for i in inp])
              >
              > One pass through the file gives an O(len(inp)) algorithm, much better
              > (running-time wise) than the string.replace method that runs in
              > O(len(inp) * len(replacement_pairs)) time as given.

              Thanks for posting this. My other code was pretty hopeless, but for
              some reason .get(i, i) didn't come to mind as a solution.

              Jeff


              • Jeff Epler

                #8
                Re: Replace accented chars with unaccented ones

                On Tue, Mar 16, 2004 at 08:26:08AM +0100, Nicolas Bouillon wrote:
                > Thank you both for your answers. They both work very well.
                >
                > At first I believed they didn't work, because the mistake I had made was
                > forgetting the "u" prefix on the string: u"é". Because my file was already
                > UTF-8 encoded (# -*- coding: UTF-8 -*-), I thought the "u" was not
                > necessary... I was wrong.

                When there are non-unicode string literals in a file, they are simply
                byte sequences. Take this program, for instance:

                # -*- coding: utf-8 -*-
                s = "é"
                print len(s), repr(s)

                $ python bytestr.py
                2 '\xc3\xa9'
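
The same distinction can be seen in Python 3, where text (`str`) and bytes are separate types and the `u` prefix is no longer needed. A minimal sketch, using an escape so the result does not depend on the source file's encoding:

```python
text = "\u00e9"              # one Unicode character: e with acute accent
data = text.encode("utf-8")  # the same character as a UTF-8 byte sequence

print(len(text), ascii(text))  # prints: 1 '\xe9'
print(len(data), data)         # prints: 2 b'\xc3\xa9'
```

A replacement written against the one-character text form will not match the two-byte encoded form, which is the mismatch described above.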

                Jeff


                • Noah

                  #9
                  Re: Replace accented chars with unaccented ones

                  Nicolas Bouillon <bouil@bouil.org.invalid> wrote in message news:<Tar5c.30313$zm5.12006@nntpserver.swip.net>...
                  > Hi
                  >
                  > I would like to replace accented chars (like "é", "è" or "à") with
                  > unaccented ones ("é" -> "e", "è" -> "e", "à" -> "a").
                  >
                  > I have tried the string.replace method, but it seems to dislike non-ASCII chars...

                  The following is the code that I use. This looks like what you are asking for.

                  In case this gets corrupted, you can also find it on SourceForge.net.

                  This has some improvements to readability and speed, but it is basically
                  the same:


                  Yours,
                  Noah

                  #!/usr/bin/env python
                  """
                  UNICODE Hammer -- The Stupid American

                  I needed something that would take a UNICODE string and
                  smack it into ASCII. This function doesn't just strip out the characters.
                  It tries to convert Latin-1 characters into ASCII equivalents where possible.

                  We get customer mailing address data from Europe, but most of our systems
                  cannot handle the Latin-1 characters. All I needed was to prepare addresses
                  for a few different shipping systems that we use.
                  None of these systems support anything but ASCII.
                  After getting headaches trying to deal with this problem using Python's
                  built-in UNICODE support I gave up and decided to write something that
                  would solve the problem the American way -- with brute force.
                  I convert all european accented letters to their unaccented equivalents.
                  I realize this isn't perfect, but for my purposes the packages get delivered.

                  Noah Spurrier noah@noah.org
                  License free and public domain
                  """

                  def latin1_to_ascii(unicrap):
                      """This replaces UNICODE Latin-1 characters with
                      something equivalent in 7-bit ASCII. All characters in the standard
                      7-bit ASCII range are preserved. In the 8th bit range all the Latin-1
                      accented letters are stripped of their accents. Most symbol characters
                      are converted to something meaningful. Anything not converted is deleted.
                      """
                      xlate={0xc0:'A', 0xc1:'A', 0xc2:'A', 0xc3:'A', 0xc4:'A', 0xc5:'A',
                          0xc6:'Ae', 0xc7:'C',
                          0xc8:'E', 0xc9:'E', 0xca:'E', 0xcb:'E',
                          0xcc:'I', 0xcd:'I', 0xce:'I', 0xcf:'I',
                          0xd0:'Th', 0xd1:'N',
                          0xd2:'O', 0xd3:'O', 0xd4:'O', 0xd5:'O', 0xd6:'O', 0xd8:'O',
                          0xd9:'U', 0xda:'U', 0xdb:'U', 0xdc:'U',
                          0xdd:'Y', 0xde:'th', 0xdf:'ss',
                          0xe0:'a', 0xe1:'a', 0xe2:'a', 0xe3:'a', 0xe4:'a', 0xe5:'a',
                          0xe6:'ae', 0xe7:'c',
                          0xe8:'e', 0xe9:'e', 0xea:'e', 0xeb:'e',
                          0xec:'i', 0xed:'i', 0xee:'i', 0xef:'i',
                          0xf0:'th', 0xf1:'n',
                          0xf2:'o', 0xf3:'o', 0xf4:'o', 0xf5:'o', 0xf6:'o', 0xf8:'o',
                          0xf9:'u', 0xfa:'u', 0xfb:'u', 0xfc:'u',
                          0xfd:'y', 0xfe:'th', 0xff:'y',
                          0xa1:'!', 0xa2:'{cent}', 0xa3:'{pound}', 0xa4:'{currency}',
                          0xa5:'{yen}', 0xa6:'|', 0xa7:'{section}', 0xa8:'{umlaut}',
                          0xa9:'{C}', 0xaa:'{^a}', 0xab:'<<', 0xac:'{not}',
                          0xad:'-', 0xae:'{R}', 0xaf:'_', 0xb0:'{degrees}',
                          0xb1:'{+/-}', 0xb2:'{^2}', 0xb3:'{^3}', 0xb4:"'",
                          0xb5:'{micro}', 0xb6:'{paragraph}', 0xb7:'*', 0xb8:'{cedilla}',
                          0xb9:'{^1}', 0xba:'{^o}', 0xbb:'>>',
                          0xbc:'{1/4}', 0xbd:'{1/2}', 0xbe:'{3/4}', 0xbf:'?',
                          0xd7:'*', 0xf7:'/'
                          }

                      r = ''
                      for i in unicrap:
                          if xlate.has_key(ord(i)):
                              r += xlate[ord(i)]
                          elif ord(i) >= 0x80:
                              pass
                          else:
                              r += i
                      return r

                  # This gives an example of how to use latin1_to_ascii().
                  # This creates a string with all the characters in the latin-1 character set
                  # then it converts the string to plain 7-bit ASCII.
                  if __name__ == '__main__':
                      s = unicode('','latin-1')
                      for c in range(32,256):
                          if c != 0x7f:
                              s = s + unicode(chr(c), 'latin-1')
                      print 'INPUT:'
                      print s.encode('latin-1')
                      print
                      print 'OUTPUT:'
                      print latin1_to_ascii(s)


                  • Martin v. Löwis

                    #10
                    Re: Replace accented chars with unaccented ones

                    Josiah Carlson wrote:
                    > Translating the replacement pairs into a dictionary would result in a
                    > significant speedup for large numbers of replacements.
                    >
                    > mapping = dict(replacement_pairs)
                    >
                    > def multi_replace(inp, mapping=mapping):
                    >     return u''.join([mapping.get(i, i) for i in inp])

                    Using the .translate() method on unicode strings should be
                    even more performant:

                    # prepare mapping table to match .translate interface
                    table = {}
                    for k, v in replacement_pairs:
                        table[ord(k)] = v

                    def multi_replace(inp):
                        return inp.translate(table)
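
In Python 3 the same approach works directly on `str`. The sketch below builds the table keyed by code point, as `.translate()` expects; the pairs are assumed sample data, and a value may be longer than one character.

```python
# Python 3 sketch of the .translate() approach; sample pairs are illustrative.
replacement_pairs = [
    ("\u00e9", "e"),   # e with acute accent
    ("\u00e8", "e"),   # e with grave accent
    ("\u00e0", "a"),   # a with grave accent
    ("\u00df", "ss"),  # sharp s, mapped to two characters
]

# str.translate() looks characters up by their integer code point.
table = {ord(k): v for k, v in replacement_pairs}


def multi_replace(inp):
    return inp.translate(table)


print(multi_replace("d\u00e9j\u00e0"))    # prints: deja
print(multi_replace("stra\u00dfe"))       # prints: strasse
```

`str.maketrans(dict(replacement_pairs))` would build an equivalent table from single-character keys.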

                    Regards,
                    Martin


                    • Josiah Carlson

                      #11
                      Re: Replace accented chars with unaccented ones

                      > r += xlate[ord(i)]
                      > r += i

                      Perhaps I'm going to have to create a signature and drop information
                      about this in every post to c.l.py, but repeated string additions are
                      slow as hell for any reasonably long string. It is much
                      faster to place characters into a list and ''.join() them.

                      >>> import time
                      >>> def test_s(l):
                      ...     t = time.time()
                      ...     for i in xrange(100):
                      ...         a = ''
                      ...         for j in xrange(l):
                      ...             a += '0'
                      ...     return time.time()-t
                      ...
                      >>> def test_l(l):
                      ...     t = time.time()
                      ...     for i in xrange(100):
                      ...         a = ''.join(['0' for j in xrange(l)])
                      ...     return time.time()-t
                      ...
                      >>> i = 128
                      >>> while i < 4097:
                      ...     print test_s(i), test_l(i)
                      ...     i *= 2
                      ...
                      0.0150001049042 0.0309998989105
                      0.0469999313354 0.047000169754
                      0.140999794006 0.109000205994
                      0.343999862671 0.203000068665
                      0.905999898911 0.40700006485
                      2.56200003624 0.828000068665

                      At 256 characters long, it looks about even. Anything longer and
                      ''.join(lst) is significantly faster.

                      When we do something like the below, the overhead of creating short
                      lists is significant, but it is still faster when l is greater than
                      roughly 2048:

                      a = []
                      for i in xrange(l):
                          a += ['0']
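
To repeat this comparison on a current interpreter, a sketch using the standard `timeit` module is given below. The function names are hypothetical, and no timings are shown because they depend on the machine and Python version; recent CPython releases can often optimize `+=` on strings, so the gap may be smaller than the 2004 figures above.

```python
import timeit


def build_concat(n):
    """Build a string of n zeros by repeated concatenation."""
    s = ""
    for _ in range(n):
        s += "0"
    return s


def build_join(n):
    """Build the same string by collecting pieces and joining once."""
    return "".join(["0" for _ in range(n)])


# Time 100 runs of each builder at a few sizes, as in the session above.
for n in (128, 1024, 4096):
    t_concat = timeit.timeit(lambda: build_concat(n), number=100)
    t_join = timeit.timeit(lambda: build_join(n), number=100)
    print(n, round(t_concat, 4), round(t_join, 4))
```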


                      - Josiah


                      • Josiah Carlson

                        #12
                        Re: Replace accented chars with unaccented ones

                        > Using the .translate() method on unicode strings should be
                        > even more performant:
                        >
                        > # prepare mapping table to match .translate interface
                        > table = {}
                        > for k, v in replacement_pairs:
                        >     table[ord(k)] = v
                        >
                        > def multi_replace(inp):
                        >     return inp.translate(table)

                        Even better *smile*.

                        - Josiah


                        • Noah

                          #13
                          Re: Replace accented chars with unaccented ones

                          Josiah Carlson <jcarlson@nospam.uci.edu> wrote in message news:<c37ugc$llq$1@news.service.uci.edu>...
                          > > r += xlate[ord(i)]
                          > > r += i
                          >
                          > Perhaps I'm going to have to create a signature and drop information
                          > about this in every post to c.l.py, but repeated string additions are
                          > slow as hell for any reasonably long string. It is much
                          > faster to place characters into a list and ''.join() them.

                          True. Is this better?

                          ... body of latin1_to_ascii() ...
                          r = []
                          for i in unicrap:
                              if xlate.has_key(ord(i)):
                                  r.append(xlate[ord(i)])
                              elif ord(i) >= 0x80:
                                  pass
                              else:
                                  r.append(i)
                          return ''.join(r)


                          Yours,
                          Noah


                          • Josiah Carlson

                            #14
                            Re: Replace accented chars with unaccented ones

                            Noah wrote:

                            > Josiah Carlson <jcarlson@nospam.uci.edu> wrote in message news:<c37ugc$llq$1@news.service.uci.edu>...
                            >
                            >>> r += xlate[ord(i)]
                            >>> r += i
                            >>
                            >> Perhaps I'm going to have to create a signature and drop information
                            >> about this in every post to c.l.py, but repeated string additions are
                            >> slow as hell for any reasonably long string. It is much
                            >> faster to place characters into a list and ''.join() them.
                            >
                            > True. Is this better?
                            >
                            > ... body of latin1_to_ascii() ...
                            > r = []
                            > for i in unicrap:
                            >     if xlate.has_key(ord(i)):
                            >         r.append(xlate[ord(i)])
                            >     elif ord(i) >= 0x80:
                            >         pass
                            >     else:
                            >         r.append(i)
                            > return ''.join(r)

                            I'd use:
                            ''.join([xlate.get(ord(i), i) for i in unicrap
                                     if ord(i) in xlate or ord(i) < 0x80])

                            Using r.append(), in general, while being faster than string addition,
                            is significantly slower than using list comprehensions.
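
Put together as a Python 3 function, the comprehension above might look like the sketch below. The `xlate` table here is a deliberately small, assumed subset of the full table posted earlier in the thread, just enough to show the three cases: mapped, plain ASCII, and dropped.

```python
# Small illustrative subset of the Latin-1 table from the earlier post.
xlate = {0xe9: "e", 0xe8: "e", 0xe0: "a", 0xdf: "ss"}


def latin1_to_ascii(text):
    """Map known characters, keep 7-bit ASCII, and drop everything else."""
    return "".join(
        xlate.get(ord(ch), ch)
        for ch in text
        if ord(ch) in xlate or ord(ch) < 0x80
    )


print(latin1_to_ascii("caf\u00e9"))   # prints: cafe
print(latin1_to_ascii("a\u00b5b"))    # prints: ab (0xb5 is not in this subset)
```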

                            - Josiah


                            • AdSR

                              #15
                              Re: Replace accented chars with unaccented ones

                              Nicolas Bouillon <bouil@bouil.org.invalid> wrote:
                              > Hi
                              >
                              > I would like to replace accented chars (like "é", "è" or "à") with
                              > unaccented ones ("é" -> "e", "è" -> "e", "à" -> "a").
                              >
                              > I have tried the string.replace method, but it seems to dislike non-ASCII chars...
                              >
                              > Can you help me, please?
                              > Thanks.

                              You could try experimenting with the 'unicodedata' module:

                              >>> import unicodedata
                              >>> [unicodedata.name(x) for x in u'123 abc @#$ \u00ff']
                              ['DIGIT ONE', 'DIGIT TWO', 'DIGIT THREE', 'SPACE', 'LATIN SMALL LETTER
                              A', 'LATIN SMALL LETTER B', 'LATIN SMALL LETTER C', 'SPACE',
                              'COMMERCIAL AT', 'NUMBER SIGN', 'DOLLAR SIGN', 'SPACE', 'LATIN SMALL
                              LETTER Y WITH DIAERESIS']
                              >>> unicodedata.lookup('latin capital letter a with grave')
                              u'\xc0'

                              You could strip the ' WITH...' part when applicable and convert the names
                              back to characters. You would only need to process characters with ord >=
                              160.
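
A rough Python 3 sketch of this name-based idea follows. The function names are hypothetical, and the approach relies on the assumption that the part of the character name before " WITH " is itself a valid character name; when it is not, the character is left unchanged.

```python
import unicodedata


def strip_with_suffix(ch):
    """Return the base letter named before ' WITH ', or ch if there is none."""
    if ord(ch) < 160:
        return ch
    name = unicodedata.name(ch, "")
    base, sep, _ = name.partition(" WITH ")
    if not sep:
        return ch  # name has no ' WITH ...' part (or the character is unnamed)
    try:
        return unicodedata.lookup(base)
    except KeyError:
        return ch  # the shortened name is not a real character name


def remove_accents(s):
    return "".join(strip_with_suffix(ch) for ch in s)


print(remove_accents("\u00e9\u00c0\u00ff"))  # prints: eAy
```

Unlike decomposition, this also catches letters such as "ø" (LATIN SMALL LETTER O WITH STROKE), at the cost of a name lookup per non-ASCII character.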

                              HTH,

                              AdSR
