accented characters to unaccented

  • s2krish
    New Member
    • Jun 2010
    • 2

    accented characters to unaccented

    Is there a Python library or function to convert accented characters to unaccented ones? For example:

    From 'Terme \xc4\x8cate\xc5\xbe' to 'Terme Čatež'.

    When I read the Terme Čatež website using the urllib.urlopen() function, it gives 'Terme \xc4\x8cate\xc5\xbe'.
  • Glenton
    Recognized Expert Contributor
    • Nov 2008
    • 391

    #2
    This doesn't answer your question, but you could include
    Code:
    # -*- coding: utf-8 -*-
    in the first or second line of your code. That tells Python the source file is UTF-8, so accented characters typed directly into the code are read correctly.
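
    As a minimal sketch of getting the accented text back from the bytes in the question (assuming Python 3, where the page content arrives as bytes, and assuming the page is UTF-8 encoded, as the byte values suggest), decoding is enough:

```python
# Bytes as shown in the question (assumed to be UTF-8 encoded).
raw = b'Terme \xc4\x8cate\xc5\xbe'

# Decode the raw bytes into a text string with the accents intact.
text = raw.decode('utf-8')
print(text)
```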

    What you're trying to do is possible with the module unicodedata.

    Code:
    import unicodedata
    t='\xc5\xbe'          # UTF-8 bytes of the accented letter
    u=t.decode('utf-8')   # decode to a unicode string
    unicodedata.decomposition(u)
    returns '007A 030C'. The former is the Unicode hex code for the letter you're looking for: 0x007A = 122, and unichr(122) = u'z' (or chr(122) = 'z').
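
    To do the same for a whole string rather than one character, one possible sketch (assuming Python 3; the helper name strip_accents is made up for illustration) normalizes the text to NFKD, which splits each accented letter into a base letter plus combining marks, and then drops the combining marks:

```python
import unicodedata

def strip_accents(text):
    """Hypothetical helper: return text with combining accent marks removed."""
    # NFKD splits e.g. 'ž' into 'z' followed by U+030C (combining caron).
    decomposed = unicodedata.normalize('NFKD', text)
    # Keep every character that is not a combining mark.
    return ''.join(ch for ch in decomposed if not unicodedata.combining(ch))

raw = b'Terme \xc4\x8cate\xc5\xbe'  # bytes from the question, assumed UTF-8
print(strip_accents(raw.decode('utf-8')))  # Terme Catez
```

    Letters that have no decomposition, such as 'ø' or 'ł', pass through unchanged with this approach.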

    That should be enough to get you going. I'm afraid I don't use this much, but let us know how you get on.


    • s2krish
      New Member
      • Jun 2010
      • 2

      #3
      Hi,

      Thanks for your reply. Let me elaborate on the problem:

      I used the urllib module to open and read the website. The script looks like:
      import urllib
      txt = urllib.urlopen("http://www.terme-catez.si").read()
      txt

      which gives a result like the one below:
      (an earlier portion of the output is skipped)
      Code:
      <div class="noga">\r\n    <p>\r\n      Vse gradivo\r\n      &copy; 1999-\
      r\n      2010\r\n      <a href="http://www.terme-catez.si" target="_blank">Terme
       \xc4\x8cate\xc5\xbe</a>\r\n      Slovenija\r\n      <br />\r\n      Spletne re\
      xc5\xa1itve\r\n      &copy; 1996-\r\n      2010\r\n      <a href="http://www.tme
      dia.biz" target="_blank">(T)media</a></p>\r\n  </div>\r\n</div>\r\n<div class="o
      If you look at the output above, the accented characters appear as:
      Terme \xc4\x8cate\xc5\xbe (the original is Terme Čatež).

      However, I want to convert Terme Čatež to Terme Catez. So byte sequences like \xc4\x8c or \xc5\xbe should be converted into unaccented characters.

      Is there any way to replace all such sequences with unaccented characters?

      Thanks
      Last edited by Dormilich; Jun 8 '10, 04:05 PM. Reason: Please use [code] tags when posting code
