accented characters to unaccented

**Glenton** · Jun 8 '10, 03:02 AM

This doesn't answer your question, but you could include

Code:

# -*- coding: utf-8 -*-

on the first or second line of your script. That tells Python the source file is UTF-8, so you can type accented characters directly in your string literals.

What you're trying to do is possible with the module unicodedata.

Code:

import unicodedata

t = '\xc5\xbe'          # UTF-8 bytes for 'ž' (a Python 2 str)
u = t.decode('utf-8')   # now a unicode object, u'\u017e'
unicodedata.decomposition(u)

returns '007A 030C'. The first value is the Unicode code point (in hex) of the base letter you're looking for; the second, 030C, is the combining caron. 0x7A = 122, and unichr(122) = u'z'. Or chr(122) = 'z'.
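To stretch that idea from one character to a whole string, here is a minimal sketch. It assumes Python 3 (where str is already Unicode and chr replaces unichr), and the helper name base_letter is made up for illustration.

```python
import unicodedata

def base_letter(ch):
    """Return the base character of ch by following canonical decompositions."""
    decomp = unicodedata.decomposition(ch)
    # Leave alone characters with no decomposition, and compatibility
    # mappings such as '<compat> 0020 0308', which start with a tag.
    if not decomp or decomp.startswith('<'):
        return ch
    first = chr(int(decomp.split()[0], 16))
    return base_letter(first)  # repeat in case the base itself decomposes

text = b'\xc5\xbe'.decode('utf-8')      # 'ž'
print(unicodedata.decomposition(text))  # 007A 030C
print(''.join(base_letter(c) for c in 'Terme Čatež'))  # Terme Catez
```

Characters without a decomposition come back unchanged, so plain ASCII text passes straight through.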

That should be enough to get you going. I'm afraid I don't use this much, but let us know how you get on.

**s2krish** · Jun 8 '10, 05:17 AM

Hi,

Thanks for your reply. Let me elaborate on the problem.

I used the urllib module to open and read a web site. The script looks like this:

Code:

import urllib
txt = urllib.urlopen("http://www.terme-catez.si").read()
txt

This gives a result like the one below (part of the output is skipped):

Code:

<div class="noga">\r\n    <p>\r\n      Vse gradivo\r\n      &copy; 1999-\r\n      2010\r\n      <a href="http://www.terme-catez.si" target="_blank">Terme \xc4\x8cate\xc5\xbe</a>\r\n      Slovenija\r\n      <br />\r\n      Spletne re\xc5\xa1itve\r\n      &copy; 1996-\r\n      2010\r\n      <a href="http://www.tmedia.biz" target="_blank">(T)media</a></p>\r\n  </div>\r\n</div>\r\n<div class="o

As you can see in the output above, the accented characters look like this:
Terme \xc4\x8cate\xc5\xbe (the original is Terme Čatež).

However, I want Terme Čatež to become Terme Catez. So byte sequences like \xc4\x8c or \xc5\xbe should be converted into unaccented characters.

Is there any way to replace all such sequences with unaccented characters?

Thanks
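
One possible way to do this for a whole page, sketched under two assumptions: Python 3, and that the bytes are UTF-8, as the escapes in the output above suggest. The raw value is a short piece of that output rather than a live download, and strip_accents is an illustrative name, not a library function.

```python
import unicodedata

def strip_accents(raw_bytes, encoding='utf-8'):
    """Decode bytes, then drop the combining marks left by NFKD normalization."""
    text = raw_bytes.decode(encoding)
    # NFKD splits 'ž' into 'z' followed by a combining caron (U+030C).
    decomposed = unicodedata.normalize('NFKD', text)
    # combining() is non-zero for accent marks, so those are filtered out.
    return ''.join(c for c in decomposed if not unicodedata.combining(c))

raw = b'Terme \xc4\x8cate\xc5\xbe</a>\r\n      Spletne re\xc5\xa1itve'
print(strip_accents(raw))
```

Letters that have no decomposition, such as 'ø' or 'đ', pass through unchanged and would need an explicit mapping table.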
