ascii to latin1

  • Luis P. Mendes

    ascii to latin1


    Hi,

    I'm developing a Django-based intranet web server that has a search page.

    Data contained in the database is mixed. Some of the words are
    accented, some are not but they should be. This is because the
    collection of data began a long time ago when ascii was the only way to go.

    The problem is that users have to search more than once for some words,
    because the searched word may or may not be accented. If we consider
    that some expressions can have several letters that could be accented, the
    search effort is too much.

    I've searched the net for some kind of solution but couldn't find one;
    I've only found solutions for the opposite direction.

    example:
    if the word searched for is 'televisão', I want a search for either
    'televisao', 'televisão' or even 'télévisao' (this last one doesn't
    exist in Portuguese) to be successful.

    So, instead of only one search, there will be several used.

    Is there anything already coded, or will I have to try to do it all by
    myself?


    Luis P. Mendes
  • Robert Kern

    #2
    Re: ascii to latin1

    Luis P. Mendes wrote:
    [color=blue]
    > example:
    > if the word searched is 'televisão', I want that a search by either
    > 'televisao', 'televisão' or even 'télévisao' (this last one doesn't
    > exist in Portuguese) is successful.[/color]

    The ICU library has the capability to transliterate strings via certain
    rulesets. One such ruleset would transliterate all of the above to 'televisao'.
    That transliteration could act as a normalization step akin to stemming.

    There are one or two Python bindings out there. Google for PyICU. I don't recall
    if it exposes the transliteration API or not.

    --
    Robert Kern

    "I have come to believe that the whole world is an enigma, a harmless enigma
    that is made terrible by our own mad attempt to interpret it as though it had
    an underlying truth."
    -- Umberto Eco


    • Rene Pijlman

      #3
      Re: ascii to latin1

      Luis P. Mendes:[color=blue]
      >I'm developing a django based intranet web server that has a search page.
      >
      >Data contained in the database is mixed. Some of the words are
      >accented, some are not but they should be. This is because the
      >collection of data began a long time ago when ascii was the only way to go.
      >
      >The problem is users have to search more than once for some word,
      >because the searched word can be or not be accented. If we consider
      >that some expressions can have several letters that can be accented, the
      >search effort is too much.[/color]

      I guess the best solution is to index all data in ASCII. That is, convert
      each field to ASCII (mapping every accented character to its unaccented
      base letter) and index that.

      Then, on a search, you also need to unaccent the search phrase, and match
      it against the asciified index.
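
A minimal sketch of this fold-the-index-and-the-query approach (Python 3 syntax; the `fold` helper and the sample records are illustrative, not from the thread):

```python
import unicodedata

def fold(s):
    # Decompose accented characters, then drop the combining marks.
    decomposed = unicodedata.normalize("NFD", s)
    return ''.join(c for c in decomposed if not unicodedata.combining(c))

# Index records by their folded, lowercased form.
records = ["televisão", "DAÇÃO", "radio"]
index = {}
for r in records:
    index.setdefault(fold(r).lower(), []).append(r)

def search(query):
    # Fold the query the same way before looking it up.
    return index.get(fold(query).lower(), [])

print(search("televisao"))   # matches the accented "televisão"
print(search("télévisao"))   # folds to the same key, same hit
```

Both the accented and unaccented spellings collapse to one index key, so a single lookup replaces the multiple searches described above.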

      --
      René Pijlman


      • Serge Orlov

        #4
        Re: ascii to latin1

        Luis P. Mendes wrote:[color=blue]
        > Hi,
        >
        > I'm developing a django based intranet web server that has a search page.
        >
        > Data contained in the database is mixed. Some of the words are
        > accented, some are not but they should be. This is because the
        > collection of data began a long time ago when ascii was the only way to go.
        >
        > The problem is users have to search more than once for some word,
        > because the searched word can be or not be accented. If we consider
        > that some expressions can have several letters that can be accented, the
        > search effort is too much.
        >
        > I've searched the net for some kind of solution but couldn't find. I've
        > just found for the opposite.
        >
        > example:
        > if the word searched is 'televisão', I want that a search by either
        > 'televisao', 'televisão' or even 'télévisao' (this last one doesn't
        > exist in Portuguese) is successful.
        >
        > So, instead of only one search, there will be several used.
        >
        > Is there anything already coded, or will I have to try to do it all by
        > myself?[/color]

        You need to convert from latin1 to ascii, not from ascii to latin1. The
        function below does that. Then you need to build the database index not
        on the latin1 text but on the ascii text. After that, convert user input
        to ascii and search.

        import unicodedata

        def search_key(s):
            de_str = unicodedata.normalize("NFD", s)
            return ''.join(cp for cp in de_str if not
                           unicodedata.category(cp).startswith('M'))

        print search_key(u"televisão")
        print search_key(u"télévisao")

        ===== Result:
        televisao
        televisao


        • Richie Hindle

          #5
          Re: ascii to latin1


          [Serge][color=blue]
          > def search_key(s):
          >     de_str = unicodedata.normalize("NFD", s)
          >     return ''.join(cp for cp in de_str if not
          >         unicodedata.category(cp).startswith('M'))[/color]

          Lovely bit of code - thanks for posting it!

          You might want to use "NFKD" to normalize things like LATIN SMALL
          LIGATURE FI and subscript/superscript characters as well as diacritics.
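
A quick illustration of the NFD/NFKD difference described here (Python 3 syntax; the "fish" example is an editorial addition, not from the thread):

```python
import unicodedata

def search_key(s, form="NFD"):
    # Normalize, then strip combining marks (category 'M*'),
    # as in Serge's function; the form is parameterized here.
    de_str = unicodedata.normalize(form, s)
    return ''.join(cp for cp in de_str
                   if not unicodedata.category(cp).startswith('M'))

word = "\ufb01sh"                 # LATIN SMALL LIGATURE FI + "sh"
print(search_key(word, "NFD"))    # ligature survives canonical decomposition
print(search_key(word, "NFKD"))   # compatibility decomposition expands it: 'fish'
```

NFD only performs canonical decomposition, so the ligature is untouched; NFKD additionally applies compatibility mappings and expands it to "fi".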

          --
          Richie Hindle
          richie@entrian.com


          • Luis P. Mendes

            #6
            Re: ascii to latin1


            Richie Hindle escreveu:[color=blue]
            > [Serge][color=green]
            >> def search_key(s):
            >>     de_str = unicodedata.normalize("NFD", s)
            >>     return ''.join(cp for cp in de_str if not
            >>         unicodedata.category(cp).startswith('M'))[/color]
            >
            > Lovely bit of code - thanks for posting it!
            >
            > You might want to use "NFKD" to normalize things like LATIN SMALL
            > LIGATURE FI and subscript/superscript characters as well as diacritics.
            >[/color]

            Thank you very much for your info. It's a very good approach.

            When I used the "NFD" option, I came across many errors on these and
            possibly other codes: \xba, \xc9, \xcd.

            I tried to use "NFKD" instead, and the number of errors was only about
            half a dozen, for a universe of 600000+ names, on code \xbf.

            It looks like I have to do a search and substitute using regular
            expressions for these cases. Or is there a better way to do it?


            Luis P. Mendes


            • Richie Hindle

              #7
              Re: ascii to latin1


              [Luis][color=blue]
              > When I used the "NFD" option, I came across many errors on these and
              > possibly other codes: \xba, \xc9, \xcd.[/color]

              What errors? This works fine for me, printing "Ecoute":

              import unicodedata
              def search_key(s):
                  de_str = unicodedata.normalize("NFD", s)
                  return ''.join([cp for cp in de_str if not
                                  unicodedata.category(cp).startswith('M')])
              print search_key(u"\xc9coute")

              Are you using unicode code point \xc9, or is that a byte in some
              encoding? Which encoding?

              --
              Richie


              • Serge Orlov

                #8
                Re: ascii to latin1

                Richie Hindle wrote:[color=blue]
                > [Serge][color=green]
                > > def search_key(s):
                > >     de_str = unicodedata.normalize("NFD", s)
                > >     return ''.join(cp for cp in de_str if not
                > >         unicodedata.category(cp).startswith('M'))[/color]
                >
                > Lovely bit of code - thanks for posting it![/color]

                Well, it is not so good. Please read my next message to Luis.
                [color=blue]
                >
                > You might want to use "NFKD" to normalize things like LATIN SMALL
                > LIGATURE FI and subscript/superscript characters as well as diacritics.[/color]

                IMHO, it is perfectly acceptable to declare that you don't
                interpret those symbols. After all, they are called
                *compatibility* code points. I tried the "a quarter" symbol:
                Google and MSN don't interpret it. Yahoo doesn't support it at
                all.

                The NFKD form is also more tricky to use. It loses the
                semantics of characters: for example, if you have the character
                "digit two" followed by "superscript digit two", they look like
                2 to the power 2, but NFKD will convert them into 22
                (twenty-two), which is wrong. So if you want to use NFKD for
                search you will have to preprocess your data, for example by
                inserting a space between the twos.
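
Serge's 2² example can be reproduced directly (Python 3 syntax; an editorial illustration):

```python
import unicodedata

# "digit two" followed by SUPERSCRIPT TWO renders as 2², i.e. 2 to the power 2.
s = "2\u00b2"
flat = unicodedata.normalize("NFKD", s)
print(flat)   # the superscript decomposes to a plain '2', leaving '22'
```

The exponent reading is lost entirely, which is why he suggests preprocessing such data before applying NFKD.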


                • Serge Orlov

                  #9
                  Re: ascii to latin1

                  Luis P. Mendes wrote:[color=blue]
                  > Richie Hindle escreveu:[color=green]
                  > > [Serge][color=darkred]
                  > >> def search_key(s):
                  > >>     de_str = unicodedata.normalize("NFD", s)
                  > >>     return ''.join(cp for cp in de_str if not
                  > >>         unicodedata.category(cp).startswith('M'))[/color]
                  > >
                  > > Lovely bit of code - thanks for posting it!
                  > >
                  > > You might want to use "NFKD" to normalize things like LATIN SMALL
                  > > LIGATURE FI and subscript/superscript characters as well as diacritics.
                  > >[/color]
                  >
                  > Thank you very much for your info. It's a very good aproach.
                  >
                  > When I used the "NFD" option, I came across many errors on these and
                  > possibly other codes: \xba, \xc9, \xcd.[/color]

                  What errors? The normalize method is not supposed to raise
                  any errors. You mean it doesn't work as expected? Well, I
                  have to admit that using normalize is a far from perfect way
                  to implement search. The most advanced algorithm is published
                  by the Unicode people:
                  <http://www.unicode.org/reports/tr10/> If you read it you'll
                  understand it's not so easy.
                  [color=blue]
                  >
                  > I tried to use "NFKD" instead, and the number of errors was only about
                  > half a dozen, for a universe of 600000+ names, on code \xbf.
                  > It looks like I have to do a search and substitute using regular
                  > expressions for these cases. Or is there a better way to do it?[/color]

                  Perhaps you can use unicode translate method to map the characters that
                  still give you problems to whatever you want.
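
The translate method Serge mentions can be sketched like this (Python 3 syntax, where the translation table is keyed on code points via ord(); the characters chosen below are illustrative, not from the thread):

```python
# Map leftover non-ASCII characters to a replacement, or drop them with None.
table = {
    ord("º"): None,   # drop MASCULINE ORDINAL INDICATOR (\xba)
    ord("¿"): None,   # drop INVERTED QUESTION MARK (\xbf)
}
print("DA 1º DE MO Nº 2".translate(table))   # 'DA 1 DE MO N 2'
```

Any character not in the table passes through unchanged, so the table only needs entries for the handful of problem characters.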


                  • Richie Hindle

                    #10
                    Re: ascii to latin1


                    [Serge][color=blue]
                    > I have to admit that using
                    > normalize is a far from perfect way to implement search. The most
                    > advanced algorithm is published by Unicode guys:
                    > <http://www.unicode.org/reports/tr10/> If you read it you'll understand
                    > it's not so easy.[/color]

                    I only have to look at the length of the document to understand it's not
                    so easy. 8-) I'll take your two-line normalization function any day.
                    [color=blue]
                    > IMHO, it is perfectly acceptable to declare that you don't interpret those
                    > symbols. After all, they are called *compatibility* code points. I
                    > tried the "a quarter" symbol: Google and MSN don't interpret it. Yahoo
                    > doesn't support it at all. [...]
                    > if you have the character "digit two" followed by "superscript
                    > digit two"; they look like 2 power 2, but NFKD will convert them into
                    > 22 (twenty-two), which is wrong. So if you want to use NFKD for search
                    > you will have to preprocess your data, for example inserting a space
                    > between the twos.[/color]

                    I'm not sure it's obvious that it's wrong. How might a user enter
                    "2<superscript digit 2>" into a search box? They might enter a genuine
                    "<superscript digit 2>" in which case you're fine, or they might enter
                    "2^2" in which case it depends how you deal with punctuation. They
                    probably won't enter "2 2".

                    It's certainly not wrong in the case of ligatures like LATIN SMALL
                    LIGATURE FI - it's quite likely that the user will search for "fish"
                    rather than finding and (somehow) typing the ligature.

                    Some superscripts are similar - I imagine there's a code point for the
                    "superscript st" in "1st" (though I can't find it offhand) and you'd
                    definitely want to convert that to "st".

                    NFKD normalization doesn't convert VULGAR FRACTION ONE QUARTER into
                    "1/4" - I wonder whether there's some way to do that?
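
As an aside on the vulgar-fraction point, unicodedata can at least recover the numeric value, even though NFKD decomposes it with FRACTION SLASH (U+2044) rather than an ASCII "/" (Python 3 syntax; the numeric() call is an editorial addition, not something from the thread):

```python
import unicodedata

quarter = "\u00bc"                             # VULGAR FRACTION ONE QUARTER
print(unicodedata.numeric(quarter))            # 0.25
# NFKD yields "1" + FRACTION SLASH (U+2044) + "4", not an ASCII "1/4".
print(unicodedata.normalize("NFKD", quarter))
```

Getting a literal "1/4" would still require an explicit mapping of the fraction slash to "/".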
                    [color=blue]
                    > After all they are called *compatibility* code points.[/color]

                    Yes, compatible with what the user types. 8-)

                    --
                    Richie Hindle
                    richie@entrian.com


                    • Luis P. Mendes

                      #11
                      Re: ascii to latin1

                      [color=blue][color=green]
                      >> When I used the "NFD" option, I came across many errors on these and
                      >> possibly other codes: \xba, \xc9, \xcd.[/color]
                      >
                      > What errors? normalize method is not supposed to give any errors. You
                      > mean it doesn't work as expected? Well, I have to admit that using
                      > normalize is a far from perfect way to implement search. The most
                      > advanced algorithm is published by Unicode guys:
                      > <http://www.unicode.org/reports/tr10/> If you read it you'll understand
                      > it's not so easy.
                      >[color=green]
                      >> I tried to use "NFKD" instead, and the number of errors was only about
                      >> half a dozen, for a universe of 600000+ names, on code \xbf.
                      >> It looks like I have to do a search and substitute using regular
                      >> expressions for these cases. Or is there a better way to do it?[/color]
                      >
                      > Perhaps you can use unicode translate method to map the characters that
                      > still give you problems to whatever you want.
                      >[/color]

                      Errors occur when I assign the result of ''.join(cp for cp in de_str if
                      not unicodedata.category(cp).startswith('M')) to a variable. The same
                      happens with de_str. When I print the strings everything is ok.

                      Here's a short example of data:
                      115448,DAÇÃO
                      117788,DA 1º DE MO Nº 2

                      I used the following script to convert the data:
                      # -*- coding: iso8859-15 -*-

                      class Latin1ToAscii:

                          def abreFicheiro(self):
                              import csv
                              self.reader = csv.reader(open(self.input_file, "rb"))

                          def converter(self):
                              import unicodedata
                              self.lista_csv = []
                              for row in self.reader:
                                  s = unicode(row[1], "latin-1")
                                  de_str = unicodedata.normalize("NFD", s)
                                  nome = ''.join(cp for cp in de_str if not \
                                      unicodedata.category(cp).startswith('M'))

                                  linha_ascii = row[0] + "," + nome # *
                                  print linha_ascii.encode("ascii")
                                  self.lista_csv.append(linha_ascii)

                          def __init__(self):
                              self.input_file = 'nome_latin1.csv'
                              self.output_file = 'nome_ascii.csv'

                      if __name__ == "__main__":
                          f = Latin1ToAscii()
                          f.abreFicheiro()
                          f.converter()

                      And I got the following result:
                      $ python latin1_to_ascii.py
                      115448,DACAO
                      Traceback (most recent call last):
                        File "latin1_to_ascii.py", line 44, in ?
                          f.converter()
                        File "latin1_to_ascii.py", line 22, in converter
                          print linha_ascii.encode("ascii")
                      UnicodeEncodeError: 'ascii' codec can't encode character u'\xba' in
                      position 11: ordinal not in range(128)


                      The script converted the ÇÃ from the first line, but not the º from the
                      second one. Also, at the line marked *, I don't get a list like
                      [115448,DAÇÃO] but a [u'115448,DAÇÃO'] element, which doesn't suit my
                      needs.

                      Would you mind telling me what should I change?


                      Luis P. Mendes


                      • Peter Otten

                        #12
                        Re: ascii to latin1

                        Luis P. Mendes wrote:
                        [color=blue]
                        > The script converted the ÇÃ from the first line, but not the º from the
                        > second one. Still in *, I also don't get a list as [115448,DAÇÃO] but a
                        > [u'115448,DAÇÃO'] element, which doesn't suit my needs.
                        >
                        > Would you mind telling me what should I change?[/color]

                        Sometimes you are faster if you take the gloves off. Just write the
                        translation table with the desired substitute for every non-ascii
                        character in the latin1 charset by hand and be done with it.

                        Cyril Kyree



                        • richie@entrian.com

                          #13
                          Re: ascii to latin1

                          [Luis][color=blue]
                          > The script converted the ÇÃ from the first line, but not the º from
                          > the second one.[/color]

                          That's because º, 0xba, MASCULINE ORDINAL INDICATOR, is classed as a
                          letter and not a diacritic.

                          You can't encode it in ascii because it's not an ascii character, and
                          the script doesn't remove it because it only removes diacritics.

                          I don't know what the best thing to do with it would be - could you use
                          latin-1 as your base encoding and leave it in there? I don't speak any
                          language that uses it, but I'd guess that anyone searching for eg. 5º
                          (forgive me if I have the gender wrong 8-) would actually type 5º -
                          are there any Italian/Spanish/Portuguese speakers here who can confirm
                          or deny that?

                          In the general case, you have to decide what happens to characters that
                          aren't diacritics and don't live in your base encoding - what happens
                          when a Chinese user searches for a Chinese character? Probably you
                          should just encode(base_encoding, 'ignore').
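
The encode-with-error-handler idea can be sketched like this (Python 3 syntax; the sample string is an editorial addition):

```python
text = "DA 1º DE MO Nº 2"
# 'ignore' silently drops anything the target encoding can't represent.
print(text.encode("ascii", "ignore").decode("ascii"))   # 'DA 1 DE MO N 2'
# 'replace' substitutes '?' so the loss stays visible.
print(text.encode("ascii", "replace").decode("ascii"))  # 'DA 1? DE MO N? 2'
```

Whether to drop or visibly replace unrepresentable characters depends on whether the output feeds an index (drop) or a display (replace).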

                          --
                          Richie Hindle
                          richie@entrian.com


                          • Serge Orlov

                            #14
                            Re: ascii to latin1

                            Luis P. Mendes wrote:[color=blue]
                            > Errors occur when I assign the result of ''.join(cp for cp in de_str if
                            > not unicodedata.category(cp).startswith('M')) to a variable. The same
                            > happens with de_str. When I print the strings everything is ok.
                            >
                            > Here's a short example of data:
                            > 115448,DAÇÃO
                            > 117788,DA 1º DE MO Nº 2
                            >
                            > I used the following script to convert the data:
                            > # -*- coding: iso8859-15 -*-
                            >
                            > class Latin1ToAscii:
                            >
                            >     def abreFicheiro(self):
                            >         import csv
                            >         self.reader = csv.reader(open(self.input_file, "rb"))
                            >
                            >     def converter(self):
                            >         import unicodedata
                            >         self.lista_csv = []
                            >         for row in self.reader:
                            >             s = unicode(row[1], "latin-1")
                            >             de_str = unicodedata.normalize("NFD", s)
                            >             nome = ''.join(cp for cp in de_str if not \
                            >                 unicodedata.category(cp).startswith('M'))
                            >
                            >             linha_ascii = row[0] + "," + nome # *
                            >             print linha_ascii.encode("ascii")
                            >             self.lista_csv.append(linha_ascii)
                            >
                            >     def __init__(self):
                            >         self.input_file = 'nome_latin1.csv'
                            >         self.output_file = 'nome_ascii.csv'
                            >
                            > if __name__ == "__main__":
                            >     f = Latin1ToAscii()
                            >     f.abreFicheiro()
                            >     f.converter()
                            >
                            >
                            > And I got the following result:
                            > $ python latin1_to_ascii.py
                            > 115448,DACAO
                            > Traceback (most recent call last):
                            >   File "latin1_to_ascii.py", line 44, in ?
                            >     f.converter()
                            >   File "latin1_to_ascii.py", line 22, in converter
                            >     print linha_ascii.encode("ascii")
                            > UnicodeEncodeError: 'ascii' codec can't encode character u'\xba' in
                            > position 11: ordinal not in range(128)
                            >
                            >
                            > The script converted the ÇÃ from the first line, but not the º from the
                            > second one. Still in *, I also don't get a list as [115448,DAÇÃO] but a
                            > [u'115448,DAÇÃO'] element, which doesn't suit my needs.
                            >
                            > Would you mind telling me what should I change?[/color]

                            Calling this process "latin1 to ascii" was a misnomer, sorry that I
                            used this phrase. It should be called "latin1 to search key"; there is
                            no requirement that the key must be ascii, so change the corresponding
                            lines in your code:

                            linha_key = row[0] + "," + nome
                            print linha_key
                            self.lista_csv.append(linha_key.encode("latin-1"))

                            With regard to º, Richie already gave you food for thought: if you
                            want "1 DE MO" to match "1º DE MO", remove that symbol from the key
                            (linha_key = linha_key.translate({ord(u"º"): None})); if you don't
                            want such fuzzy matching, keep it.


                            • Luis P. Mendes

                              #15
                              Re: ascii to latin1


                              [color=blue]
                              >
                              > With regard to º, Richie already gave you food for thought: if you
                              > want "1 DE MO" to match "1º DE MO", remove that symbol from the key
                              > (linha_key = linha_key.translate({ord(u"º"): None})); if you don't
                              > want such fuzzy matching, keep it.
                              >[/color]
                              Thank you all for your help.

                              That was what I did. That symbol 'º' is not needed for the field.

                              It's working fine, now.


                              Luis P. Mendes

