ascii to latin1

  • Luis P. Mendes

    ascii to latin1


    Hi,

    I'm developing a Django-based intranet web server that has a search page.

    Data contained in the database is mixed. Some of the words are
    accented, some are not but they should be. This is because the
    collection of data began a long time ago when ascii was the only way to go.

    The problem is that users have to search more than once for some words,
    because the searched word may or may not be accented. If we consider
    that some expressions can have several letters that could be accented, the
    search effort is too much.

    I've searched the net for some kind of solution but couldn't find one;
    I've only found solutions for the opposite direction.

    example:
    if the word searched for is 'televisão', I want a search for either
    'televisao', 'televisão' or even 'télévisao' (this last one doesn't
    exist in Portuguese) to be successful.

    So, instead of only one search, there will be several used.

    Is there anything already coded, or will I have to try to do it all by
    myself?


    Luis P. Mendes
  • Robert Kern

    #2
    Re: ascii to latin1

    Luis P. Mendes wrote:
    [color=blue]
    > example:
    > if the word searched is 'televisão', I want that a search by either
    > 'televisao', 'televisão' or even 'télévisao' (this last one doesn't
    > exist in Portuguese) is successful.[/color]

    The ICU library has the capability to transliterate strings via certain
    rulesets. One such ruleset would transliterate all of the above to 'televisao'.
    That transliteration could act as a normalization step akin to stemming.

    There are one or two Python bindings out there. Google for PyICU. I don't recall
    if it exposes the transliteration API or not.

    --
    Robert Kern

    "I have come to believe that the whole world is an enigma, a harmless enigma
    that is made terrible by our own mad attempt to interpret it as though it had
    an underlying truth."
    -- Umberto Eco


    • Rene Pijlman

      #3
      Re: ascii to latin1

      Luis P. Mendes:[color=blue]
      >I'm developing a django based intranet web server that has a search page.
      >
      >Data contained in the database is mixed. Some of the words are
      >accented, some are not but they should be. This is because the
      >collection of data began a long time ago when ascii was the only way to go.
      >
      >The problem is users have to search more than once for some word,
      >because the searched word can be or not be accented. If we consider
      >that some expressions can have several letters that can be accented, the
      >search effort is too much.[/color]

      I guess the best solution is to index all data in ASCII. That is, convert
      each field to ASCII (mapping every accented character to its unaccented
      base letter) and index that.

      Then, on a search, you also need to unaccent the search phrase, and match
      it against the asciified index.
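
A minimal sketch of this fold-the-index-and-the-query approach (Python 3 syntax; the `fold` helper and the sample records are illustrative, not from the thread):

```python
import unicodedata

def fold(s):
    # Decompose accented characters, then drop the combining marks.
    decomposed = unicodedata.normalize("NFD", s)
    return ''.join(c for c in decomposed if not unicodedata.combining(c))

# Index records by their folded, lowercased form.
records = ["televisão", "DAÇÃO", "radio"]
index = {}
for r in records:
    index.setdefault(fold(r).lower(), []).append(r)

def search(query):
    # Fold the query the same way before looking it up.
    return index.get(fold(query).lower(), [])

print(search("televisao"))   # matches the accented "televisão"
print(search("télévisao"))   # folds to the same key, same hit
```

Both the accented and unaccented spellings collapse to one index key, so a single lookup replaces the multiple searches described above.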

      --
      René Pijlman


      • Serge Orlov

        #4
        Re: ascii to latin1

        Luis P. Mendes wrote:[color=blue]
        > Hi,
        >
        > I'm developing a django based intranet web server that has a search page.
        >
        > Data contained in the database is mixed. Some of the words are
        > accented, some are not but they should be. This is because the
        > collection of data began a long time ago when ascii was the only way to go.
        >
        > The problem is users have to search more than once for some word,
        > because the searched word can be or not be accented. If we consider
        > that some expressions can have several letters that can be accented, the
        > search effort is too much.
        >
        > I've searched the net for some kind of solution but couldn't find. I've
        > just found for the opposite.
        >
        > example:
        > if the word searched is 'televisão', I want that a search by either
        > 'televisao', 'televisão' or even 'télévisao' (this last one doesn't
        > exist in Portuguese) is successful.
        >
        > So, instead of only one search, there will be several used.
        >
        > Is there anything already coded, or will I have to try to do it all by
        > myself?[/color]

        You need to convert from latin1 to ascii, not from ascii to latin1. The
        function below does that. Then you need to build the database index not
        on the latin1 text but on the ascii text. After that, convert user input
        to ascii and search.

        import unicodedata

        def search_key(s):
            de_str = unicodedata.normalize("NFD", s)
            return ''.join(cp for cp in de_str if not
                           unicodedata.category(cp).startswith('M'))

        print search_key(u"televisão")
        print search_key(u"télévisao")

        ===== Result:
        televisao
        televisao


        • Richie Hindle

          #5
          Re: ascii to latin1


          [Serge][color=blue]
          > def search_key(s):
          >     de_str = unicodedata.normalize("NFD", s)
          >     return ''.join(cp for cp in de_str if not
          >         unicodedata.category(cp).startswith('M'))[/color]

          Lovely bit of code - thanks for posting it!

          You might want to use "NFKD" to normalize things like LATIN SMALL
          LIGATURE FI and subscript/superscript characters as well as diacritics.
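
A quick illustration of the NFD/NFKD difference described here (Python 3 syntax; the "fish" example is an editorial addition, not from the thread):

```python
import unicodedata

def search_key(s, form="NFD"):
    # Normalize, then strip combining marks (category 'M*'),
    # as in Serge's function; the form is parameterized here.
    de_str = unicodedata.normalize(form, s)
    return ''.join(cp for cp in de_str
                   if not unicodedata.category(cp).startswith('M'))

word = "\ufb01sh"                 # LATIN SMALL LIGATURE FI + "sh"
print(search_key(word, "NFD"))    # ligature survives canonical decomposition
print(search_key(word, "NFKD"))   # compatibility decomposition expands it: 'fish'
```

NFD only performs canonical decomposition, so the ligature is untouched; NFKD additionally applies compatibility mappings and expands it to "fi".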

          --
          Richie Hindle
          richie@entrian.com


          • Luis P. Mendes

            #6
            Re: ascii to latin1


            Richie Hindle escreveu:[color=blue]
            > [Serge][color=green]
            >> def search_key(s):
            >>     de_str = unicodedata.normalize("NFD", s)
            >>     return ''.join(cp for cp in de_str if not
            >>         unicodedata.category(cp).startswith('M'))[/color]
            >
            > Lovely bit of code - thanks for posting it!
            >
            > You might want to use "NFKD" to normalize things like LATIN SMALL
            > LIGATURE FI and subscript/superscript characters as well as diacritics.
            >[/color]

            Thank you very much for your info. It's a very good approach.

            When I used the "NFD" option, I came across many errors on these and
            possibly other codes: \xba, \xc9, \xcd.

            I tried to use "NFKD" instead, and the number of errors was only about
            half a dozen, for a universe of 600000+ names, on code \xbf.

            It looks like I have to do a search and substitute using regular
            expressions for these cases. Or is there a better way to do it?


            Luis P. Mendes


            • Richie Hindle

              #7
              Re: ascii to latin1


              [Luis][color=blue]
              > When I used the "NFD" option, I came across many errors on these and
              > possibly other codes: \xba, \xc9, \xcd.[/color]

              What errors? This works fine for me, printing "Ecoute":

              import unicodedata
              def search_key(s):
                  de_str = unicodedata.normalize("NFD", s)
                  return ''.join([cp for cp in de_str if not
                                  unicodedata.category(cp).startswith('M')])
              print search_key(u"\xc9coute")

              Are you using unicode code point \xc9, or is that a byte in some
              encoding? Which encoding?

              --
              Richie


              • Serge Orlov

                #8
                Re: ascii to latin1

                Richie Hindle wrote:[color=blue]
                > [Serge][color=green]
                > > def search_key(s):
                > >     de_str = unicodedata.normalize("NFD", s)
                > >     return ''.join(cp for cp in de_str if not
                > >         unicodedata.category(cp).startswith('M'))[/color]
                >
                > Lovely bit of code - thanks for posting it![/color]

                Well, it is not so good. Please read my next message to Luis.
                [color=blue]
                >
                > You might want to use "NFKD" to normalize things like LATIN SMALL
                > LIGATURE FI and subscript/superscript characters as well as diacritics.[/color]

                IMHO, it is perfectly acceptable to declare that you don't
                interpret those symbols. After all, they are called
                *compatibility* code points. I tried the "a quarter" symbol:
                Google and MSN don't interpret it. Yahoo doesn't support it at
                all.

                The NFKD form is also more tricky to use. It loses the
                semantics of characters: for example, if you have the character
                "digit two" followed by "superscript digit two", they look like
                2 to the power 2, but NFKD will convert them into 22
                (twenty-two), which is wrong. So if you want to use NFKD for
                search you will have to preprocess your data, for example by
                inserting a space between the twos.
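
Serge's 2² example can be reproduced directly (Python 3 syntax; an editorial illustration):

```python
import unicodedata

# "digit two" followed by SUPERSCRIPT TWO renders as 2², i.e. 2 to the power 2.
s = "2\u00b2"
flat = unicodedata.normalize("NFKD", s)
print(flat)   # the superscript decomposes to a plain '2', leaving '22'
```

The exponent reading is lost entirely, which is why he suggests preprocessing such data before applying NFKD.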


                • Serge Orlov

                  #9
                  Re: ascii to latin1

                  Luis P. Mendes wrote:[color=blue]
                  > Richie Hindle escreveu:[color=green]
                  > > [Serge][color=darkred]
                  > >> def search_key(s):
                  > >>     de_str = unicodedata.normalize("NFD", s)
                  > >>     return ''.join(cp for cp in de_str if not
                  > >>         unicodedata.category(cp).startswith('M'))[/color]
                  > >
                  > > Lovely bit of code - thanks for posting it!
                  > >
                  > > You might want to use "NFKD" to normalize things like LATIN SMALL
                  > > LIGATURE FI and subscript/superscript characters as well as diacritics.
                  > >[/color]
                  >
                  > Thank you very much for your info. It's a very good aproach.
                  >
                  > When I used the "NFD" option, I came across many errors on these and
                  > possibly other codes: \xba, \xc9, \xcd.[/color]

                  What errors? The normalize method is not supposed to raise
                  any errors. You mean it doesn't work as expected? Well, I
                  have to admit that using normalize is a far from perfect way
                  to implement search. The most advanced algorithm is published
                  by the Unicode people:
                  <http://www.unicode.org/reports/tr10/> If you read it you'll
                  understand it's not so easy.
                  [color=blue]
                  >
                  > I tried to use "NFKD" instead, and the number of errors was only about
                  > half a dozen, for a universe of 600000+ names, on code \xbf.
                  > It looks like I have to do a search and substitute using regular
                  > expressions for these cases. Or is there a better way to do it?[/color]

                  Perhaps you can use unicode translate method to map the characters that
                  still give you problems to whatever you want.
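
The translate method Serge mentions can be sketched like this (Python 3 syntax, where the translation table is keyed on code points via ord(); the characters chosen below are illustrative, not from the thread):

```python
# Map leftover non-ASCII characters to a replacement, or drop them with None.
table = {
    ord("º"): None,   # drop MASCULINE ORDINAL INDICATOR (\xba)
    ord("¿"): None,   # drop INVERTED QUESTION MARK (\xbf)
}
print("DA 1º DE MO Nº 2".translate(table))   # 'DA 1 DE MO N 2'
```

Any character not in the table passes through unchanged, so the table only needs entries for the handful of problem characters.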


                  • Richie Hindle

                    #10
                    Re: ascii to latin1


                    [Serge][color=blue]
                    > I have to admit that using
                    > normalize is a far from perfect way to implement search. The most
                    > advanced algorithm is published by Unicode guys:
                    > <http://www.unicode.org/reports/tr10/> If you read it you'll understand
                    > it's not so easy.[/color]

                    I only have to look at the length of the document to understand it's not
                    so easy. 8-) I'll take your two-line normalization function any day.
                    [color=blue]
                    > IMHO, it is perfectly acceptable to declare that you don't interpret those
                    > symbols. After all, they are called *compatibility* code points. I
                    > tried the "a quarter" symbol: Google and MSN don't interpret it. Yahoo
                    > doesn't support it at all. [...]
                    > if you have the character "digit two" followed by "superscript
                    > digit two"; they look like 2 power 2, but NFKD will convert them into
                    > 22 (twenty-two), which is wrong. So if you want to use NFKD for search
                    > you will have to preprocess your data, for example inserting a space
                    > between the twos.[/color]

                    I'm not sure it's obvious that it's wrong. How might a user enter
                    "2<superscript digit 2>" into a search box? They might enter a genuine
                    "<superscript digit 2>" in which case you're fine, or they might enter
                    "2^2" in which case it depends how you deal with punctuation. They
                    probably won't enter "2 2".

                    It's certainly not wrong in the case of ligatures like LATIN SMALL
                    LIGATURE FI - it's quite likely that the user will search for "fish"
                    rather than finding and (somehow) typing the ligature.

                    Some superscripts are similar - I imagine there's a code point for the
                    "superscript st" in "1st" (though I can't find it offhand) and you'd
                    definitely want to convert that to "st".

                    NFKD normalization doesn't convert VULGAR FRACTION ONE QUARTER into
                    "1/4" - I wonder whether there's some way to do that?
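
As an aside on the vulgar-fraction point, unicodedata can at least recover the numeric value, even though NFKD decomposes it with FRACTION SLASH (U+2044) rather than an ASCII "/" (Python 3 syntax; the numeric() call is an editorial addition, not something from the thread):

```python
import unicodedata

quarter = "\u00bc"                             # VULGAR FRACTION ONE QUARTER
print(unicodedata.numeric(quarter))            # 0.25
# NFKD yields "1" + FRACTION SLASH (U+2044) + "4", not an ASCII "1/4".
print(unicodedata.normalize("NFKD", quarter))
```

Getting a literal "1/4" would still require an explicit mapping of the fraction slash to "/".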
                    [color=blue]
                    > After all they are called *compatibility* code points.[/color]

                    Yes, compatible with what the user types. 8-)

                    --
                    Richie Hindle
                    richie@entrian.com


                    • Luis P. Mendes

                      #11
                      Re: ascii to latin1

                      [color=blue][color=green]
                      >> When I used the "NFD" option, I came across many errors on these and
                      >> possibly other codes: \xba, \xc9, \xcd.[/color]
                      >
                      > What errors? normalize method is not supposed to give any errors. You
                      > mean it doesn't work as expected? Well, I have to admit that using
                      > normalize is a far from perfect way to implement search. The most
                      > advanced algorithm is published by Unicode guys:
                      > <http://www.unicode.org/reports/tr10/> If you read it you'll understand
                      > it's not so easy.
                      >[color=green]
                      >> I tried to use "NFKD" instead, and the number of errors was only about
                      >> half a dozen, for a universe of 600000+ names, on code \xbf.
                      >> It looks like I have to do a search and substitute using regular
                      >> expressions for these cases. Or is there a better way to do it?[/color]
                      >
                      > Perhaps you can use unicode translate method to map the characters that
                      > still give you problems to whatever you want.
                      >[/color]

                      Errors occur when I assign the result of ''.join(cp for cp in de_str if
                      not unicodedata.category(cp).startswith('M')) to a variable. The same
                      happens with de_str. When I print the strings everything is ok.

                      Here's a short example of data:
                      115448,DAÇÃO
                      117788,DA 1º DE MO Nº 2

                      I used the following script to convert the data:
                      # -*- coding: iso8859-15 -*-

                      class Latin1ToAscii:

                          def abreFicheiro(self):
                              import csv
                              self.reader = csv.reader(open(self.input_file, "rb"))

                          def converter(self):
                              import unicodedata
                              self.lista_csv = []
                              for row in self.reader:
                                  s = unicode(row[1], "latin-1")
                                  de_str = unicodedata.normalize("NFD", s)
                                  nome = ''.join(cp for cp in de_str if not \
                                      unicodedata.category(cp).startswith('M'))

                                  linha_ascii = row[0] + "," + nome # *
                                  print linha_ascii.encode("ascii")
                                  self.lista_csv.append(linha_ascii)

                          def __init__(self):
                              self.input_file = 'nome_latin1.csv'
                              self.output_file = 'nome_ascii.csv'

                      if __name__ == "__main__":
                          f = Latin1ToAscii()
                          f.abreFicheiro()
                          f.converter()

                      And I got the following result:
                      $ python latin1_to_ascii.py
                      115448,DACAO
                      Traceback (most recent call last):
                        File "latin1_to_ascii.py", line 44, in ?
                          f.converter()
                        File "latin1_to_ascii.py", line 22, in converter
                          print linha_ascii.encode("ascii")
                      UnicodeEncodeError: 'ascii' codec can't encode character u'\xba' in
                      position 11: ordinal not in range(128)


                      The script converted the ÇÃ from the first line, but not the º from the
                      second one. Also, at the line marked *, I don't get a list like
                      [115448,DAÇÃO] but a [u'115448,DAÇÃO'] element, which doesn't suit my
                      needs.

                      Would you mind telling me what should I change?


                      Luis P. Mendes


                      • Peter Otten

                        #12
                        Re: ascii to latin1

                        Luis P. Mendes wrote:
                        [color=blue]
                        > The script converted the ÇÃ from the first line, but not the º from the
                        > second one. Still in *, I also don't get a list as [115448,DAÇÃO] but a
                        > [u'115448,DAÇÃO'] element, which doesn't suit my needs.
                        >
                        > Would you mind telling me what should I change?[/color]

                        Sometimes you are faster if you take the gloves off. Just write the
                        translation table with the desired substitute for every non-ascii
                        character in the latin1 charset by hand and be done with it.

                        Cyril Kyree



                        • richie@entrian.com

                          #13
                          Re: ascii to latin1

                          [Luis][color=blue]
                          > The script converted the ÇÃ from the first line, but not the º from
                          > the second one.[/color]

                          That's because º, 0xba, MASCULINE ORDINAL INDICATOR, is classed as a
                          letter and not a diacritic.

                          You can't encode it in ascii because it's not an ascii character, and
                          the script doesn't remove it because it only removes diacritics.

                          I don't know what the best thing to do with it would be - could you use
                          latin-1 as your base encoding and leave it in there? I don't speak any
                          language that uses it, but I'd guess that anyone searching for eg. 5º
                          (forgive me if I have the gender wrong 8-) would actually type 5º -
                          are there any Italian/Spanish/Portuguese speakers here who can confirm
                          or deny that?

                          In the general case, you have to decide what happens to characters that
                          aren't diacritics and don't live in your base encoding - what happens
                          when a Chinese user searches for a Chinese character? Probably you
                          should just encode(base_encoding, 'ignore').
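
The encode-with-error-handler idea can be sketched like this (Python 3 syntax; the sample string is an editorial addition):

```python
text = "DA 1º DE MO Nº 2"
# 'ignore' silently drops anything the target encoding can't represent.
print(text.encode("ascii", "ignore").decode("ascii"))   # 'DA 1 DE MO N 2'
# 'replace' substitutes '?' so the loss stays visible.
print(text.encode("ascii", "replace").decode("ascii"))  # 'DA 1? DE MO N? 2'
```

Whether to drop or visibly replace unrepresentable characters depends on whether the output feeds an index (drop) or a display (replace).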

                          --
                          Richie Hindle
                          richie@entrian.com


                          • Serge Orlov

                            #14
                            Re: ascii to latin1

                            Luis P. Mendes wrote:[color=blue]
                            > Errors occur when I assign the result of ''.join(cp for cp in de_str if
                            > not unicodedata.category(cp).startswith('M')) to a variable. The same
                            > happens with de_str. When I print the strings everything is ok.
                            >
                            > Here's a short example of data:
                            > 115448,DAÇÃO
                            > 117788,DA 1º DE MO Nº 2
                            >
                            > I used the following script to convert the data:
                            > # -*- coding: iso8859-15 -*-
                            >
                            > class Latin1ToAscii:
                            >
                            >     def abreFicheiro(self):
                            >         import csv
                            >         self.reader = csv.reader(open(self.input_file, "rb"))
                            >
                            >     def converter(self):
                            >         import unicodedata
                            >         self.lista_csv = []
                            >         for row in self.reader:
                            >             s = unicode(row[1], "latin-1")
                            >             de_str = unicodedata.normalize("NFD", s)
                            >             nome = ''.join(cp for cp in de_str if not \
                            >                 unicodedata.category(cp).startswith('M'))
                            >
                            >             linha_ascii = row[0] + "," + nome # *
                            >             print linha_ascii.encode("ascii")
                            >             self.lista_csv.append(linha_ascii)
                            >
                            >     def __init__(self):
                            >         self.input_file = 'nome_latin1.csv'
                            >         self.output_file = 'nome_ascii.csv'
                            >
                            > if __name__ == "__main__":
                            >     f = Latin1ToAscii()
                            >     f.abreFicheiro()
                            >     f.converter()
                            >
                            >
                            > And I got the following result:
                            > $ python latin1_to_ascii.py
                            > 115448,DACAO
                            > Traceback (most recent call last):
                            >   File "latin1_to_ascii.py", line 44, in ?
                            >     f.converter()
                            >   File "latin1_to_ascii.py", line 22, in converter
                            >     print linha_ascii.encode("ascii")
                            > UnicodeEncodeError: 'ascii' codec can't encode character u'\xba' in
                            > position 11: ordinal not in range(128)
                            >
                            >
                            > The script converted the ÇÃ from the first line, but not the º from the
                            > second one. Still in *, I also don't get a list as [115448,DAÇÃO] but a
                            > [u'115448,DAÇÃO'] element, which doesn't suit my needs.
                            >
                            > Would you mind telling me what should I change?[/color]

                            Calling this process "latin1 to ascii" was a misnomer, sorry that I
                            used this phrase. It should be called "latin1 to search key"; there is
                            no requirement that the key must be ascii, so change the corresponding
                            lines in your code:

                            linha_key = row[0] + "," + nome
                            print linha_key
                            self.lista_csv.append(linha_key.encode("latin-1"))

                            With regard to º, Richie already gave you food for thought: if you
                            want "1 DE MO" to match "1º DE MO", remove that symbol from the key
                            (linha_key = linha_key.translate({ord(u"º"): None})); if you don't
                            want such fuzzy matching, keep it.


                            • Luis P. Mendes

                              #15
                              Re: ascii to latin1


                              [color=blue]
                              >
                              > With regard to º, Richie already gave you food for thought: if you
                              > want "1 DE MO" to match "1º DE MO", remove that symbol from the key
                              > (linha_key = linha_key.translate({ord(u"º"): None})); if you don't
                              > want such fuzzy matching, keep it.
                              >[/color]
                              Thank you all for your help.

                              That was what I did. That symbol 'º' is not needed for the field.

                              It's working fine, now.


                              Luis P. Mendes

