An attempt at guessing the encoding of a (non-unicode) string

  • Christos TZOTZIOY Georgiou

    An attempt at guessing the encoding of a (non-unicode) string

    This is a subject that comes up fairly often. Last night, I had the
    following idea, for which I would like feedback from you.

    This could be implemented as a function in codecs.py (let's call it
    "wild_guess"), that is based on some pre-calculated data. These
    pre-calculated data would be produced as follows:

    1. Create a dictionary (key: encoding, value: set of valid bytes for the
    encoding)

    1a. the sets can be constructed by trial and error:

    def valid_bytes(encoding):
        result = set()
        for byte in xrange(256):
            char = chr(byte)
            try:
                char.decode(encoding)
            except UnicodeDecodeError:
                pass
            else:
                result.add(char)
        return result
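
    For step 1 itself, the per-encoding sets can then be collected into the
    dictionary; the encoding list here is only an illustrative sample:

    candidate_encodings = ['iso8859-1', 'iso8859-7', 'koi8-r', 'cp1252']
    valid_sets = dict((enc, valid_bytes(enc))
                      for enc in candidate_encodings)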

    2. for every 8-bit encoding, some "representative" text is given (the
    longer, the better)

    2a. the following function is a quick generator of all two-char
    sequences from its string argument. It can be used both for the
    production of the pre-calculated data and for the analysis of a given
    string in the 'wild_guess' function.

    import itertools

    def str_window(text):
        return itertools.imap(
            text.__getslice__, xrange(0, len(text) - 1), xrange(2, len(text) + 1)
        )

    So for every encoding and 'representative' text, a bag of two-char
    sequences and their frequencies is calculated: frequencies[encoding] is
    a dict (key: two-char sequence, value: count).
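
    A minimal sketch of that calculation, assuming a hypothetical 'texts'
    dict that maps each encoding to its 'representative' text:

    def bigram_counts(text):
        # count every two-char window produced by str_window
        counts = {}
        for pair in str_window(text):
            counts[pair] = counts.get(pair, 0) + 1
        return counts

    frequencies = dict((enc, bigram_counts(txt))
                       for enc, txt in texts.items())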

    2b. do a lengthy comparison of the bags in order to find the most common
    two-char sequences that, as a set, can be considered unique to the
    specific encoding (a simple-minded version is sketched after step 2c).

    2c. For every encoding, keep only a set of the (chosen in step 2b)
    two-char sequences that were judged as 'representative'. Store these
    calculated sets plus those from step 1a as Python code in a helper
    module to be imported from codecs.py for the wild_guess function
    (reproduce the helper module every time some 'representative' text is
    added or modified).
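
    The simple-minded selection mentioned in 2b could look like this: take
    each encoding's n most frequent two-char sequences and keep only those
    that no other encoding also ranks highly (n is an arbitrary knob):

    def representative_sets(frequencies, n=200):
        # n most frequent two-char sequences per encoding
        top = {}
        for enc, counts in frequencies.items():
            pairs = sorted(counts, key=counts.get, reverse=True)
            top[enc] = set(pairs[:n])
        # keep only the sequences no other encoding's top-n shares
        result = {}
        for enc in top:
            others = set()
            for other in top:
                if other != enc:
                    others |= top[other]
            result[enc] = top[enc] - others
        return result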

    3. write the wild_guess function

    3a. the function 'wild_guess' would first construct a set from its
    argument:

    sample_set = set(argument)

    and by set operations against the sets from step 1a, we can exclude
    codecs where the sample set is not a subset of the encoding's valid set.
    I don't expect that this step would exclude many encodings, but I think
    it should not be skipped.

    3b. pass the argument through the str_window function, and construct a
    set of all two-char sequences

    3c. from all sets from step 2c, find the one whose intersection with the
    set from 3b is largest as a ratio of len(intersection)/len(encoding_set),
    and suggest the relevant encoding (see the combined sketch below).
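
    Putting 3a-3c together, a rough sketch, assuming the precomputed dicts
    from steps 1a (valid_sets) and 2c (rep_sets) are at hand:

    def wild_guess(argument, valid_sets, rep_sets):
        # 3a: exclude encodings whose valid set does not cover the sample
        sample_set = set(argument)
        candidates = [enc for enc in rep_sets
                      if sample_set <= valid_sets[enc]]
        # 3b: all two-char sequences of the argument
        sample_pairs = set(str_window(argument))
        # 3c: score by overlap with each encoding's representative set
        best_enc, best_ratio = None, -1.0
        for enc in candidates:
            enc_set = rep_sets[enc]
            if not enc_set:
                continue
            ratio = len(sample_pairs & enc_set) / float(len(enc_set))
            if ratio > best_ratio:
                best_enc, best_ratio = enc, ratio
        return best_enc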

    What do you think? I can't test whether that would work unless I have
    'representative' texts for various encodings. Please feel free to help
    or bash :)

    PS I know how generic 'representative' is, and how hard it is to qualify
    some text as such, therefore the quotes. That is why I said 'the
    longer, the better'.
    --
    TZOTZIOY, I speak England very best,
    Ils sont fous ces Redmontains! --Harddix
  • Jon Willeke

    #2
    Re: An attempt at guessing the encoding of a (non-unicode) string

    Christos TZOTZIOY Georgiou wrote:
    > This is a subject that comes up fairly often. Last night, I had the
    > following idea, for which I would like feedback from you.
    >
    > This could be implemented as a function in codecs.py (let's call it
    > "wild_guess"), that is based on some pre-calculated data. These
    > pre-calculated data would be produced as follows:
    ...
    > What do you think? I can't test whether that would work unless I have
    > 'representative' texts for various encodings. Please feel free to help
    > or bash :)

    The representative text would, in some circles, be called a training
    corpus. See the Natural Language Toolkit for some modules that may help
    you prototype this approach:

    <http://nltk.sf.net/>

    In particular, check out the probability tutorial.


    • Christos TZOTZIOY Georgiou

      #3
      Re: An attempt at guessing the encoding of a (non-unicode) string

      On Fri, 02 Apr 2004 15:05:42 GMT, rumours say that Jon Willeke
      <j.dot.willeke@verizon.dot.net> might have written:

      >Christos TZOTZIOY Georgiou wrote:
      <snip>
      >>
      >> This could be implemented as a function in codecs.py (let's call it
      >> "wild_guess"), that is based on some pre-calculated data. These
      >> pre-calculated data would be produced as follows:
      >...
      <snip>

      [Jon]
      >The representative text would, in some circles, be called a training
      >corpus. See the Natural Language Toolkit for some modules that may help
      >you prototype this approach:
      >
      > <http://nltk.sf.net/>
      >
      >In particular, check out the probability tutorial.

      Thanks for the hint, and I am browsing the documentation now. However,
      I'd like to create something that would not be dependent on external
      Python libraries, so that anyone interested would just download a small
      module that would do the job, hopefully well.
      --
      TZOTZIOY, I speak England very best,
      Ils sont fous ces Redmontains! --Harddix


      • David Eppstein

        #4
        Re: An attempt at guessing the encoding of a (non-unicode) string

        I've been getting decent results by a much simpler approach:
        count the number of characters for which the encoding produces a symbol
        c for which c.isalpha() or c.isspace(), subtract a large penalty if
        using the encoding leads to UnicodeDecodeError, and take the encoding
        with the largest count.
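
        A minimal sketch of this scoring scheme; the candidate list and the
        penalty value are arbitrary placeholders, not part of the approach:

        def guess_encoding(data, candidates=('iso8859-1', 'iso8859-7')):
            best_enc, best_score = None, None
            for enc in candidates:
                try:
                    text = data.decode(enc)
                    # count alphabetic and whitespace characters
                    score = sum(1 for c in text if c.isalpha() or c.isspace())
                except UnicodeDecodeError:
                    score = -1000  # the large penalty for a failed decode
                if best_enc is None or score > best_score:
                    best_enc, best_score = enc, score
            return best_enc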

        --
        David Eppstein http://www.ics.uci.edu/~eppstein/
        Univ. of California, Irvine, School of Information & Computer Science


        • John Roth

          #5
          Re: An attempt at guessing the encoding of a (non-unicode) string


          "David Eppstein" <eppstein@ics.u ci.edu> wrote in message
          news:eppstein-8C467F.14490702 042004@news.ser vice.uci.edu...[color=blue]
          > I've been getting decent results by a much simpler approach:
          > count the number of characters for which the encoding produces a symbol
          > c for which c.isalpha() or c.isspace(), subtract a large penalty if
          > using the encoding leads to UnicodeDecodeEr ror, and take the encoding
          > with the largest count.[/color]

          Shouldn't that be isalnum()? Or does your data not have
          very many numbers?

          John Roth
          >
          > --
          > David Eppstein http://www.ics.uci.edu/~eppstein/
          > Univ. of California, Irvine, School of Information & Computer Science



          • David Eppstein

            #6
            Re: An attempt at guessing the encoding of a (non-unicode) string

            In article <106thmedmq162ce@news.supernews.com>,
            "John Roth" <newsgroups@jhrothjr.com> wrote:

            > "David Eppstein" <eppstein@ics.uci.edu> wrote in message
            > news:eppstein-8C467F.14490702042004@news.service.uci.edu...
            > > I've been getting decent results by a much simpler approach:
            > > count the number of characters for which the encoding produces a symbol
            > > c for which c.isalpha() or c.isspace(), subtract a large penalty if
            > > using the encoding leads to UnicodeDecodeError, and take the encoding
            > > with the largest count.
            >
            > Shouldn't that be isalnum()? Or does your data not have
            > very many numbers?

            It's only important if your text has many code positions which produce a
            digit in one encoding and not in another, and which are hard to
            disambiguate using isalpha() alone. I haven't encountered that
            situation.

            --
            David Eppstein http://www.ics.uci.edu/~eppstein/
            Univ. of California, Irvine, School of Information & Computer Science


            • Roger Binns

              #7
              Re: An attempt at guessing the encoding of a (non-unicode) string

              Christos TZOTZIOY Georgiou wrote:
              > This could be implemented as a function in codecs.py (let's call it
              > "wild_guess"), that is based on some pre-calculated data.

              Windows already has a related function:

              IsTextUnicode()

              Read more about it here:

              http://msdn.microsoft.com/library/de...icode_81np.asp
              Roger



              • Christos TZOTZIOY Georgiou

                #8
                Re: An attempt at guessing the encoding of a (non-unicode) string

                On Sat, 3 Apr 2004 12:22:05 -0800, rumours say that "Roger Binns"
                <rogerb@rogerbinns.com> might have written:

                >Christos TZOTZIOY Georgiou wrote:
                >> This could be implemented as a function in codecs.py (let's call it
                >> "wild_guess"), that is based on some pre-calculated data.

                >Windows already has a related function:
                >
                >http://msdn.microsoft.com/library/de...icode_81np.asp

                As far as I understand, this function tests whether its argument is
                valid Unicode text, so it has little to do with the issue I brought up:
                take a Python string (8-bit bytes) and try to guess its encoding (e.g.,
                iso8859-1, iso8859-7, etc.).

                There must be a similar function used for the "auto guess encoding"
                function of the MS Internet Explorer, however:

                1. even if it is exported and usable under windows, it is not platform
                independent

                2. its guessing success rate (until IE 5.5 which I happen to use) is not
                very high

                <snip>

                Thanks for your reply, anyway.
                --
                TZOTZIOY, I speak England very best,
                Ils sont fous ces Redmontains! --Harddix


                • Christos TZOTZIOY Georgiou

                  #9
                  Re: An attempt at guessing the encoding of a (non-unicode) string

                  On Fri, 02 Apr 2004 14:49:07 -0800, rumours say that David Eppstein
                  <eppstein@ics.uci.edu> might have written:

                  >I've been getting decent results by a much simpler approach:
                  >count the number of characters for which the encoding produces a symbol
                  >c for which c.isalpha() or c.isspace(), subtract a large penalty if
                  >using the encoding leads to UnicodeDecodeError, and take the encoding
                  >with the largest count.

                  Somebody (by email only so far) has suggested that spambayes could be
                  used for the task... perhaps they're right, but this is not as simple
                  and independent a solution as I would like to deliver.

                  I believe that your idea of a score is a good one; I feel that the
                  key should be two-char combinations, but I'll have to compare the
                  success rates of both one-char and two-char keys.

                  I'll try to search for "representative" texts on the web for as many
                  encodings as I can; any pointers or links from non-English speakers
                  would be welcome in the thread.
                  --
                  TZOTZIOY, I speak England very best,
                  Ils sont fous ces Redmontains! --Harddix


                  • Seo Sanghyeon

                    #10
                    Re: An attempt at guessing the encoding of a (non-unicode) string

                    I think you will find Mozilla's charset autodetection method
                    interesting.

                    A composite approach to language/encoding detection
                    http://www.mozilla.org/projects/intl...Detection.html
                    Perhaps this can be used with PyXPCOM. I don't know.


                    • David Eppstein

                      #11
                      Re: An attempt at guessing the encoding of a (non-unicode) string

                      In article <6pa270h031thgleo4a31itktb95n9e4rvm@4ax.com>,
                      Christos "TZOTZIOY" Georgiou <tzot@sil-tec.gr> wrote:

                      > >I've been getting decent results by a much simpler approach:
                      > >count the number of characters for which the encoding produces a symbol
                      > >c for which c.isalpha() or c.isspace(), subtract a large penalty if
                      > >using the encoding leads to UnicodeDecodeError, and take the encoding
                      > >with the largest count.
                      >
                      > Somebody (by email only so far) has suggested that spambayes could be
                      > used for the task... perhaps they're right, but this is not as simple
                      > and independent a solution as I would like to deliver.
                      >
                      > I believe that your idea of a score is a good one; I feel that the
                      > key should be two-char combinations, but I'll have to compare the
                      > success rates of both one-char and two-char keys.
                      >
                      > I'll try to search for "representative" texts on the web for as many
                      > encodings as I can; any pointers or links from non-English speakers
                      > would be welcome in the thread.

                      BTW, if you're going to implement the single-char version, at least for
                      encodings that translate one byte -> one unicode position (e.g., not
                      utf8), and your texts are large enough, it will be faster to precompute
                      a table of byte frequencies in the text and then compute the score by
                      summing the frequencies of alphabetic bytes.
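
                      A rough sketch of that speedup, assuming a single-byte
                      codec and reusing the isalpha()/isspace() score above:

                      def fast_score(data, encoding):
                          # table of how often each byte value occurs
                          freq = [0] * 256
                          for byte in data:
                              freq[ord(byte)] += 1
                          # decode each byte value once; weight by frequency
                          total = 0
                          for value in xrange(256):
                              if not freq[value]:
                                  continue
                              try:
                                  c = chr(value).decode(encoding)
                              except UnicodeDecodeError:
                                  return -1000  # large penalty, as before
                              if c.isalpha() or c.isspace():
                                  total += freq[value]
                          return total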

                      --
                      David Eppstein http://www.ics.uci.edu/~eppstein/
                      Univ. of California, Irvine, School of Information & Computer Science


                      • Christos TZOTZIOY Georgiou

                        #12
                        Re: An attempt at guessing the encoding of a (non-unicode) string

                        On 5 Apr 2004 08:14:54 -0700, rumours say that unendliche@hanmail.net
                        (Seo Sanghyeon) might have written:

                        >I think you will find Mozilla's charset autodetection method
                        >interesting.
                        >
                        >A composite approach to language/encoding detection
                        >http://www.mozilla.org/projects/intl...Detection.html

                        Thank you!
                        >Perhaps this can be used with PyXPCOM. I don't know.

                        Neither do I...
                        --
                        TZOTZIOY, I speak England very best,
                        Ils sont fous ces Redmontains! --Harddix


                        • Christos TZOTZIOY Georgiou

                          #13
                          Re: An attempt at guessing the encoding of a (non-unicode) string

                          On Mon, 05 Apr 2004 13:37:34 -0700, rumours say that David Eppstein
                          <eppstein@ics.uci.edu> might have written:

                          >BTW, if you're going to implement the single-char version, at least for
                          >encodings that translate one byte -> one unicode position (e.g., not
                          >utf8), and your texts are large enough, it will be faster to precompute
                          >a table of byte frequencies in the text and then compute the score by
                          >summing the frequencies of alphabetic bytes.

                          Thanks for the pointer, David. However, as it often happens, I came
                          second (or, probably, n-th :). Seo Sanghyeon sent a URL that includes a
                          two-char proposal, and it provides an algorithm in section 4.7.1 that I
                          find appropriate for this matter:

                          http://www.mozilla.org/projects/intl...Detection.html
                          --
                          TZOTZIOY, I speak England very best,
                          Ils sont fous ces Redmontains! --Harddix
