An attempt at guessing the encoding of a (non-unicode) string

  • Christos TZOTZIOY Georgiou

    An attempt at guessing the encoding of a (non-unicode) string

    This is a subject that comes up fairly often. Last night, I had the
    following idea, for which I would like feedback from you.

    This could be implemented as a function in codecs.py (let's call it
    "wild_guess"), that is based on some pre-calculated data. These
    pre-calculated data would be produced as follows:

    1. Create a dictionary (key: encoding, value: set of valid bytes for the
    encoding)

    1a. the sets can be constructed by trial and error:

    def valid_bytes(encoding):
        result = set()
        for byte in xrange(256):
            char = chr(byte)
            try:
                char.decode(encoding)
            except UnicodeDecodeError:
                pass
            else:
                result.add(char)
        return result
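
    For step 1 itself, the per-encoding sets can then be collected into the
    dictionary; the encoding list here is only an illustrative sample:

    candidate_encodings = ['iso8859-1', 'iso8859-7', 'koi8-r', 'cp1252']
    valid_sets = dict((enc, valid_bytes(enc))
                      for enc in candidate_encodings)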

    2. for every 8-bit encoding, some "representative" text is given (the
    longer, the better)

    2a. the following function is a quick generator of all two-char
    sequences from its string argument. It can be used both for the
    production of the pre-calculated data and for the analysis of a given
    string in the 'wild_guess' function.

    import itertools

    def str_window(text):
        return itertools.imap(
            text.__getslice__, xrange(0, len(text) - 1), xrange(2, len(text) + 1)
        )

    So for every encoding and 'representative' text, a bag of two-char
    sequences and their frequencies is calculated: frequencies[encoding] is
    a dict (key: two-char sequence, value: count).
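
    A minimal sketch of that calculation, assuming a hypothetical 'texts'
    dict that maps each encoding to its 'representative' text:

    def bigram_counts(text):
        # count every two-char window produced by str_window
        counts = {}
        for pair in str_window(text):
            counts[pair] = counts.get(pair, 0) + 1
        return counts

    frequencies = dict((enc, bigram_counts(txt))
                       for enc, txt in texts.items())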

    2b. do a lengthy comparison of the bags in order to find the most common
    two-char sequences that, as a set, can be considered unique to the
    specific encoding (a simple-minded version is sketched after step 2c).

    2c. For every encoding, keep only a set of the (chosen in step 2b)
    two-char sequences that were judged as 'representative'. Store these
    calculated sets plus those from step 1a as Python code in a helper
    module to be imported from codecs.py for the wild_guess function
    (reproduce the helper module every time some 'representative' text is
    added or modified).
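
    The simple-minded selection mentioned in 2b could look like this: take
    each encoding's n most frequent two-char sequences and keep only those
    that no other encoding also ranks highly (n is an arbitrary knob):

    def representative_sets(frequencies, n=200):
        # n most frequent two-char sequences per encoding
        top = {}
        for enc, counts in frequencies.items():
            pairs = sorted(counts, key=counts.get, reverse=True)
            top[enc] = set(pairs[:n])
        # keep only the sequences no other encoding's top-n shares
        result = {}
        for enc in top:
            others = set()
            for other in top:
                if other != enc:
                    others |= top[other]
            result[enc] = top[enc] - others
        return result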

    3. write the wild_guess function

    3a. the function 'wild_guess' would first construct a set from its
    argument:

    sample_set = set(argument)

    and by set operations against the sets from step 1a, we can exclude
    codecs where the sample set is not a subset of the encoding's valid set.
    I don't expect that this step would exclude many encodings, but I think
    it should not be skipped.

    3b. pass the argument through the str_window function, and construct a
    set of all two-char sequences

    3c. from all sets from step 2c, find the one whose intersection with the
    set from 3b is largest as a ratio of len(intersection)/len(encoding_set),
    and suggest the relevant encoding (see the combined sketch below).
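
    Putting 3a-3c together, a rough sketch, assuming the precomputed dicts
    from steps 1a (valid_sets) and 2c (rep_sets) are at hand:

    def wild_guess(argument, valid_sets, rep_sets):
        # 3a: exclude encodings whose valid set does not cover the sample
        sample_set = set(argument)
        candidates = [enc for enc in rep_sets
                      if sample_set <= valid_sets[enc]]
        # 3b: all two-char sequences of the argument
        sample_pairs = set(str_window(argument))
        # 3c: score by overlap with each encoding's representative set
        best_enc, best_ratio = None, -1.0
        for enc in candidates:
            enc_set = rep_sets[enc]
            if not enc_set:
                continue
            ratio = len(sample_pairs & enc_set) / float(len(enc_set))
            if ratio > best_ratio:
                best_enc, best_ratio = enc, ratio
        return best_enc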

    What do you think? I can't test whether that would work unless I have
    'representative' texts for various encodings. Please feel free to help
    or bash :)

    PS I know how generic 'representative' is, and how hard it is to qualify
    some text as such, therefore the quotes. That is why I said 'the
    longer, the better'.
    --
    TZOTZIOY, I speak England very best,
    Ils sont fous ces Redmontains! --Harddix
  • Jon Willeke

    #2
    Re: An attempt at guessing the encoding of a (non-unicode) string

    Christos TZOTZIOY Georgiou wrote:
    > This is a subject that comes up fairly often. Last night, I had the
    > following idea, for which I would like feedback from you.
    >
    > This could be implemented as a function in codecs.py (let's call it
    > "wild_guess"), that is based on some pre-calculated data. These
    > pre-calculated data would be produced as follows:
    ...
    > What do you think? I can't test whether that would work unless I have
    > 'representative' texts for various encodings. Please feel free to help
    > or bash :)

    The representative text would, in some circles, be called a training
    corpus. See the Natural Language Toolkit for some modules that may help
    you prototype this approach:

    <http://nltk.sf.net/>

    In particular, check out the probability tutorial.


    • Christos TZOTZIOY Georgiou

      #3
      Re: An attempt at guessing the encoding of a (non-unicode) string

      On Fri, 02 Apr 2004 15:05:42 GMT, rumours say that Jon Willeke
      <j.dot.willeke@verizon.dot.net> might have written:

      >Christos TZOTZIOY Georgiou wrote:
      <snip>
      >>
      >> This could be implemented as a function in codecs.py (let's call it
      >> "wild_guess"), that is based on some pre-calculated data. These
      >> pre-calculated data would be produced as follows:
      >...
      <snip>

      [Jon]
      >The representative text would, in some circles, be called a training
      >corpus. See the Natural Language Toolkit for some modules that may help
      >you prototype this approach:
      >
      > <http://nltk.sf.net/>
      >
      >In particular, check out the probability tutorial.

      Thanks for the hint, and I am browsing the documentation now. However,
      I'd like to create something that would not be dependent on external
      Python libraries, so that anyone interested would just download a small
      module that would do the job, hopefully well.
      --
      TZOTZIOY, I speak England very best,
      Ils sont fous ces Redmontains! --Harddix


      • David Eppstein

        #4
        Re: An attempt at guessing the encoding of a (non-unicode) string

        I've been getting decent results by a much simpler approach:
        count the number of characters for which the encoding produces a symbol
        c for which c.isalpha() or c.isspace(), subtract a large penalty if
        using the encoding leads to UnicodeDecodeError, and take the encoding
        with the largest count.
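
        A minimal sketch of this scoring scheme; the candidate list and the
        penalty value are arbitrary placeholders, not part of the approach:

        def guess_encoding(data, candidates=('iso8859-1', 'iso8859-7')):
            best_enc, best_score = None, None
            for enc in candidates:
                try:
                    text = data.decode(enc)
                    # count alphabetic and whitespace characters
                    score = sum(1 for c in text if c.isalpha() or c.isspace())
                except UnicodeDecodeError:
                    score = -1000  # the large penalty for a failed decode
                if best_enc is None or score > best_score:
                    best_enc, best_score = enc, score
            return best_enc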

        --
        David Eppstein http://www.ics.uci.edu/~eppstein/
        Univ. of California, Irvine, School of Information & Computer Science


        • John Roth

          #5
          Re: An attempt at guessing the encoding of a (non-unicode) string


          "David Eppstein" <eppstein@ics.u ci.edu> wrote in message
          news:eppstein-8C467F.14490702 042004@news.ser vice.uci.edu...[color=blue]
          > I've been getting decent results by a much simpler approach:
          > count the number of characters for which the encoding produces a symbol
          > c for which c.isalpha() or c.isspace(), subtract a large penalty if
          > using the encoding leads to UnicodeDecodeEr ror, and take the encoding
          > with the largest count.[/color]

          Shouldn't that be isalnum()? Or does your data not have
          very many numbers?

          John Roth
          >
          > --
          > David Eppstein http://www.ics.uci.edu/~eppstein/
          > Univ. of California, Irvine, School of Information & Computer Science



          • David Eppstein

            #6
            Re: An attempt at guessing the encoding of a (non-unicode) string

            In article <106thmedmq162ce@news.supernews.com>,
            "John Roth" <newsgroups@jhrothjr.com> wrote:

            > "David Eppstein" <eppstein@ics.uci.edu> wrote in message
            > news:eppstein-8C467F.14490702042004@news.service.uci.edu...
            > > I've been getting decent results by a much simpler approach:
            > > count the number of characters for which the encoding produces a symbol
            > > c for which c.isalpha() or c.isspace(), subtract a large penalty if
            > > using the encoding leads to UnicodeDecodeError, and take the encoding
            > > with the largest count.
            >
            > Shouldn't that be isalnum()? Or does your data not have
            > very many numbers?

            It's only important if your text has many code positions which produce a
            digit in one encoding and not in another, and which are hard to
            disambiguate using isalpha() alone. I haven't encountered that
            situation.

            --
            David Eppstein http://www.ics.uci.edu/~eppstein/
            Univ. of California, Irvine, School of Information & Computer Science


            • Roger Binns

              #7
              Re: An attempt at guessing the encoding of a (non-unicode) string

              Christos TZOTZIOY Georgiou wrote:
              > This could be implemented as a function in codecs.py (let's call it
              > "wild_guess"), that is based on some pre-calculated data.

              Windows already has a related function:

              IsTextUnicode()

              Read more about it here:

              http://msdn.microsoft.com/library/de...icode_81np.asp
              Roger



              • Christos TZOTZIOY Georgiou

                #8
                Re: An attempt at guessing the encoding of a (non-unicode) string

                On Sat, 3 Apr 2004 12:22:05 -0800, rumours say that "Roger Binns"
                <rogerb@rogerbinns.com> might have written:

                >Christos TZOTZIOY Georgiou wrote:
                >> This could be implemented as a function in codecs.py (let's call it
                >> "wild_guess"), that is based on some pre-calculated data.

                >Windows already has a related function:
                >
                >http://msdn.microsoft.com/library/de...icode_81np.asp

                As far as I understand, this function tests whether its argument is
                valid Unicode text, so it has little to do with the issue I brought up:
                take a Python string (8-bit bytes) and try to guess its encoding (e.g.,
                iso8859-1, iso8859-7, etc.).

                There must be a similar function used for the "auto guess encoding"
                function of the MS Internet Explorer, however:

                1. even if it is exported and usable under windows, it is not platform
                independent

                2. its guessing success rate (until IE 5.5 which I happen to use) is not
                very high

                <snip>

                Thanks for your reply, anyway.
                --
                TZOTZIOY, I speak England very best,
                Ils sont fous ces Redmontains! --Harddix


                • Christos TZOTZIOY Georgiou

                  #9
                  Re: An attempt at guessing the encoding of a (non-unicode) string

                  On Fri, 02 Apr 2004 14:49:07 -0800, rumours say that David Eppstein
                  <eppstein@ics.uci.edu> might have written:

                  >I've been getting decent results by a much simpler approach:
                  >count the number of characters for which the encoding produces a symbol
                  >c for which c.isalpha() or c.isspace(), subtract a large penalty if
                  >using the encoding leads to UnicodeDecodeError, and take the encoding
                  >with the largest count.

                  Somebody (by email only so far) has suggested that spambayes could be
                  used for the task... perhaps they're right, but this is not as simple
                  and independent a solution as I would like to deliver.

                  I believe that your idea of a score is a good one; I feel that the
                  key should be two-char combinations, but I'll have to compare the
                  success rates of both one-char and two-char keys.

                  I'll try to search for "representative" texts on the web for as many
                  encodings as I can; any pointers or links from non-English speakers
                  would be welcome in the thread.
                  --
                  TZOTZIOY, I speak England very best,
                  Ils sont fous ces Redmontains! --Harddix


                  • Seo Sanghyeon

                    #10
                    Re: An attempt at guessing the encoding of a (non-unicode) string

                    I think you will find Mozilla's charset autodetection method
                    interesting.

                    A composite approach to language/encoding detection
                    http://www.mozilla.org/projects/intl...Detection.html
                    Perhaps this can be used with PyXPCOM. I don't know.


                    • David Eppstein

                      #11
                      Re: An attempt at guessing the encoding of a (non-unicode) string

                      In article <6pa270h031thgleo4a31itktb95n9e4rvm@4ax.com>,
                      Christos "TZOTZIOY" Georgiou <tzot@sil-tec.gr> wrote:

                      > >I've been getting decent results by a much simpler approach:
                      > >count the number of characters for which the encoding produces a symbol
                      > >c for which c.isalpha() or c.isspace(), subtract a large penalty if
                      > >using the encoding leads to UnicodeDecodeError, and take the encoding
                      > >with the largest count.
                      >
                      > Somebody (by email only so far) has suggested that spambayes could be
                      > used for the task... perhaps they're right, but this is not as simple
                      > and independent a solution as I would like to deliver.
                      >
                      > I believe that your idea of a score is a good one; I feel that the
                      > key should be two-char combinations, but I'll have to compare the
                      > success rates of both one-char and two-char keys.
                      >
                      > I'll try to search for "representative" texts on the web for as many
                      > encodings as I can; any pointers or links from non-English speakers
                      > would be welcome in the thread.

                      BTW, if you're going to implement the single-char version, at least for
                      encodings that translate one byte -> one unicode position (e.g., not
                      utf8), and your texts are large enough, it will be faster to precompute
                      a table of byte frequencies in the text and then compute the score by
                      summing the frequencies of alphabetic bytes.
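
                      A rough sketch of that speedup, assuming a single-byte
                      codec and reusing the isalpha()/isspace() score above:

                      def fast_score(data, encoding):
                          # table of how often each byte value occurs
                          freq = [0] * 256
                          for byte in data:
                              freq[ord(byte)] += 1
                          # decode each byte value once; weight by frequency
                          total = 0
                          for value in xrange(256):
                              if not freq[value]:
                                  continue
                              try:
                                  c = chr(value).decode(encoding)
                              except UnicodeDecodeError:
                                  return -1000  # large penalty, as before
                              if c.isalpha() or c.isspace():
                                  total += freq[value]
                          return total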

                      --
                      David Eppstein http://www.ics.uci.edu/~eppstein/
                      Univ. of California, Irvine, School of Information & Computer Science


                      • Christos TZOTZIOY Georgiou

                        #12
                        Re: An attempt at guessing the encoding of a (non-unicode) string

                        On 5 Apr 2004 08:14:54 -0700, rumours say that unendliche@hanmail.net
                        (Seo Sanghyeon) might have written:

                        >I think you will find Mozilla's charset autodetection method
                        >interesting.
                        >
                        >A composite approach to language/encoding detection
                        >http://www.mozilla.org/projects/intl...Detection.html

                        Thank you!
                        >Perhaps this can be used with PyXPCOM. I don't know.

                        Neither do I...
                        --
                        TZOTZIOY, I speak England very best,
                        Ils sont fous ces Redmontains! --Harddix


                        • Christos TZOTZIOY Georgiou

                          #13
                          Re: An attempt at guessing the encoding of a (non-unicode) string

                          On Mon, 05 Apr 2004 13:37:34 -0700, rumours say that David Eppstein
                          <eppstein@ics.uci.edu> might have written:

                          >BTW, if you're going to implement the single-char version, at least for
                          >encodings that translate one byte -> one unicode position (e.g., not
                          >utf8), and your texts are large enough, it will be faster to precompute
                          >a table of byte frequencies in the text and then compute the score by
                          >summing the frequencies of alphabetic bytes.

                          Thanks for the pointer, David. However, as it often happens, I came
                          second (or, probably, n-th :). Seo Sanghyeon sent a URL that includes a
                          two-char proposal, and it provides an algorithm in section 4.7.1 that I
                          find appropriate for this matter:

                          http://www.mozilla.org/projects/intl...Detection.html
                          --
                          TZOTZIOY, I speak England very best,
                          Ils sont fous ces Redmontains! --Harddix
