Howto: Detect encodings?

  • R. Rajesh Jeba Anbiah

    Howto: Detect encodings?

    Here is some nice code to detect UTF-8:
    <http://in2.php.net/utf8_encode#39986>. But I couldn't work out the
    logic behind the script. If anyone knows it, please share.

    In particular, I would like to detect other encodings too, so I would
    like to understand the logic.

    For example, this text is in TSCII encoding (for Tamil):
    Žì¸õ. þÐ ¾Á¢Æ¢ø ¯ûÇ Åâ. þ¨¾ ¯í¸Ç¡ø ÀÊì¸ ÓÊ󾡸, ¯í¸û ¯Ä¡Å¢Â¢ø ±ó¾
    Å¢¾ Á¡üÈò¨¾Ôõ ¦ºöÂò §¾¨Å þø¨Ä. ¦¾¡¼÷óÐ þó¾ þ¨½Âò ¾Çò¾¢ø «¨Áó¾¢ÕìÌõ
    ¾Á¢úô Àì¸í¸¨Çò ¾¨¼Â¢ýÈ¢ô ÀÊì¸Ä¡õ.

    Any ideas? TIA.

    --
    | Just another PHP saint |
    Email: rrjanbiah-at-Y!com
  • Chung Leong

    #2
    Re: Howto: Detect encodings?

    "R. Rajesh Jeba Anbiah" <ng4rrjanbiah@rediffmail.com> wrote in message
    news:abc4d8b8.0406090618.bab78e5@posting.google.com...
    > Here is some nice code to detect UTF-8:
    > <http://in2.php.net/utf8_encode#39986>. But I couldn't work out the
    > logic behind the script. If anyone knows it, please share.

    The code tries to decode the text as UTF-8. If it runs into an error,
    the text is not UTF-8.
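
A minimal sketch of that decode-and-see idea, written in Python rather than PHP for brevity. The function name looks_like_utf8 is my own and is not taken from the linked script; it only illustrates "attempt a strict decode and treat any error as not UTF-8".

```python
def looks_like_utf8(data: bytes) -> bool:
    """Return True if data decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")  # strict mode raises on the first invalid sequence
    except UnicodeDecodeError:
        return False
    return True


print(looks_like_utf8("\u0ba4\u0bae\u0bbf\u0bb4\u0bcd".encode("utf-8")))  # Tamil text encoded as UTF-8
print(looks_like_utf8(b"\xbe\xc1\xa2\xc6"))  # high bytes that do not form UTF-8 sequences
```

Plain ASCII also passes this check, since ASCII is a subset of UTF-8. In PHP, the mbstring function mb_check_encoding() performs a comparable validity check.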

    > For example, this text is in TSCII encoding (for Tamil):
    > Žì¸õ. þÐ ¾Á¢Æ¢ø ¯ûÇ Åâ. þ¨¾ ¯í¸Ç¡ø ÀÊì¸ ÓÊ󾡸, ¯í¸û ¯Ä¡Å¢Â¢ø ±ó¾
    > Å¢¾ Á¡üÈò¨¾Ôõ ¦ºöÂò §¾¨Å þø¨Ä. ¦¾¡¼÷óÐ þó¾ þ¨½Âò ¾Çò¾¢ø «¨Áó¾¢ÕìÌõ
    > ¾Á¢úô Àì¸í¸¨Çò ¾¨¼Â¢ýÈ¢ô ÀÊì¸Ä¡õ.

    No easy way to do it. The question is, are you trying to distinguish between
    different possible ways of encoding Tamil or identify TSCII from all
    possible encodings?



    • Gerard van Wilgen

      #3
      Re: Howto: Detect encodings?

      "R. Rajesh Jeba Anbiah" <ng4rrjanbiah@rediffmail.com> wrote in message
      news:abc4d8b8.0406090618.bab78e5@posting.google.com...
      > Here is some nice code to detect UTF-8:
      > <http://in2.php.net/utf8_encode#39986>. But I couldn't work out the
      > logic behind the script. If anyone knows it, please share.
      >
      > In particular, I would like to detect other encodings too, so I would
      > like to understand the logic.
      >
      > For example, this text is in TSCII encoding (for Tamil):
      > Žì¸õ. þÐ ¾Á¢Æ¢ø ¯ûÇ Åâ. þ¨¾ ¯í¸Ç¡ø ÀÊì¸ ÓÊ󾡸, ¯í¸û ¯Ä¡Å¢Â¢ø ±ó¾
      > Å¢¾ Á¡üÈò¨¾Ôõ ¦ºöÂò §¾¨Å þø¨Ä. ¦¾¡¼÷óÐ þó¾ þ¨½Âò ¾Çò¾¢ø «¨Áó¾¢ÕìÌõ
      > ¾Á¢úô Àì¸í¸¨Çò ¾¨¼Â¢ýÈ¢ô ÀÊì¸Ä¡õ.
      >
      > Any ideas? TIA.

      A text that is not encoded in UTF-8 will usually contain many byte sequences
      that are invalid in UTF-8. Encodings like TSCII are much more difficult to
      detect, because every possible byte sequence is valid (even though it
      is not necessarily a meaningful character sequence for a human
      reader).
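
To illustrate the difference, here is a small Python sketch. As far as I know, Python's standard library has no TSCII codec, so ISO-8859-1 (latin-1) stands in as the single-byte encoding; that substitution is an assumption of this example, not something from the thread.

```python
all_bytes = bytes(range(256))  # every possible byte value, once each

# latin-1 assigns a character to each of the 256 byte values,
# so decoding can never fail, whatever the bytes "really" are.
as_latin1 = all_bytes.decode("latin-1")
print(len(as_latin1))

# UTF-8 has structural rules (lead bytes, continuation bytes),
# so arbitrary bytes are rejected.
try:
    all_bytes.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False
print(utf8_ok)
```

This is why a failed decode is a useful signal for UTF-8 but tells you nothing for a single-byte encoding.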

      When I have a text in an unknown encoding, I simply load it into an editor
      that supports many encodings and try them out until I find the
      setting that makes the text readable. Writing a script that can
      detect the encoding is obviously very difficult.
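
The manual try-them-all routine can at least be semi-automated. The sketch below, in Python, decodes the same bytes under several candidate encodings so that a human can eyeball the results. The function name preview_decodings, the candidate list, and the Russian sample are all assumptions made for illustration.

```python
from typing import Dict, List, Optional


def preview_decodings(data: bytes, candidates: List[str]) -> Dict[str, Optional[str]]:
    """Decode data under each candidate encoding; None marks a failed decode."""
    previews = {}
    for name in candidates:
        try:
            previews[name] = data.decode(name)
        except UnicodeDecodeError:
            previews[name] = None
    return previews


original = "Привет, мир"
sample = original.encode("koi8_r")  # pretend we do not know this
previews = preview_decodings(sample, ["utf-8", "koi8_r", "cp1251", "latin-1"])
for name, text in previews.items():
    print(f"{name:8} {text!r}")
```

Only one of the successful decodes will look like real text to a reader who knows the language; the script cannot make that judgment by itself.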

      I would say, forget it. It is not worth the trouble.

      Gerard van Wilgen
      --
      www.majstro.com (On-line translation dictionary / Enreta tradukvortaro)
      www.travlang.com/Ergane (Free translation dictionary for Windows / Senpaga
      tradukvortaro por Windows)


      • R. Rajesh Jeba Anbiah

        #4
        Re: Howto: Detect encodings?

        "Chung Leong" <chernyshevsky@hotmail.com> wrote in message news:<SM6dne6GEs2kAFrdRVn-vw@comcast.com>...
        > "R. Rajesh Jeba Anbiah" <ng4rrjanbiah@rediffmail.com> wrote in message
        > news:abc4d8b8.0406090618.bab78e5@posting.google.com...
        > > Here is some nice code to detect UTF-8:
        > > <http://in2.php.net/utf8_encode#39986>. But I couldn't work out the
        > > logic behind the script. If anyone knows it, please share.
        >
        > The code tries to decode the text as UTF-8. If it runs into an error,
        > the text is not UTF-8.

        Thanks for the info/logic. Though I'm a bit aware of Unicode, this is
        the first time I'm getting my hands on it... It's a bit of a pain, as
        PHP's Unicode support is broken and strange...

        > > For example, this text is in TSCII encoding (for Tamil):
        > > Žì¸õ. þÐ ¾Á¢Æ¢ø ¯ûÇ Åâ. þ¨¾ ¯í¸Ç¡ø ÀÊì¸ ÓÊ󾡸, ¯í¸û ¯Ä¡Å¢Â¢ø ±ó¾
        > > Å¢¾ Á¡üÈò¨¾Ôõ ¦ºöÂò §¾¨Å þø¨Ä. ¦¾¡¼÷óÐ þó¾ þ¨½Âò ¾Çò¾¢ø «¨Áó¾¢ÕìÌõ
        > > ¾Á¢úô Àì¸í¸¨Çò ¾¨¼Â¢ýÈ¢ô ÀÊì¸Ä¡õ.
        >
        > No easy way to do it. The question is, are you trying to distinguish between
        > different possible ways of encoding Tamil or identify TSCII from all
        > possible encodings?

        I'd be interested in trying both. Are you hinting that at least one of
        them is easier? Thanks.

        --
        | Just another PHP saint |
        Email: rrjanbiah-at-Y!com


        • R. Rajesh Jeba Anbiah

          #5
          Re: Howto: Detect encodings?

          "Gerard van Wilgen" <gvanwilgen@planet.nl> wrote in message news:<ca8eav$1p9$1@reader08.wxs.nl>...
          > "R. Rajesh Jeba Anbiah" <ng4rrjanbiah@rediffmail.com> wrote in message
          > news:abc4d8b8.0406090618.bab78e5@posting.google.com...
          <snip>
          >
          > A text that is not encoded in UTF-8 will usually contain many byte sequences
          > that are invalid in UTF-8. Encodings like TSCII are much more difficult to
          > detect, because every possible byte sequence is valid (even though it
          > is not necessarily a meaningful character sequence for a human
          > reader).
          >
          > When I have a text in an unknown encoding, I simply load it into an editor
          > that supports many encodings and try them out until I find the
          > setting that makes the text readable.

          Yes, I understand what you mean. Only a human can identify it
          clearly...

          > Writing a script that can
          > detect the encoding is obviously very difficult.
          >
          > I would say, forget it. It is not worth the trouble.

          This tool <http://www.murasu.com/converter/> can auto-detect the
          encoding. So I think it is still possible?

          --
          | Just another PHP saint |
          Email: rrjanbiah-at-Y!com


          • Chung Leong

            #6
            Re: Howto: Detect encodings?

            "R. Rajesh Jeba Anbiah" <ng4rrjanbiah@rediffmail.com> wrote in message
            news:abc4d8b8.0406100019.61b48ff7@posting.google.com...
            > Thanks for the info/logic. Though I'm a bit aware of Unicode, this is
            > the first time I'm getting my hands on it... It's a bit of a pain, as
            > PHP's Unicode support is broken and strange...

            Yeah, Unicode support in PHP is practically non-existent. You can still get
            by, though. More recent versions of PHP support Unicode code point escapes
            in regular expressions (with the /u modifier), so you can do things like
            /([\x{0900}-\x{09FF}]+)/u.
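
For comparison, here is the same code point range as a Python sketch (Python rather than PHP, and the sample string is made up). Note that U+0900 to U+09FF spans two Unicode blocks, Devanagari (U+0900 to U+097F) and Bengali (U+0980 to U+09FF), so the pattern matches runs of either script.

```python
import re

# Same range as in the post: U+0900-U+09FF (Devanagari and Bengali blocks).
indic_run = re.compile(r"[\u0900-\u09FF]+")

text = "Hello नमस्ते world বাংলা"
matches = indic_run.findall(text)
print(matches)
```

To match Hindi text only, narrowing the range to U+0900-U+097F would be the obvious adjustment.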

            UTF-8 is in general rather tricky to work with. For example, you can't limit
            the length of text entered by users using just the maxlength attribute in
            HTML, because it limits characters rather than bytes. And when a database
            width constraint chops off some UTF-8 text in mid-character, all sorts of
            funky things happen in the browser.
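
A short Python sketch of the mid-character problem and one way around it. The helper truncate_utf8 is a hypothetical name of mine, and the 10-byte limit is an arbitrary stand-in for a column width.

```python
def truncate_utf8(text, max_bytes):
    """Encode text as UTF-8 and cut it to max_bytes without splitting a character."""
    encoded = text.encode("utf-8")
    if len(encoded) <= max_bytes:
        return encoded
    # Slicing may leave an incomplete multi-byte sequence at the end;
    # errors="ignore" drops it when decoding, then we re-encode the clean prefix.
    return encoded[:max_bytes].decode("utf-8", errors="ignore").encode("utf-8")


word = "\u0ba4\u0bae\u0bbf\u0bb4\u0bcd"  # a Tamil word, 5 code points
raw = word.encode("utf-8")               # each of them takes 3 bytes in UTF-8
print(len(word), len(raw))

naive = raw[:10]                 # 3 whole characters plus 1 stray byte
safe = truncate_utf8(word, 10)   # 3 whole characters, nothing dangling
print(len(naive), len(safe))
```

The naive slice is no longer valid UTF-8, which is the kind of data that shows up as garbage in the browser. The safe version gives up the stray byte instead.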

            My advice is not to use Unicode unless you have to. I am not familiar with
            the Tamil script, but I have done a lot of work with Hindi. Most Hindi
            websites do not use Unicode (e.g. www.webdunia.com), because Unicode Hindi
            text requires rendering support from the operating system, which essentially
            limits you to Windows/IE only.

            > I'd be interested in trying both. Are you hinting that at least one of
            > them is easier? Thanks.

            Choosing one encoding out of three is obviously easier than choosing one out
            of several hundred. As far as I know, the only foolproof way is to run a
            spell check on the text. Statistical analysis could also work: just count
            how often each letter occurs and compare that to a known profile for
            that language.
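
A rough Python sketch of the letter-frequency idea. Everything here is an assumption for illustration: the function names are mine, cosine similarity is just one possible way to compare profiles, and Russian with three Cyrillic code pages stands in for Tamil because Python ships codecs for those. A real profile would be built from far more text than one sentence.

```python
import math
from collections import Counter


def char_profile(text):
    """Relative frequency of each letter in text (case-folded via lower())."""
    counts = Counter(ch for ch in text.lower() if ch.isalpha())
    total = sum(counts.values())
    return {ch: n / total for ch, n in counts.items()} if total else {}


def similarity(p, q):
    """Cosine similarity between two letter-frequency profiles."""
    dot = sum(freq * q.get(ch, 0.0) for ch, freq in p.items())
    norm = math.sqrt(sum(v * v for v in p.values())) * math.sqrt(sum(v * v for v in q.values()))
    return dot / norm if norm else 0.0


def guess_encoding(data, candidates, reference):
    """Return the candidate whose decoded text best matches the reference profile."""
    best_name, best_score = None, -1.0
    for name in candidates:
        try:
            decoded = data.decode(name)
        except UnicodeDecodeError:
            continue  # this candidate cannot even decode the bytes
        score = similarity(char_profile(decoded), reference)
        if score > best_score:
            best_name, best_score = name, score
    return best_name


# Known-good text in the target language, used to build the profile.
reference = char_profile(
    "это простой пример обычного русского текста для построения профиля частот букв"
)
# Bytes in an "unknown" encoding (really cp1251 here).
data = "кодировку текста можно определить по частоте букв".encode("cp1251")
candidates = ["koi8_r", "cp1251", "iso8859_5"]
print(guess_encoding(data, candidates, reference))
```

The wrong code pages still decode to Cyrillic letters, but with the common and rare letters shuffled, so their profiles match the reference less well. With very short inputs the scores get closer together and the guess becomes unreliable.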



            • R. Rajesh Jeba Anbiah

              #7
              Re: Howto: Detect encodings?

              "Chung Leong" <chernyshevsky@hotmail.com> wrote in message news:<Q-2dnYW2xMUIc1XdRVn-ig@comcast.com>...
              > "R. Rajesh Jeba Anbiah" <ng4rrjanbiah@rediffmail.com> wrote in message
              > news:abc4d8b8.0406100019.61b48ff7@posting.google.com...
              > > Thanks for the info/logic. Though I'm a bit aware of Unicode, this is
              > > the first time I'm getting my hands on it... It's a bit of a pain, as
              > > PHP's Unicode support is broken and strange...
              >
              > Yeah, Unicode support in PHP is practically non-existent. You can still get
              > by, though. More recent versions of PHP support Unicode code point escapes
              > in regular expressions (with the /u modifier), so you can do things like
              > /([\x{0900}-\x{09FF}]+)/u.
              >
              > UTF-8 is in general rather tricky to work with. For example, you can't limit
              > the length of text entered by users using just the maxlength attribute in
              > HTML, because it limits characters rather than bytes. And when a database
              > width constraint chops off some UTF-8 text in mid-character, all sorts of
              > funky things happen in the browser.

              Thanks a lot for your comments and help. As you said, UTF-8 behaves
              rather strangely; if we include UTF-8 text from other files, it works
              differently than expected. Anyway, we can somehow get it to work.

              > My advice is not to use Unicode unless you have to. I am not familiar with
              > the Tamil script, but I have done a lot of work with Hindi. Most Hindi
              > websites do not use Unicode (e.g. www.webdunia.com), because Unicode Hindi
              > text requires rendering support from the operating system, which essentially
              > limits you to Windows/IE only.

              Yeah, I understand. But for Tamil, staying away from Unicode may not
              help much, as many people are moving towards it. The reason is probably
              that many people here use only Windows/IE. <OT>BTW, is www.webdunia.com
              your work? :-)</OT>

              > > I'd be interested in trying both. Are you hinting that at least one of
              > > them is easier? Thanks.
              >
              > Choosing one encoding out of three is obviously easier than choosing one out
              > of several hundred. As far as I know, the only foolproof way is to run a
              > spell check on the text. Statistical analysis could also work: just count
              > how often each letter occurs and compare that to a known profile for
              > that language.

              In Tamil, some characters never start a word (unless someone made
              a typo). I had thought of using such grammar rules if there is no
              direct way to detect the encoding. Thanks a lot for your help.
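
A minimal Python sketch of how such a word-start rule could be scored once candidate bytes have been decoded to Unicode. The rule used here is a simplified stand-in and an assumption of mine: it only flags words that begin with a combining mark (for example a Tamil vowel sign or pulli), which cannot stand without a base letter. The actual Tamil rules about which letters may begin a word are not encoded, and the function name is hypothetical.

```python
import unicodedata


def suspicious_word_starts(text):
    """Count whitespace-separated words whose first character is a combining mark."""
    return sum(
        1
        for word in text.split()
        if unicodedata.category(word[0]).startswith("M")  # Mn, Mc or Me
    )


good = "தமிழ் மொழி"                 # both words start with a base consonant
bad = "\u0bbf\u0ba4 \u0bcd\u0bae"   # words starting with a vowel sign and a pulli
print(suspicious_word_starts(good), suspicious_word_starts(bad))
```

A decoding that produces many such words is less likely to be the right one, so the count could serve as a penalty when ranking candidate encodings.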

              --
              | Just another PHP saint |
              Email: rrjanbiah-at-Y!com


              • Chung Leong

                #8
                Re: Howto: Detect encodings?

                "R. Rajesh Jeba Anbiah" <ng4rrjanbiah@rediffmail.com> wrote in message
                news:abc4d8b8.0406110038.7d7b2009@posting.google.com...
                > Yeah, I understand. But for Tamil, staying away from Unicode may not
                > help much, as many people are moving towards it. The reason is probably
                > that many people here use only Windows/IE. <OT>BTW, is www.webdunia.com
                > your work? :-)</OT>

                If that's the case, then Unicode is definitely the preferred route. With
                Hindi, for some reason a lot of people are still using hack encodings. Just
                the other day I had to re-type a whole bunch of stuff, and I don't know a
                word of Hindi. I don't work for webdunia.com, but my project makes use of
                their content. Let me tell you, converting their custom encoding into Unicode
                was quite a challenge.



                • R. Rajesh Jeba Anbiah

                  #9
                  Re: Howto: Detect encodings?

                  "Chung Leong" <chernyshevsky@hotmail.com> wrote in message news:<X9Gdnc9H998xqlfdRVn-iQ@comcast.com>...
                  > "R. Rajesh Jeba Anbiah" <ng4rrjanbiah@rediffmail.com> wrote in message
                  > news:abc4d8b8.0406110038.7d7b2009@posting.google.com...
                  > > Yeah, I understand. But for Tamil, staying away from Unicode may not
                  > > help much, as many people are moving towards it. The reason is probably
                  > > that many people here use only Windows/IE. <OT>BTW, is www.webdunia.com
                  > > your work? :-)</OT>
                  >
                  > If that's the case, then Unicode is definitely the preferred route. With
                  > Hindi, for some reason a lot of people are still using hack encodings. Just
                  > the other day I had to re-type a whole bunch of stuff, and I don't know a
                  > word of Hindi. I don't work for webdunia.com, but my project makes use of
                  > their content. Let me tell you, converting their custom encoding into Unicode
                  > was quite a challenge.

                  Thanks, Chung, for your comments and help. AFAIK, webdunia uses the Dunia
                  encoding; here is a Dunia-to-Unicode map
                  <http://crl.nmsu.edu/~mleisher/naicode.html> and a Perl script
                  <http://crl.nmsu.edu/~mleisher/nai2ucs.pl> (in case you want them).
                  Thanks.

                  --
                  | Just another PHP saint |
                  Email: rrjanbiah-at-Y!com
