how to test text to see if maybe it is UTF-8????

  • lawrence

    how to test text to see if maybe it is UTF-8????

    Someone on www.php.net suggested using a seems_utf8() method to test
    text for UTF-8 character encoding but didn't specify how to write such
    a method. Can anyone suggest a test that might work? Something that
    maybe gives 90% confidence that a given block of text is or is not
    UTF-8 encoded?
  • Simon Stienen

    #2
    Re: how to test text to see if maybe it is UTF-8????

    lawrence <lkrubner@geocities.com> wrote:
    > Someone on www.php.net suggested using a seems_utf8() method to test
    > text for UTF-8 character encoding but didn't specify how to write such
    > a method. Can anyone suggest a test that might work? Something that
    > maybe gives 90% confidence that a given block of text is or is not
    > UTF-8 encoded?

    You may be able to decide that a given string is *not* UTF-8, but there is
    no way to decide with certainty that the string *is* UTF-8. Therefore,
    "seems_utf8" is a good name for such a function.

    How validation is done:
    Take the string. If it contains no byte in the range 0x80 to 0xFF, it
    doesn't matter whether you treat the text as UTF-8 or as any ISO-8859
    encoding, since the first 128 characters have the same byte values in all
    of these encodings.
    However, if there actually *are* bytes with a value of 128 or higher,
    check whether each such sequence would be a valid UTF-8 sequence (see
    UTF-8 in Wikipedia for this). If every such sequence is valid UTF-8, the
    string itself *might* be UTF-8. Of course, it could also be a sequence of
    extended ASCII/ANSI characters. It's impossible to be sure about that.
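
To make the ambiguity concrete, here is a minimal sketch (the helper name hasHighBytes is made up for this illustration, it is not a PHP built-in). It shows the ASCII shortcut and a byte pair that is legal under both readings:

```php
<?php
// Hypothetical helper: true if the string contains any byte >= 0x80.
function hasHighBytes(string $s): bool
{
    return preg_match('/[\x80-\xFF]/', $s) === 1;
}

// Pure ASCII: identical bytes in UTF-8 and ISO-8859-1, nothing to decide.
var_dump(hasHighBytes("plain ASCII text"));   // bool(false)

// The two bytes C3 A9 are "é" when read as UTF-8, but "Ã©" when read as
// ISO-8859-1. Both readings are legal, so validity alone cannot tell them apart.
$ambiguous = "\xC3\xA9";
var_dump(hasHighBytes($ambiguous));           // bool(true)
var_dump(strlen($ambiguous));                 // int(2): strlen() counts bytes
```

A string like this passes any UTF-8 validity check, which is why the result can only ever be "seems to be UTF-8".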

    HTH
    Simon

    (Therefore, I prefer UTF-16 and/or UTF-32 over UTF-8, at least for local
    files. For transmission UTF-8 is just fine, because most characters won't
    need an extra byte. In most UTF-16 encoded documents you can be pretty
    sure about the encoding because of the enormous percentage of bytes in the
    range 0x00 to 0x0F. In almost every text at least 33% of the bytes fall in
    that range, since every character in US-ASCII and Latin-1 has 0x00 as its
    high byte, every character in Latin Extended-A and B has 0x01 or 0x02 as
    its high byte, and so on.

    0x0000 to 0x0fff contains:
    Basic Latin, Latin-1 Supplement, Latin Extended-A, Latin Extended-B, IPA
    Extensions, Spacing Modifier Letters, Combining Diacritical Marks, Greek
    and Coptic, Cyrillic, Cyrillic Supplement, Armenian, Hebrew, Arabic,
    Syriac, Thaana, Devanagari and Bengali, Gurmukhi, Gujarati, Oriya, Tamil,
    Telugu, Kannada, Malayalam, Sinhala, Thai, Lao, Tibetan
    I guess this list should cover most documents.)
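
A rough sketch of that heuristic, under the assumption that the input is the raw byte string of the file (lowByteFraction is a name invented here, and any threshold applied to its result would be a guess):

```php
<?php
// Hypothetical helper: fraction of bytes whose value lies in 0x00..0x0F.
function lowByteFraction(string $bytes): float
{
    $len = strlen($bytes);
    if ($len === 0) {
        return 0.0;
    }
    $low = 0;
    for ($i = 0; $i < $len; $i++) {
        if (ord($bytes[$i]) <= 0x0F) {
            $low++;
        }
    }
    return $low / $len;
}

// "Hi" as UTF-16BE is 00 48 00 69: half of the bytes are 0x00.
var_dump(lowByteFraction("\x00H\x00i"));  // float(0.5)
// The same text as ASCII/UTF-8 has no bytes in that range.
var_dump(lowByteFraction("Hi"));          // float(0)
```

Note that tabs and line breaks (0x09, 0x0A, 0x0D) also fall into this range, so single-byte text with many short lines will not score exactly zero.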
    --
    Simon Stienen <http://dangerouscat.net> <http://slashlife.de>
    »What you do in this world is a matter of no consequence,
    The question is, what can you make people believe that you have done.«
    -- Sherlock Holmes in "A Study in Scarlet" by Sir Arthur Conan Doyle


    • lawrence

      #3
      Re: how to test text to see if maybe it is UTF-8????

      Simon Stienen <simon.stienen@news.slashlife.de> wrote in message news:<1wi5p87hn70gq$.dlg@news.dangerouscat.net>...
      > lawrence <lkrubner@geocities.com> wrote:
      > > Someone on www.php.net suggested using a seems_utf8() method to test
      > > text for UTF-8 character encoding but didn't specify how to write such
      > > a method. Can anyone suggest a test that might work? Something that
      > > maybe gives 90% confidence that a given block of text is or is not
      > > UTF-8 encoded?
      >
      > You may be able to decide, that a given string is *not* UTF-8, but there is
      > no way to clearly decide that the string *is* UTF-8. Therefore,
      > "seems_utf8" is a good name for such a function.

      This is very good information. Thanks. It certainly points the right
      way. But how does one get the value of the characters? Using ord()???



      > Take the string. If there is no character 0x80 to 0xFF, it doesn't matter,
      > whether you define this text as UTF-8 or any ISO encoding, since the first
      > 128 characters all have the same bit sequence in these encodings.
      > However, if there actually *are* characters with a value of 128 or higher,
      > check, whether the given sequence would be a valid UTF-8 sequence (see
      > UTF-8 in Wikipedia for this). If this and every other sequence is valid
      > UTF-8, the string itself *might* be UTF-8. Of course it could be a sequence
      > of extended ASCII/ANSI characters, too. It's impossible to be sure about
      > that.

      Take the string and move it through one character at a time, perhaps
      in a for() loop, and get the byte value of each character using ord()?

      The page for ord() says ord() "Return ASCII value of character" so if
      a character is non-ASCII, perhaps it doesn't work? What PHP function
      do I use to get the hex or dec value for a character?





      > (Therefore, I prefer UTF-16 and/or UTF-32 over UTF-8... at least for local
      > files, for transmission UTF-8 is just fine because most characters won't
      > have an extra byte. In most UTF-16 encoded documents, you can be pretty
      > sure about the encoding due to the enormous percentage of 0x00 to 0x0F. In
      > almost every text you get a percentage of at least 33% of these characters,
      > since every character in US-ASCII and Latin 1 has a preceding 0x00, every
      > character in Latin Extended A and B is preceded by 0x01, and so on.

      I have the impression that UTF-16 or 32 is a bad idea in a web
      context. Some good reasons were posted here:



      The whole thread was informative.


      • Simon Stienen

        #4
        Re: how to test text to see if maybe it is UTF-8????

        lawrence <lkrubner@geocities.com> wrote:
        > Simon Stienen <simon.stienen@news.slashlife.de> wrote in message news:<1wi5p87hn70gq$.dlg@news.dangerouscat.net>...
        >> lawrence <lkrubner@geocities.com> wrote:
        >>> Someone on www.php.net suggested using a seems_utf8() method to test
        >>> text for UTF-8 character encoding but didn't specify how to write such
        >>> a method. Can anyone suggest a test that might work? Something that
        >>> maybe gives 90% confidence that a given block of text is or is not
        >>> UTF-8 encoded?
        >>
        >> You may be able to decide, that a given string is *not* UTF-8, but there is
        >> no way to clearly decide that the string *is* UTF-8. Therefore,
        >> "seems_utf8" is a good name for such a function.
        >
        > This is very good information. Thanks. It certainly points the right
        > way. But how does one get the value of the characters? Using ord()???

        ord() gives you the value of a single byte, that is 0..255; despite the
        wording in the manual it is not limited to ASCII. In UTF-8, every
        character with a code point above 127 (0x7f) is represented using at
        least two bytes:

        <http://en.wikipedia.org/wiki/Utf-8>
        | Code range (hex) | UTF-8 (binary)
        | 000000 - 00007F | 0xxxxxxx
        | 000080 - 0007FF | 110xxxxx 10xxxxxx
        | 000800 - 00FFFF | 1110xxxx 10xxxxxx 10xxxxxx
        | 010000 - 10FFFF | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

        It also states:
        | [...] number of unused bytes in a UTF-8 stream increased to 13 bytes:
        | 0xC0, 0xC1, 0xF5-0xFF
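
As an illustration of the table, the following sketch walks a string byte by byte with ord() and derives the expected sequence length from the first byte. The function name utf8SequenceLength is an assumption made for this example; the ranges come from the table and the list of unused bytes quoted above.

```php
<?php
// Hypothetical helper: expected total length of a UTF-8 sequence, judged from
// its first byte; 0 means the byte cannot start a sequence.
function utf8SequenceLength(int $lead): int
{
    if ($lead <= 0x7F) return 1;                    // 0xxxxxxx
    if ($lead >= 0xC2 && $lead <= 0xDF) return 2;   // 110xxxxx (C0, C1 unused)
    if ($lead >= 0xE0 && $lead <= 0xEF) return 3;   // 1110xxxx
    if ($lead >= 0xF0 && $lead <= 0xF4) return 4;   // 11110xxx (F5..FF unused)
    return 0;                                       // trail byte or unused value
}

// ord() looks at one byte, so walk the string byte by byte.
$euro = "\xE2\x82\xAC";                  // UTF-8 encoding of the Euro sign
for ($i = 0; $i < strlen($euro); $i++) {
    printf("byte %d: 0x%02X\n", $i, ord($euro[$i]));
}
echo utf8SequenceLength(ord($euro[0])), "\n";   // 3
```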

        Therefore, you have to find the first byte with a value of 0x80 or greater.
        Either check with ord():
        1) if (ord($string[$i]) >= 0x80) ...
        2) if (ord($string[$i]) & 0x80) ...
        Or use a regular expression:
        3) /[\x80-\xFF]/ (get the offset from preg_match using PREG_OFFSET_CAPTURE)

        Then check whether the byte may occur in UTF-8 encoded text. If it
        doesn't match any value in the list 0xC0, 0xC1, 0xF5-0xFF, it may occur.
        (You might want to do this check before finding the first byte >= 0x80,
        using a regexp or repeated substr_count.)

        If a byte may occur in a UTF-8 encoded string, this does not imply that
        it may occur at *this* position. If (ord($byte) & 0xC0) (the two
        uppermost bits) is 0x80, it is a byte that has to be in the middle or at
        the end of a Unicode character sequence. Therefore, if we find such a
        byte here, the string is not valid UTF-8.
        Otherwise, count how many consecutive most significant bits are set.
        Subtract one. This is the number of bytes that follow in this UTF-8
        character. Each of the following bytes has to satisfy
        (ord($byte) & 0xC0) == 0x80.
        If so, this is a valid UTF-8 encoded character.

        Find the next byte >=0x80 and continue checking until you either find an
        invalid value (seems_utf8 -> false) or reach the end of the string
        (seems_utf8 -> true).
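
The procedure above can be sketched roughly as follows. This is only one possible reading of the steps, not the function from www.php.net; the name seemsUtf8 is chosen here for illustration. It checks the byte structure only and does not reject overlong forms, surrogates, or code points above U+10FFFF.

```php
<?php
// Sketch of the byte-structure check; seemsUtf8 is an illustrative name.
function seemsUtf8(string $s): bool
{
    $len = strlen($s);
    for ($i = 0; $i < $len; $i++) {
        $b = ord($s[$i]);
        if ($b < 0x80) {
            continue;                       // plain ASCII byte
        }
        if ($b === 0xC0 || $b === 0xC1 || $b >= 0xF5) {
            return false;                   // never occurs in UTF-8
        }
        if (($b & 0xC0) === 0x80) {
            return false;                   // trail byte without a lead byte
        }
        // Count the leading 1 bits after the first one: that is the number
        // of trail bytes this lead byte announces.
        $trail = 0;
        for ($mask = 0x40; $b & $mask; $mask >>= 1) {
            $trail++;
        }
        for ($k = 0; $k < $trail; $k++) {
            $i++;
            if ($i >= $len || (ord($s[$i]) & 0xC0) !== 0x80) {
                return false;               // truncated or malformed sequence
            }
        }
    }
    return true;                            // nothing speaks against UTF-8
}

var_dump(seemsUtf8("this is plain ASCII"));   // bool(true)
var_dump(seemsUtf8("\xE2\x82\xAC"));          // bool(true), Euro sign
var_dump(seemsUtf8("\xE9t\xE9"));             // bool(false), Latin-1 bytes
```

A true result still only means "might be UTF-8", for the reasons given earlier in the thread.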
        > I have the impression that UTF-16 or 32 is a bad idea in a web
        > context. [...]

        As I explicitly mentioned:
        | (Therefore, I prefer UTF-16 and/or UTF-32 over UTF-8... at least for local
        | files, for transmission UTF-8 is just fine [...])


        • lawrence

          #5
          Re: how to test text to see if maybe it is UTF-8????

          Simon Stienen <simon.stienen@news.slashlife.de> wrote in message news:<1wi5p87hn70gq$.dlg@news.dangerouscat.net>...
          > How validation is done:
          > Take the string. If there is no character 0x80 to 0xFF, it doesn't matter,
          > whether you define this text as UTF-8 or any ISO encoding, since the first
          > 128 characters all have the same bit sequence in these encodings.
          > However, if there actually *are* characters with a value of 128 or higher,
          > check, whether the given sequence would be a valid UTF-8 sequence (see
          > UTF-8 in Wikipedia for this). If this and every other sequence is valid
          > UTF-8, the string itself *might* be UTF-8. Of course it could be a sequence
          > of extended ASCII/ANSI characters, too. It's impossible to be sure about
          > that.

          is there a way to figure out how many bytes a character has and the
          value of each of those bytes?


          • Andy Hassall

            #6
            Re: how to test text to see if maybe it is UTF-8????

            On 1 Oct 2004 01:12:35 -0700, lkrubner@geocities.com (lawrence) wrote:

            >Simon Stienen <simon.stienen@news.slashlife.de> wrote in message news:<1wi5p87hn70gq$.dlg@news.dangerouscat.net>...
            >> How validation is done:
            >> Take the string. If there is no character 0x80 to 0xFF, it doesn't matter,
            >> whether you define this text as UTF-8 or any ISO encoding, since the first
            >> 128 characters all have the same bit sequence in these encodings.
            >> However, if there actually *are* characters with a value of 128 or higher,
            >> check, whether the given sequence would be a valid UTF-8 sequence (see
            >> UTF-8 in Wikipedia for this). If this and every other sequence is valid
            >> UTF-8, the string itself *might* be UTF-8. Of course it could be a sequence
            >> of extended ASCII/ANSI characters, too. It's impossible to be sure about
            >> that.
            >
            >is there a way to figure out how many bytes a character has and the
            >value of each of those bytes?

            Here's my attempt at a function to determine if something is /not/ UTF-8.

            <?php
            function invalidUTF8($str)
            {
                $charSize = 0; // trail bytes still expected for the current character
                $len = strlen($str);
                for ($i = 0; $i < $len; $i++)
                {
                    $o = ord($str[$i]);
                    if ($charSize == 0)
                    {   // must be a lead byte or a single byte character
                        if ($o <= 127)                 // single byte character
                            continue;
                        elseif (($o & 0xe0) == 0xc0)   // lead byte for 2 byte char
                            $charSize = 1;
                        elseif (($o & 0xf0) == 0xe0)   // lead byte for 3 byte char
                            $charSize = 2;
                        elseif (($o & 0xf8) == 0xf0)   // lead byte for 4 byte char
                            $charSize = 3;
                        else
                        {
                            trigger_error(
                                sprintf("Malformed lead byte %08b at position %d",
                                        $o, $i)
                            );
                            return true;
                        }
                    }
                    elseif (($o & 0xc0) == 0x80)       // trail byte
                    {
                        $charSize--;
                    }
                    else
                    {
                        trigger_error(
                            sprintf("Malformed trail byte %08b at position %d",
                                    $o, $i)
                        );
                        return true;
                    }
                }
                // a multi-byte character cut off at the end of the string is invalid too
                return $charSize != 0;
            }

            var_dump(invalidUTF8("this is plain ASCII"));
            print "<hr>";

            // UTF-8 encoding of the Euro currency symbol
            var_dump(invalidUTF8(chr(226).chr(130).chr(172)));
            print "<hr>";

            // invalid UTF-8
            var_dump(invalidUTF8("xxxx" . chr(254)));
            ?>

            --
            Andy Hassall / <andy@andyh.co.uk> / <http://www.andyh.co.uk>
            <http://www.andyhsoftware.co.uk/space> Space: disk usage analysis tool


            • R. Rajesh Jeba Anbiah

              #7
              Re: how to test text to see if maybe it is UTF-8????

              Andy Hassall <andy@andyh.co.uk> wrote in message news:<6lgrl0dcd8uarjaed8jbip7qfapqqvcbru@4ax.com>...
              <snip>
              > Here's my attempt at a function to determine if something is /not/ UTF-8.
              >
              > <?php
              > function invalidUTF8($str)
              > {
              > ...
              > }



              --
              | Just another PHP saint |
              Email: rrjanbiah-at-Y!com


              • Chung Leong

                #8
                Re: how to test text to see if maybe it is UTF-8????

                "Andy Hassall" <andy@andyh.co.uk> wrote in message
                news:6lgrl0dcd8uarjaed8jbip7qfapqqvcbru@4ax.com...
                > On 1 Oct 2004 01:12:35 -0700, lkrubner@geocities.com (lawrence) wrote:
                >
                > >Simon Stienen <simon.stienen@news.slashlife.de> wrote in message news:<1wi5p87hn70gq$.dlg@news.dangerouscat.net>...
                > >> How validation is done:
                > >> Take the string. If there is no character 0x80 to 0xFF, it doesn't matter,
                > >> whether you define this text as UTF-8 or any ISO encoding, since the first
                > >> 128 characters all have the same bit sequence in these encodings.
                > >> However, if there actually *are* characters with a value of 128 or higher,
                > >> check, whether the given sequence would be a valid UTF-8 sequence (see
                > >> UTF-8 in Wikipedia for this). If this and every other sequence is valid
                > >> UTF-8, the string itself *might* be UTF-8. Of course it could be a sequence
                > >> of extended ASCII/ANSI characters, too. It's impossible to be sure about
                > >> that.
                > >
                > >is there a way to figure out how many bytes a character has and the
                > >value of each of those bytes?
                >
                > Here's my attempt at a function to determine if something is /not/ UTF-8.
                >
                > <?php
                > function invalidUTF8($str)
                > {
                > ...
                > }

                Ehh, there's, like, this thing called regular expressions :-)

                function IsUTF8($s) {
                    $s = "$s ";
                    return !preg_match('/[\xF0-\xFF]/', $s) &&
                           !preg_match('/[\xC0-\xDF][^\x80-\xBF]/', $s) &&
                           !preg_match('/[\xE0-\xEF][^\x80-\xBF][^\x80-\xBF]/', $s);
                }



                • Simon Stienen

                  #9
                  Re: how to test text to see if maybe it is UTF-8????

                  Chung Leong <chernyshevsky@hotmail.com> wrote:
                  > Ehh, there's, like, this thing call regular expression :-)
                  >
                  > function IsUTF8($s) {
                  > $s = "$s ";
                  > return !preg_match('/[\xF0-\xFF]/', $s) &&
                  > !preg_match('/[\xC0-\xDF][^\x80-\xBF]/', $s) &&
                  > !preg_match('/[\xE0-\xEF][^\x80-\xBF][^\x80-\xBF]/', $s);
                  > }

                  How about:
                  E0 00 80?
                  E0 <END>?
                  C0 <END>?
                  80 81 82?
                  Invalid UTF-8, but your function would return true for them.
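
Counterexamples like these are easy to try by feeding raw byte strings to the function. The sketch below reproduces IsUTF8() exactly as it appears in the quoted post (including the single appended space); the $cases array is just an illustration.

```php
<?php
// IsUTF8() as quoted in the thread, reproduced unchanged.
function IsUTF8($s) {
    $s = "$s ";
    return !preg_match('/[\xF0-\xFF]/', $s) &&
           !preg_match('/[\xC0-\xDF][^\x80-\xBF]/', $s) &&
           !preg_match('/[\xE0-\xEF][^\x80-\xBF][^\x80-\xBF]/', $s);
}

// Byte sequences from the list above, written with \x escapes.
$cases = [
    'E0 00 80' => "\xE0\x00\x80",
    'E0'       => "\xE0",
    'C0'       => "\xC0",
    '80 81 82' => "\x80\x81\x82",
];
foreach ($cases as $label => $bytes) {
    printf("%-8s => %s\n", $label, var_export(IsUTF8($bytes), true));
}
```

With the function as printed, E0 00 80, a lone E0, and 80 81 82 are all accepted. The lone C0 is rejected, because the appended space follows it and the second pattern matches.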

                  If you want a RegExp:
                  $isinvalidutf8 = preg_match('=([\xF0-\xFF]|'.
                      '[\xC0-\xDF]([^\x80-\xBF]|$)|'.
                      '[\xE0-\xEF][\x00-\xFF]($|[^\x80-\xBF])|'.
                      '[\xE0-\xEF]($|[^\x80-\xBF][\x00-\xFF])|'.
                      // a trail byte at the start or right after an ASCII byte
                      '(^|[\x00-\x7F])[\x80-\xBF])=', $string);
                  (untested!)
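
An alternative to listing the ways a string can be wrong is to list what is allowed and require the whole string to match. The following is a sketch of that idea, with byte ranges taken from the well-formed UTF-8 table in the Unicode standard; the function name matchesUtf8Grammar is made up here. Unlike the patterns in this thread it accepts 4-byte sequences (lead bytes F0 to F4) and rejects overlong forms and surrogates.

```php
<?php
// Illustrative name; the pattern lists every well-formed UTF-8 byte sequence.
function matchesUtf8Grammar(string $s): bool
{
    $pattern = '/\A(?:
          [\x00-\x7F]                        # ASCII
        | [\xC2-\xDF][\x80-\xBF]             # 2 bytes, no overlongs
        | \xE0[\xA0-\xBF][\x80-\xBF]         # 3 bytes, no overlongs
        | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # 3 bytes
        | \xED[\x80-\x9F][\x80-\xBF]         # 3 bytes, no surrogates
        | \xF0[\x90-\xBF][\x80-\xBF]{2}      # 4 bytes, no overlongs
        | [\xF1-\xF3][\x80-\xBF]{3}          # 4 bytes
        | \xF4[\x80-\x8F][\x80-\xBF]{2}      # 4 bytes, up to U+10FFFF
    )*+\z/x';
    return preg_match($pattern, $s) === 1;
}

var_dump(matchesUtf8Grammar("\xE2\x82\xAC"));   // bool(true), Euro sign
var_dump(matchesUtf8Grammar("\xE0\x00\x80"));   // bool(false)
```

A match still only says the bytes are well-formed UTF-8, not that the author meant them as UTF-8.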

                  Btw.: The opposite of "All birds are able to fly." is "There is at least
                  one bird which can't fly.", *not* "No bird is able to fly."
                  Likewise, the opposite of "isInvalidUTF8" is "mightBeValidUTF8", not
                  "isValidUTF8". Therefore the name you chose for your function is wrong.


                  • lawrence

                    #10
                    Re: how to test text to see if maybe it is UTF-8????

                    Andy Hassall <andy@andyh.co.uk> wrote in message news:<6lgrl0dcd8uarjaed8jbip7qfapqqvcbru@4ax.com>...
                    > On 1 Oct 2004 01:12:35 -0700, lkrubner@geocities.com (lawrence) wrote:
                    >
                    > >Simon Stienen <simon.stienen@news.slashlife.de> wrote in message news:<1wi5p87hn70gq$.dlg@news.dangerouscat.net>...
                    > >> How validation is done:
                    > >> Take the string. If there is no character 0x80 to 0xFF, it doesn't matter,
                    > >> whether you define this text as UTF-8 or any ISO encoding, since the first
                    > >> 128 characters all have the same bit sequence in these encodings.
                    > >> However, if there actually *are* characters with a value of 128 or higher,
                    > >> check, whether the given sequence would be a valid UTF-8 sequence (see
                    > >> UTF-8 in Wikipedia for this). If this and every other sequence is valid
                    > >> UTF-8, the string itself *might* be UTF-8. Of course it could be a sequence
                    > >> of extended ASCII/ANSI characters, too. It's impossible to be sure about
                    > >> that.
                    > >
                    > >is there a way to figure out how many bytes a character has and the
                    > >value of each of those bytes?
                    >
                    > Here's my attempt at a function to determine if something is /not/ UTF-8.

                    Thanks much for the code. I followed the other link to www.php.net
                    where someone had posted their seems_UTF8() function. Your function
                    and theirs combined should offer a high level of confidence about
                    whether something is UTF-8.
