checking to see if a character is UTF8

  • lkrubner@geocities.com

    checking to see if a character is UTF8


    this is a function that someone has up on www.php.net:


    function seemsUTF8($Str) {
        // bmorel at ssi dot fr
        // 17-Feb-2004 01:22
        // Here is an improved version of that function, compatible with the
        // 31-bit encoding scheme of Unicode 3.x:
        for ($i=0; $i < strlen($Str); $i++) {
            if (ord($Str[$i]) < 0x80) continue; # 0bbbbbbb
            elseif ((ord($Str[$i]) & 0xE0) == 0xC0) $n=1; # 110bbbbb
            elseif ((ord($Str[$i]) & 0xF0) == 0xE0) $n=2; # 1110bbbb
            elseif ((ord($Str[$i]) & 0xF8) == 0xF0) $n=3; # 11110bbb
            elseif ((ord($Str[$i]) & 0xFC) == 0xF8) $n=4; # 111110bb
            elseif ((ord($Str[$i]) & 0xFE) == 0xFC) $n=5; # 1111110b
            else return false; # Does not match any model
            for ($j=0; $j < $n; $j++) {
                # n bytes matching 10bbbbbb follow ?
                if ((++$i == strlen($Str)) || ((ord($Str[$i]) & 0xC0) != 0x80))
                    return false;
            }
        }
        return true;
    }



    What is achieved by the variable $n? I don't know enough about
    character codes to understand what that final inner for loop is trying
    to do.

  • Malcolm Dew-Jones

    #2
    Re: checking to see if a character is UTF8

    lkrubner@geocities.com wrote:

    : this is a function that someone has up on www.php.net:


    : function seemsUTF8($Str) {
    : // bmorel at ssi dot fr
    : //17-Feb-2004 01:22
    : //Here is an improved version of that function, compatible with the
    : //31-bit encoding scheme of Unicode 3.x:
    : for ($i=0; $i < strlen($Str); $i++) {
    : if (ord($Str[$i]) < 0x80) continue; # 0bbbbbbb
    : elseif ((ord($Str[$i]) & 0xE0) == 0xC0) $n=1; # 110bbbbb
    : elseif ((ord($Str[$i]) & 0xF0) == 0xE0) $n=2; # 1110bbbb
    : elseif ((ord($Str[$i]) & 0xF8) == 0xF0) $n=3; # 11110bbb
    : elseif ((ord($Str[$i]) & 0xFC) == 0xF8) $n=4; # 111110bb
    : elseif ((ord($Str[$i]) & 0xFE) == 0xFC) $n=5; # 1111110b
    : else return false; # Does not match any model
    : for ($j=0; $j < $n; $j++) {
    : # n bytes matching 10bbbbbb follow ?
    : if ((++$i == strlen($Str)) || ((ord($Str[$i]) & 0xC0) != 0x80))
    : return false;
    : }
    : }
    : return true;
    : }



    : What is achieved by the variable $n? I don't know enough about
    : character codes to understand what that final inner for loop is trying
    : to do.

    A UTF-8 character can take more than one byte. Characters that are larger
    (in numeric value) than 127 require more than one byte. The first byte of
    a multibyte character indicates how many bytes are in the character.

    In the older 31-bit scheme that this function checks, there can be from
    two to six bytes in total (the first byte followed by 1 to 5 more bytes).
    The current definition of UTF-8 (RFC 3629) stops at four bytes.
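
    As a rough sketch of that first-byte rule, here is the same bit test
    pulled out on its own. The name continuationCount is made up for this
    example and is not part of the quoted function; it assumes the same
    older scheme with up to six bytes per character.

```php
<?php
// Hypothetical helper (not part of the quoted function): returns how many
// continuation bytes should follow a given first byte, or -1 if the byte
// cannot start a character under the older 1-to-6-byte scheme.
function continuationCount(int $byte): int {
    if ($byte < 0x80) return 0;            // 0bbbbbbb: single-byte character
    if (($byte & 0xE0) == 0xC0) return 1;  // 110bbbbb
    if (($byte & 0xF0) == 0xE0) return 2;  // 1110bbbb
    if (($byte & 0xF8) == 0xF0) return 3;  // 11110bbb
    if (($byte & 0xFC) == 0xF8) return 4;  // 111110bb
    if (($byte & 0xFE) == 0xFC) return 5;  // 1111110b
    return -1;                             // 10bbbbbb, 0xFE or 0xFF
}

echo continuationCount(0x41), "\n"; // 'A', plain ASCII
echo continuationCount(0xC3), "\n"; // first byte of "é" (0xC3 0xA9)
echo continuationCount(0xE2), "\n"; // first byte of "€" (0xE2 0x82 0xAC)
echo continuationCount(0xA9), "\n"; // a continuation byte in first position
```

    The value returned here plays the same role as $n in the quoted
    function.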

    The outer loop is looking for the first byte of a multibyte character.
    When it finds one, it examines the bit pattern to see how many more
    bytes there are, and stores that count in $n.

    The inner loop is examining those bytes (the "more" in the above
    sentence). It runs $n times, checking that the correct number of
    continuation bytes (each of the form 10bbbbbb) follows the first byte.
    If the string ends early or a byte has a different pattern, the function
    returns false.
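
    The continuation check can also be sketched in isolation. The function
    name continuationBytesOk and its parameters are invented for this
    illustration; it only mirrors what the inner loop tests and is not a
    complete validator.

```php
<?php
// Hypothetical helper mirroring the inner loop: checks that the $n bytes
// after position $i all exist and all match the pattern 10bbbbbb.
function continuationBytesOk(string $str, int $i, int $n): bool {
    $len = strlen($str);
    for ($j = 1; $j <= $n; $j++) {
        if ($i + $j >= $len) {
            return false; // string ended before all continuation bytes
        }
        if ((ord($str[$i + $j]) & 0xC0) != 0x80) {
            return false; // byte does not look like 10bbbbbb
        }
    }
    return true;
}

var_dump(continuationBytesOk("\xC3\xA9", 0, 1));   // complete "é"
var_dump(continuationBytesOk("\xC3", 0, 1));       // truncated after first byte
var_dump(continuationBytesOk("\xC3" . "z", 0, 1)); // first byte followed by ASCII
```

    The mask 0xC0 keeps only the top two bits of the byte, so the
    comparison with 0x80 is true exactly when those bits are 10.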

    The outer loop skips over bytes that represent single byte characters.

    --

    This programmer available for rent.
