Chinese character detection

  • Wassy

    Chinese character detection

    Hi, i have a website which contains both chinese and english content
    which is stored in a database. Each record in the dB has an english
    and Chinese field. If a user enters a search string i have to be able
    to detect which characters are latin based and which are chinese
    ideographs.

    eg) a user may enter "hello ÐÂÎÅÍø world"

    this is because many Chinese search phrases (especially those involved
    with technology may include English words or acronyms) eg) I think MP3
    in Chinese is MPÈý as MP is an English acronym with the number 3 after
    it, which in chinese is Èý (i may be wrong, my written Chinese is non-
    existent :-) but that's just an example)

    to make an effective search on the Chinese field I cannot just put
    latin characters through the same search process as it would detract
    from the effectiveness of the search.

    What I need, from the search string (hello ÐÂÎÅÍø world) is a PHP
    function that will give me an array telling me if each character in
    the string is Chinese or not (i do not need to know if it is
    punctuation symbols or any other characters, just yes Chinese or no
    something else)

    all of my dB fields are UTF-8, i looked at finding out the range of
    Han characters in UTF-8 encoding but its seems very complicated. If
    anyone can help out id appreciate it.

    Regards

    Simon
  • Kristaps Kūlis

    #2
    Re: Chinese character detection

    On Oct 15, 11:45 pm, Wassy <si...@wass1.entadsl.com> wrote:
    > What I need, from the search string (hello 新闻网 world) is a PHP
    > function that will give me an array telling me if each character in
    > the string is Chinese or not
    > [...]

    Something like this:
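    function is_non_ascii($str) {
        // Walk the UTF-8 string one character (not one byte) at a time.
        $length = mb_strlen($str, 'UTF-8');
        for ($i = 0; $i < $length; ++$i) {
            $char = mb_substr($str, $i, 1, 'UTF-8');
            // A character that takes more than one byte lies outside ASCII.
            if (strlen($char) > 1)
                return true;
        }
        return false;
    }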


    • Álvaro G. Vicario

      #3
      Re: Chinese character detection

      Wassy wrote:
      > Hi, I have a website which contains both Chinese and English content
      > which is stored in a database. Each record in the DB has an English
      > and Chinese field. If a user enters a search string I have to be able
      > to detect which characters are Latin-based and which are Chinese
      > ideographs.

      Very dirty tricks I can think of:

      1. Convert the input to a non-Chinese charset and compare it back with
      the original. If they're equal, it's possibly English. You may use
      utf8_decode() or iconv().

      2. Compare the string length using a Unicode-aware function and a
      byte-only function. If they're equal, it's a single-byte string and it's
      possibly English. Try strlen() and mb_strlen().

      3. I found this in Google Code Search [1]. It's from a piece of software
      called Mushu:

      function is_chinese($str) {
          return ereg("^[" . chr(0xa1) . "-" . chr(0xff) . "]+$", $str);
      }
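
For what it's worth, the three tricks above might be sketched like this for UTF-8 input. A few assumptions of mine: the mbstring extension is loaded, the source file is saved as UTF-8, and `mb_convert_encoding()` stands in for `utf8_decode()`/`iconv()` in trick 1. The function names are invented for the example. For trick 3, `ereg()` was removed in PHP 7, and the byte range 0xA1-0xFF looks aimed at a GB2312-style encoding rather than UTF-8 (for instance 三 is E4 B8 89 in UTF-8, and 0x89 falls outside that range), so the sketch swaps in a `preg_match()` test on the Han script instead.

```php
<?php
// Trick 1: round-trip through a charset that cannot represent Chinese.
// Unconvertible characters become "?", so the round trip changes the string.
function survives_latin1(string $str): bool
{
    $latin1 = mb_convert_encoding($str, 'ISO-8859-1', 'UTF-8');
    return mb_convert_encoding($latin1, 'UTF-8', 'ISO-8859-1') === $str;
}

// Trick 2: byte length equals character length only for pure ASCII text.
function is_single_byte(string $str): bool
{
    return strlen($str) === mb_strlen($str, 'UTF-8');
}

// Trick 3, adapted for UTF-8: true when every character is in the Han script.
function is_all_han(string $str): bool
{
    return preg_match('/\A\p{Han}+\z/u', $str) === 1;
}
```

Tricks 1 and 2 only tell you that something non-Latin or non-ASCII is present, not that it is Chinese, which is why they are "possibly English" checks at best.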


      [1] http://www.google.com/codesearch


      --
      -- http://alvaro.es - Álvaro G. Vicario - Burgos, Spain
      -- My site about web programming: http://bits.demogracia.com
      -- My bain-marie humor site: http://www.demogracia.com
      --
