MySQL 5.0, FULL-TEXT Indexing and Search Arabic Data, Unicode

  • jrs_14618@yahoo.com

    MySQL 5.0, FULL-TEXT Indexing and Search Arabic Data, Unicode

    Hello All,

    This post is essentially a reply to a previous post/thread
    here on the mailing.database.myodbc group titled:

    MySQL 4.0, FULL-TEXT Indexing and Search Arabic Data, Unicode

    [This version has a couple of subtle edits from the original I posted
    on mailing.database.myodbc - I'm cross-posting here on this
    topic/subject related newsgroup]

    I was wondering if anybody has experienced the same issues
    and challenges I'm experiencing, which I'll describe shortly.
    Once resolved, some fascinating and powerful multi-lingual
    apps incorporating non-English/Latin character sets can be
    realized by many developers.

    I have a Unicode utf8 English - Arabic - Hebrew - Greek (and
    several other languages) database in Microsoft Excel. I KNOW
    that it is Unicode utf8 data because MySQL tells me it
    recognizes the encoding as such but not in the context I want.

    Allow me to explain ...

    I can search the Unicode utf8 encoding with no problem in
    Excel. While in Excel I highlight a complete word or a
    partial string of an Arabic word and copy it to the clipboard
    (i.e. memory). I then do a find and the result is just as
    successful as if it were an English string.

    MySQL 5.0 is supposed to handle Unicode utf8.

    I created a MySQL database I named: languages

    CREATE DATABASE languages ;

    and I implemented the following command on a MySQL
    command prompt:

    ALTER DATABASE languages DEFAULT CHARACTER SET utf8;

    No problem (so far): MySQL seemingly recognized utf8 and
    accepted it. My understanding is that, with the ALTER command,
    the tables I create in languages will be utf8.

    I now created a table I named mainlang which denotes it
    will be the main table for my languages.

    mysql>CREATE TABLE mainlang
    ->(
    ->langNumID varchar(30),
    ->colB varchar(30),
    ->colC varchar(30),
    ->primary key (langNumID, colB)
    ->);

    Again so far no problem: Table successfully created.
    My third column 'colC' is where the Unicode data
    will be stored.

    I now attempt to import the database from my
    Excel file into my MySQL database as follows:

    mysql>load data infile 'c:\\arabicdictionary.csv'
    ->into table mainlang
    ->fields terminated by ','
    ->lines terminated by '\n'
    ->(langNumID, colB, colC);
    ERROR 1406 (22001): Data too long for 'colC' at row 1
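
    One possible cause of the error above, which nothing in this post
    confirms, is that the CSV exported from Excel is not actually UTF-8:
    Excel's plain CSV export typically uses the system ANSI code page, and
    its "Unicode Text" export is UTF-16. A minimal Python sketch for checking
    what the file bytes really are before running LOAD DATA. The helper name
    sniff_encoding and the sample row are hypothetical, and cp1256 is only an
    assumed example of an Arabic ANSI code page:

```python
import codecs


def sniff_encoding(data: bytes) -> str:
    """Roughly classify raw file bytes (hypothetical helper, not a MySQL tool)."""
    # A UTF-16 byte-order mark means the file is not UTF-8.
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"
    try:
        data.decode("utf-8")  # strict decoding rejects invalid byte sequences
    except UnicodeDecodeError:
        return "not utf-8"
    return "utf-8"


# Made-up sample row; real use would read the CSV with open(path, "rb").read().
row = "1,4994,سلام\n"
print(sniff_encoding(row.encode("utf-8")))   # bytes that really are UTF-8
print(sniff_encoding(row.encode("cp1256")))  # Arabic ANSI code page bytes
print(sniff_encoding(row.encode("utf-16")))  # UTF-16 with a byte-order mark
```

    If the file turns out not to be UTF-8, re-saving it as UTF-8 before the
    import would be the first thing to try.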

    So what to do? I did a search and found other
    people seemingly had the same problem and someone
    suggested:

    ALTER DATABASE languages DEFAULT CHARACTER SET cp1250;

    I dropped mainlang, recreated it, redid the load and
    Lo and behold ... it seemed to work. No Data too long
    error occurred and when I did the following query:

    mysql>select langNumID, colB, colC
    ->from mainlang
    ->where colB = '4994';

    I see langNumID has a correct numeric value, colB a
    correct numeric value (4994), and colC a string of
    unintelligible characters with diacritical marks,
    umlauts, etc., which I know is the cp1250 encoding
    interpretation of the Unicode utf8 data, which is
    similarly unintelligible in its own regard.
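
    The effect described above can be reproduced outside MySQL. A minimal
    Python sketch, assuming a sample Arabic word that is not from my data
    (it was picked because all of its UTF-8 bytes happen to be defined in
    Python's cp1250 codec; other words may need errors="replace"):

```python
word = "سلام"                   # sample Arabic word (an assumption for illustration)
raw = word.encode("utf-8")      # each Arabic letter takes 2 bytes in UTF-8
garbled = raw.decode("cp1250")  # the same bytes read as cp1250 characters

print(len(word), len(raw), len(garbled))
print(garbled)

# The bytes themselves are untouched, so the misreading is reversible.
restored = garbled.encode("cp1250").decode("utf-8")
print(restored == word)
```

    So the cp1250 column is probably holding the original bytes, just
    labelled with the wrong character set.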

    Now what I try is: do a copy of the obscure colC
    cp1250 character string into the clipboard/memory
    and then do the following tweak on the original
    select statement to see if I can search on the
    (now) cp1250 character string:

    mysql>select langNumID, colB, colC
    ->from mainlang
    ->where colC = 'paste of the cp1250 character string';

    The computer would not allow a paste unless I pressed
    the escape key. On running this select command
    I got an empty set (no match).

    My questions are:

    Has anyone been successful creating a Unicode utf8
    MySQL database that accepts Arabic?

    If yes, how did you get around or not encounter the
    Data too long issue?

    Have you tried the cp1250 (or cp1251 - same mechanics,
    same results) workaround as I have? Are you
    able to search the cp1250 character string (my colC)?
    If yes, how did you successfully manage to do it?

    Lastly, if I take the cp1250 encoded string and paste
    it into Excel ... I can string search the cp1250
    encoding with no problem.

    Also, here's how I know my Unicode utf-8 data is
    correct apart from my own manual cross-referencing
    and being recognized by MySQL in some respect:

    When I copy the Unicode utf8 encoding and try to
    paste it into the select command to see what would
    happen I get the following error:

    ERROR 1257 (HY000): Illegal mix of collations
    (cp1250_general_ci, IMPLICIT) and
    (utf8_general_ci, COERCIBLE) for operation '='

    So what I have here is a situation where MySQL
    is recognizing Unicode utf8 encoding, but not
    when it comes to loading a table!
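
    The error message reads as the column side being cp1250 and the pasted
    literal arriving as utf8. A small Python sketch, outside MySQL and with a
    sample word that is only an assumption, of why the two cannot be
    reconciled for Arabic text: cp1250 is a Central European code page and
    has no Arabic letters to convert to.

```python
word = "سلام"  # sample Arabic word (an assumption for illustration)

try:
    word.encode("cp1250")
    representable = True
except UnicodeEncodeError:
    # cp1250 has no code points for Arabic letters.
    representable = False

print(representable)  # False
```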

    Go Figure ...

    Any insight or solution anyone wishes to share would
    be GREATLY appreciated. I promise if I find a solution
    I will share it.

    Thank you Very Much, Shukran Jiddan, Todah Rabah,
    Muchas Gracias ...

    Joel S
    (585) 255-0997
    jrs_14618 at yahoo.com

  • Jerry Stuckle

    #2
    Re: MySQL 5.0, FULL-TEXT Indexing and Search Arabic Data, Unicode

    No idea, Joel. Why don't you try asking in a MySQL database newsgroup, such as
    comp.databases.mysql? This newsgroup is for PHP programming.

    --
    ==================
    Remove the "x" from my email address
    Jerry Stuckle
    JDS Computer Training Corp.
    jstucklex@attglobal.net
    ==================
