How to read unicode (utf-8) / binary file line by line

**johny10151981** · Jan 22 '10, 03:56 PM

Hello,
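a few things.
1. Unicode and UTF-8 are not the same (if I am not wrong). Unicode is a character set; what Windows calls "Unicode" is UTF-16, which uses 2-byte code units. UTF-8, on the other hand, is a multibyte encoding that uses 1 to 4 bytes per character.

2. (Don't listen to me.) Looking for an easy way is not a good idea :)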

Best Regards,
JOHNY
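
To make point 1 concrete: Unicode assigns each character a code point, and UTF-8 is one way of storing that code point as 1 to 4 bytes. The following is a minimal sketch of that mapping; `utf8_encode` is a hypothetical helper name used for illustration, not a standard library function.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: encode one Unicode code point as UTF-8.
 * Writes up to 4 bytes into out and returns the byte count,
 * or 0 if cp is not a valid Unicode scalar value. */
size_t utf8_encode(uint32_t cp, unsigned char out[4])
{
    if (cp < 0x80) {                       /* 1 byte: plain ASCII */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                      /* 2 bytes */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp >= 0xD800 && cp <= 0xDFFF)      /* surrogates are not characters */
        return 0;
    if (cp < 0x10000) {                    /* 3 bytes */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {                  /* 4 bytes */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;                              /* beyond the Unicode range */
}
```

For example, U+00E9 (é) becomes the two bytes C3 A9, while U+20AC (€) takes three bytes, E2 82 AC.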

**RedSon** · Jan 22 '10, 04:37 PM

Instead of using fgets and strcmp, you are going to want to use the wide-character versions of those functions.

You will have to read the documentation of the OS/libraries you are using to find out what the wide char variants are.

If you are using Windows, a quick search on MSDN should be helpful. You can also convert between narrow (multibyte) strings and wide strings.
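
In standard C, the wide-character counterparts of `fgets` and `strcmp` are `fgetws` and `wcscmp`, both declared in `<wchar.h>`. The sketch below shows how they might be combined; `count_matching_lines` is a hypothetical helper name, and the code assumes the caller has already called `setlocale(LC_ALL, "")` under a UTF-8 locale so that `fgetws` decodes the file's bytes correctly.

```c
#include <stdio.h>
#include <wchar.h>

/* Hypothetical helper: count the lines of a text stream equal to `wanted`.
 * Assumes setlocale(LC_ALL, "") was called under a UTF-8 locale, and that
 * fp has not been used with byte-oriented functions such as fgets.
 * Lines longer than the buffer are read in pieces and will not match. */
int count_matching_lines(FILE *fp, const wchar_t *wanted)
{
    wchar_t line[256];
    int matches = 0;

    /* fgetws is the wide-character counterpart of fgets */
    while (fgetws(line, sizeof line / sizeof line[0], fp) != NULL) {
        size_t len = wcslen(line);
        if (len > 0 && line[len - 1] == L'\n')
            line[len - 1] = L'\0';          /* strip the trailing newline */

        /* wcscmp is the wide-character counterpart of strcmp */
        if (wcscmp(line, wanted) == 0)
            matches++;
    }
    return matches;
}
```

A stream should be used with either byte functions or wide functions, not both; mixing them on the same `FILE *` is not supported by the C standard.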

**freeseif** · Jan 22 '10, 09:15 PM

@JOHNY

Yes, you are right. I wanted to edit the title to remove "unicode", but I have no permission ^_^
If you have a UTF-8 project and you want to read a UTF-8 file line by line, what is the easy way you use? =)

**freeseif** · Jan 22 '10, 09:17 PM

@RedSon

After 6 months of searching in books, MSDN, documentation, the Internet and forums, I never found a solution for reading a UTF-8 file in C99! Can you help me, please? =)

**RedSon** · Jan 22 '10, 09:19 PM

Did you read the Unicode and Character Set functions on MSDN?

Unicode and Character Set Functions: http://msdn.microsoft.com/en-us/library/dd374085(VS.85).aspx

**RedSon** · Jan 22 '10, 09:22 PM

Oh wait, if you are using gcc then you are not on a Windows machine, so that MSDN link is not going to do you any good. I don't know why you are even searching MSDN, as you state in post #5.

That is why I suggested that you search your libraries and other documentation for wide string functions. Your header files that come with C99 should have something for that.

**freeseif** · Jan 22 '10, 09:35 PM

@RedSon

First, thank you. I have already read all the MSDN pages that talk about UTF-8 ^_^, and I think I need to use MultiByteToWideChar() after reading a string from the UTF-8 file, but I don't know exactly how to use it!

**freeseif** · Jan 22 '10, 09:37 PM

@RedSon

Yes, I am looking for a solution in C99 with GCC. I think I need to read the UTF-8 file in binary mode and convert UTF-8 to UTF-16, or maybe not, or do it some other way. I seriously need help =)
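
Whether a conversion is needed at all depends on what the program does with the text; printing or comparing UTF-8 lines byte for byte needs none. If UTF-16 really is required, the conversion can be written portably in C99 without any OS library. The following is a rough sketch under that assumption; `utf8_to_utf16` is a hypothetical helper name, not an existing API.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: convert n bytes of UTF-8 in src to UTF-16 code
 * units in dst (capacity cap). Returns the number of code units written,
 * or (size_t)-1 on malformed input or insufficient space. */
size_t utf8_to_utf16(const unsigned char *src, size_t n,
                     uint16_t *dst, size_t cap)
{
    size_t i = 0, out = 0;

    while (i < n) {
        uint32_t cp;
        size_t extra;                 /* continuation bytes expected */
        unsigned char c = src[i++];

        if (c < 0x80)                    { cp = c;        extra = 0; }
        else if (c >= 0xC2 && c <= 0xDF) { cp = c & 0x1F; extra = 1; }
        else if ((c & 0xF0) == 0xE0)     { cp = c & 0x0F; extra = 2; }
        else if (c >= 0xF0 && c <= 0xF4) { cp = c & 0x07; extra = 3; }
        else return (size_t)-1;       /* stray continuation or invalid lead */

        if (n - i < extra)
            return (size_t)-1;        /* truncated sequence */
        for (size_t k = 0; k < extra; k++) {
            unsigned char cc = src[i++];
            if ((cc & 0xC0) != 0x80)
                return (size_t)-1;    /* not a continuation byte */
            cp = (cp << 6) | (cc & 0x3F);
        }

        /* reject overlong forms, surrogates and out-of-range values */
        if ((extra == 2 && cp < 0x800) || (extra == 3 && cp < 0x10000) ||
            (cp >= 0xD800 && cp <= 0xDFFF) || cp > 0x10FFFF)
            return (size_t)-1;

        if (cp < 0x10000) {
            if (out + 1 > cap)
                return (size_t)-1;
            dst[out++] = (uint16_t)cp;
        } else {                      /* encode as a surrogate pair */
            if (out + 2 > cap)
                return (size_t)-1;
            cp -= 0x10000;
            dst[out++] = (uint16_t)(0xD800 | (cp >> 10));
            dst[out++] = (uint16_t)(0xDC00 | (cp & 0x3FF));
        }
    }
    return out;
}
```

The output uses the host's native `uint16_t` values, so writing it to a file as UTF-16BE or UTF-16LE would still require choosing a byte order.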

**RedSon** · Jan 22 '10, 09:38 PM

Like I said, you won't be able to use it, because you are not building a Windows application using Windows libraries.

You will need to find an appropriate library call in your headers.

**donbock** · Jan 22 '10, 10:24 PM

This link is aimed at gcc users: The Unicode HOWTO. Note that you must use glibc-2.2 or later. Consider using libutf8 (version 0.7.3 or later).

**freeseif** · Jan 25 '10, 08:26 PM

I found a solution to my problem, and I want to share it with anyone interested in reading a UTF-8 file in C99. :)

Code:

#include <stdio.h>
#include <string.h>

// Print each line of a UTF-8 file with its length in bytes.
void ReadUTF8(FILE* fp)
{
    unsigned char iobuf[255] = {0};
    while( fgets((char*)iobuf, sizeof(iobuf), fp) )
    {
            size_t len = strlen((char *)iobuf);
            // strip the trailing newline, if any
            if(len > 0 && iobuf[len-1] == '\n')
                iobuf[--len] = 0;
            // len counts bytes, not characters; %zu is the C99 format for size_t
            printf("(%zu) \"%s\"  ", len, iobuf);
            // is this an empty line?
            if( len == 0 )
                printf("Yes\n");
            else
                printf("No\n");
    }
}

void ReadUTF16BE(FILE* fp)
{
    (void)fp; // not implemented
}

void ReadUTF16LE(FILE* fp)
{
    (void)fp; // not implemented
}

int main(void)
{
    FILE* fp = fopen("test_utf8.txt", "r");
    if( fp == NULL)
    {
        perror("test_utf8.txt");
        return 1;
    }

    // see http://en.wikipedia.org/wiki/Byte-order_mark for explanation of the BOM
    // encoding
    unsigned char b[3] = {0};
    size_t n = fread(b,1,2, fp);
    if( n == 2 && b[0] == 0xEF && b[1] == 0xBB)
    {
        // expect 0xBF, the third byte of the UTF-8 BOM
        if( fread(&b[2],1,1,fp) != 1 || b[2] != 0xBF )
            rewind(fp); // not a BOM after all
        ReadUTF8(fp);
    }
    else if( n == 2 && b[0] == 0xFE && b[1] == 0xFF)
    {
        ReadUTF16BE(fp);
    }
    else if( n == 2 && b[0] == 0xFF && b[1] == 0xFE)
    {
        // the UTF-16LE BOM is FF FE
        ReadUTF16LE(fp);
    }
    else
    {
        // we don't know what kind of file it is, so assume it's standard
        // ascii with no BOM encoding
        rewind(fp);
        ReadUTF8(fp);
    }

    fclose(fp);
    return 0;
}
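
Note that `strlen` in the code above reports the number of bytes in each line, not the number of characters, and the two differ as soon as a line contains non-ASCII text. If a character count is wanted, one possible approach is to count every byte that is not a UTF-8 continuation byte. This is only a sketch: `utf8_codepoint_count` is a hypothetical helper name, and it assumes the input is already well-formed UTF-8.

```c
#include <stddef.h>

/* Hypothetical helper: count code points in a NUL-terminated UTF-8 string.
 * Assumes the input is well-formed UTF-8; it performs no validation. */
size_t utf8_codepoint_count(const char *s)
{
    size_t count = 0;
    for (; *s != '\0'; s++) {
        /* continuation bytes look like 10xxxxxx; every other byte
         * starts a new code point */
        if (((unsigned char)*s & 0xC0) != 0x80)
            count++;
    }
    return count;
}
```

For the UTF-8 bytes of "café", `strlen` returns 5 while this function returns 4.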

**RedSon** · Jan 25 '10, 08:37 PM

I would not recommend that anyone use this code:

The ISO/ANSI C standard contains, in an amendment which was added in 1995, a "wide character" type `wchar_t', a set of functions like those found in <string.h> and <ctype.h> (declared in <wchar.h> and <wctype.h>, respectively), and a set of conversion functions between `char *' and `wchar_t *' (declared in <stdlib.h>).

READ THIS!!: http://www.faqs.org/docs/Linux-HOWTO...OWTO.html#toc6
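
One of the `<stdlib.h>` conversion functions mentioned in the quote above is `mbstowcs`, which converts a `char *` multibyte string to a `wchar_t *` string according to the current locale. The sketch below shows one way it might be used; `to_wide` is a hypothetical helper name, and for UTF-8 input it assumes the program called `setlocale(LC_ALL, "")` under a UTF-8 locale at startup.

```c
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: convert a multibyte string (UTF-8 under a UTF-8
 * locale) to a newly allocated wide string. Returns NULL on invalid input
 * or allocation failure; the caller frees the result.
 * Assumes setlocale(LC_ALL, "") was called at program start. */
wchar_t *to_wide(const char *s)
{
    /* A string never has more wide characters than bytes, so
     * strlen(s) + 1 elements always leave room for the terminator. */
    size_t cap = strlen(s) + 1;
    wchar_t *w = malloc(cap * sizeof *w);
    if (w == NULL)
        return NULL;

    if (mbstowcs(w, s, cap) == (size_t)-1) {
        free(w);                    /* invalid multibyte sequence */
        return NULL;
    }
    return w;
}
```

The resulting string can then be passed to the `<wchar.h>` functions such as `wcslen` or `wcscmp`.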
