How to read unicode (utf-8) / binary file line by line

  • freeseif
    New Member
    • Jan 2010
    • 13

    Hi programmers,

    I want to read, line by line, a Unicode (UTF-8) text file created by Notepad. I don't want to display the Unicode strings on the screen; I just want to read and compare them.

    This code reads an ANSI file line by line and compares the strings.

    What I want
    • Read test_ansi.txt line by line
    • if the line = "b" print "YES!"
    • else print "NO!"


    read_ansi_line_by_line.c

    Code:
    #include <stdio.h>
    #include <string.h> /* strcmp */
    
    int main()
    {
        char *inname = "test_ansi.txt";
        FILE *infile;
        char line_buffer[BUFSIZ]; /* BUFSIZ is defined if you include stdio.h */
        int line_number; /* int, so that it matches the %d format below */
    
        infile = fopen(inname, "r");
        if (!infile) {
            printf("\nfile '%s' not found\n", inname);
            return 1;
        }
        printf("\n%s\n\n", inname);
    
        line_number = 0;
        while (fgets(line_buffer, sizeof(line_buffer), infile)) {
            ++line_number;
            /* note that the newline is in the buffer */
            if (strcmp("b\n", line_buffer) == 0 ){
                printf("%d: YES!\n", line_number);
            }else{
                printf("%d: NO!\n", line_number);
            }
        }
        printf("\n\nTotal: %d\n", line_number);
        fclose(infile);
        return 0;
    }
    test_ansi.txt

    Code:
    a
    b
    c
    Compiling

    Code:
    gcc -o read_ansi_line_by_line read_ansi_line_by_line.c
    Output

    Code:
    test_ansi.txt
    
    1: NO!
    2: YES!
    3: NO!
    
    
    Total: 3
    Now I need to read a Unicode (UTF-8) file created by Notepad. After more than 6 months I haven't found any good code or library in C that can read a file encoded in UTF-8! I don't know exactly why, but I think standard C doesn't support Unicode!

    Reading a Unicode binary file is OK, but the problem is that the file must already have been created in binary mode! That means if we want to read a Unicode (UTF-8) file created by Notepad, we need to translate it from a UTF-8 file to a BINARY file!
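
A note on the point above: UTF-8 text is just a sequence of bytes, so `fgets` and `strcmp` from standard C can read and compare it without any conversion, as long as both sides use the same encoding. The sketch below assumes the file was saved by Notepad as UTF-8, which typically means a leading byte-order mark (EF BB BF) and "\r\n" line endings. The helper names `clean_line` and `is_arabic_beh` are hypothetical, not part of any library.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical helper: trims a trailing "\n" or "\r\n" in place and returns
   a pointer just past a leading UTF-8 BOM (EF BB BF), if one is present. */
char *clean_line(char *line)
{
    assert(line != NULL);
    size_t len = strlen(line);
    while (len > 0 && (line[len - 1] == '\n' || line[len - 1] == '\r'))
        line[--len] = '\0';
    if (len >= 3 && memcmp(line, "\xEF\xBB\xBF", 3) == 0)
        line += 3;
    return line;
}

/* Returns 1 when the cleaned line is exactly U+0628 ("ب"),
   which is the two bytes D8 A8 in UTF-8. */
int is_arabic_beh(char *line)
{
    return strcmp(clean_line(line), "\xD8\xA8") == 0;
}
```

Inside the `fgets` loop of the ANSI example, calling `is_arabic_beh(line_buffer)` in place of the `strcmp("b\n", ...)` test would be one way to get the YES!/NO! output for a UTF-8 file. Writing the expected string as byte escapes avoids depending on the encoding of the C source file.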

    This code writes a Unicode string to a binary file. NOTE: the C file is encoded in UTF-8 and compiled with GCC.

    What I want
    • Write the Unicode char "ب" to test_bin.dat


    create_bin.c

    Code:
    #define UNICODE
    #ifdef UNICODE
    #define _UNICODE
    #else
    #define _MBCS
    #endif
    
    #include <stdio.h>
    #include <wchar.h>
    
    int main()
    {
         /*Data to be stored in file (the rest of the buffer is zero-filled)*/
         wchar_t line_buffer[BUFSIZ]=L"ب";
         /*Opening file for writing in binary mode*/
         FILE *outfile=fopen("test_bin.dat","wb");
         if (!outfile) {
             printf("cannot create test_bin.dat\n");
             return 1;
         }
         /*Writing a 13-byte record: the character followed by zero padding*/
         fwrite(line_buffer, 1, 13, outfile);
         /*Closing File*/
         fclose(outfile);
    
        return 0;
    }
    Compiling

    Code:
    gcc -o create_bin create_bin.c
    Output

    Code:
    create test_bin.dat


    Now I want to read the binary file line by line and compare!

    What I want
    • Read test_bin.dat line by line
    • if the line = "ب" print "YES!"
    • else print "NO!"


    read_bin_line_by_line.c

    Code:
    #define UNICODE
    #ifdef UNICODE
    #define _UNICODE
    #else
    #define _MBCS
    #endif
    
    #include <stdio.h>
    #include <wchar.h>
    
    int main()
    {
        const wchar_t *inname = L"test_bin.dat";
        FILE *infile;
        wchar_t line_buffer[BUFSIZ] = {0}; /* zeroed so a short read is still terminated */
    
        /* _wfopen is specific to the Windows C runtime (MinGW/MSVC) */
        infile = _wfopen(inname,L"rb");
        if (!infile) {
            wprintf(L"\nfile '%ls' not found\n", inname);
            return 1;
        }
        wprintf(L"\n%ls\n\n", inname);
    
        /*Reading data from file into temporary buffer*/
        while (fread(line_buffer,1,13,infile)) {
            /* each 13-byte record holds the character followed by zero padding */
            if ( wcscmp ( L"ب" , line_buffer ) == 0 ){
                 wprintf(L"YES!\n");
            }else{
                 wprintf(L"NO!\n");
            }
        }
        /*Closing File*/
        fclose(infile);
        return 0;
    }
    Compiling

    Code:
    gcc -o read_bin_line_by_line read_bin_line_by_line.c
    Output

    Code:
    test_bin.dat
    
    YES!
    THE PROBLEM

    This method is VERY LONG and NOT POWERFUL (I'm a beginner in software engineering).

    Does anyone know how to read a Unicode file? (I know it's not easy!) Does anyone know how to convert a Unicode file to a binary file? (a simple method) Does anyone know how to read a Unicode file in binary mode? (I'm not sure)

    Thank You.
  • johny10151981
    Top Contributor
    • Jan 2010
    • 1059

    #2
    Hello,
    A few things.
    1. Unicode and UTF-8 are not the same (if I am not wrong). What Windows calls "Unicode" is UTF-16, which uses 2-byte units. UTF-8, on the other hand, is a multibyte encoding in which one character takes 1 to 4 bytes.

    2. (Don't listen to me.) Looking for an easy way is not a good idea :)
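
To make the "multibyte" point concrete: in UTF-8 the first byte of a character tells you how many bytes the character occupies, and every following byte has the bit pattern 10xxxxxx. The sketch below is only an illustration; `utf8_seq_len` and `utf8_count` are hypothetical names, and the code does not validate the input.

```c
#include <assert.h>
#include <stddef.h>

/* Length in bytes of the UTF-8 sequence that starts with `lead`,
   or 0 if `lead` cannot start a sequence (e.g. a continuation byte). */
size_t utf8_seq_len(unsigned char lead)
{
    if (lead < 0x80)           return 1; /* 0xxxxxxx: ASCII        */
    if ((lead & 0xE0) == 0xC0) return 2; /* 110xxxxx               */
    if ((lead & 0xF0) == 0xE0) return 3; /* 1110xxxx               */
    if ((lead & 0xF8) == 0xF0) return 4; /* 11110xxx               */
    return 0;                            /* 10xxxxxx or invalid    */
}

/* Counts characters (code points) in a UTF-8 string by counting
   every byte that is not a 10xxxxxx continuation byte. */
size_t utf8_count(const char *s)
{
    assert(s != NULL);
    size_t count = 0;
    for (; *s != '\0'; ++s)
        if (((unsigned char)*s & 0xC0) != 0x80)
            ++count;
    return count;
}
```

For example, the string "a", "ب", "c" stored as UTF-8 is 4 bytes long for `strlen` but holds 3 characters for `utf8_count`, because "ب" (U+0628) is encoded as the two bytes D8 A8.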

    Best Regards,
    JOHNY

    • RedSon
      Recognized Expert Expert
      • Jan 2007
      • 4980

      #3
      Instead of using fgets and strcmp, you are going to want to use the wide-character versions of those functions.

      You will have to read the documentation of the OS/libraries you are using to find out what the wide char variants are.

      If you are using Windows a quick search on MSDN should be helpful. Also you can do conversions from one to the other.

      • freeseif
        New Member
        • Jan 2010
        • 13

        #4
        @JOHNY

        Yes, you are right. I wanted to edit the title to remove "unicode", but I have no permission ^_^
        If you have a UTF-8 project and you want to read a UTF-8 file line by line, what is the easy way you use? =)

        • freeseif
          New Member
          • Jan 2010
          • 13

          #5
          @RedSon

          After 6 months of searching in books, MSDN, documentation, the Internet, and forums, I never found a solution to read a UTF-8 file in C99! Can you help me please? =)

          • RedSon
            Recognized Expert Expert
            • Jan 2007
            • 4980

            #6
            Did you read the Unicode and Character Set functions on MSDN?

            • RedSon
              Recognized Expert Expert
              • Jan 2007
              • 4980

              #7
              Oh wait, if you are using gcc then you are not on a Windows machine, so that MSDN link is not going to do you any good. I don't know why you are even searching MSDN as you state in post #5.

              That is why I suggested that you search your libraries and other documentation for wide string functions. Your header files that come with C99 should have something for that.

              • freeseif
                New Member
                • Jan 2010
                • 13

                #8
                @RedSon

                First, thank you. I already read all the MSDN pages that talk about UTF-8 ^_^, and I think I need to use MultiByteToWideChar() after reading a string from the UTF-8 file, but I don't know exactly how to use it!

                • freeseif
                  New Member
                  • Jan 2010
                  • 13

                  #9
                  @RedSon

                  Yes, I'm looking for a solution in C99 with GCC. I think I need to read the UTF-8 file in binary mode and then maybe convert UTF-8 to UTF-16, or find some other way. I seriously need help =)
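
If a conversion is wanted at all, one option that needs no OS library is to decode the UTF-8 bytes into Unicode code points by hand. The following is a minimal sketch under that assumption; `utf8_decode` is a hypothetical name, not a standard function, and the result is a 32-bit code point (UTF-32) rather than UTF-16.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Decodes one code point from the first bytes of s (at most n bytes).
   Stores it in *cp and returns the number of bytes consumed,
   or returns 0 if the bytes are not a valid UTF-8 sequence. */
size_t utf8_decode(const unsigned char *s, size_t n, uint32_t *cp)
{
    assert(s != NULL && cp != NULL);
    if (n == 0)
        return 0;

    unsigned char b0 = s[0];
    size_t len;
    uint32_t value, min; /* min = smallest value allowed for this length */

    if (b0 < 0x80)                { len = 1; value = b0;        min = 0;       }
    else if ((b0 & 0xE0) == 0xC0) { len = 2; value = b0 & 0x1F; min = 0x80;    }
    else if ((b0 & 0xF0) == 0xE0) { len = 3; value = b0 & 0x0F; min = 0x800;   }
    else if ((b0 & 0xF8) == 0xF0) { len = 4; value = b0 & 0x07; min = 0x10000; }
    else return 0;                /* stray continuation byte or invalid lead */

    if (len > n)
        return 0;                 /* sequence is cut off */

    for (size_t i = 1; i < len; ++i) {
        if ((s[i] & 0xC0) != 0x80)
            return 0;             /* not a 10xxxxxx continuation byte */
        value = (value << 6) | (s[i] & 0x3F);
    }

    /* reject overlong forms, UTF-16 surrogates and values past U+10FFFF */
    if (value < min || value > 0x10FFFF || (value >= 0xD800 && value <= 0xDFFF))
        return 0;

    *cp = value;
    return len;
}
```

Decoding the two bytes D8 A8 this way yields the code point 0x0628, which is "ب". For a plain equality test between two UTF-8 strings this step is not required, since comparing the raw bytes gives the same answer.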

                  • RedSon
                    Recognized Expert Expert
                    • Jan 2007
                    • 4980

                    #10
                    Like I said, you won't be able to use it, because you are not building a windows application using windows libraries.

                    You will need to find an appropriate library call in your headers.

                    • donbock
                      Recognized Expert Top Contributor
                      • Mar 2008
                      • 2427

                      #11
                      This link is aimed at gcc users: The Unicode HOWTO. Note that you must use glibc-2.2 or later. Consider using libutf8 (version 0.7.3 or later).

                      • freeseif
                        New Member
                        • Jan 2010
                        • 13

                        #12
                        I found a solution to my problem, and I want to share it with anyone interested in reading a UTF-8 file in C99. :)

                        Code:
                        #include <stdio.h>
                        #include <string.h>
                        
                        void ReadUTF8(FILE* fp)
                        {
                            unsigned char iobuf[255] = {0};
                            while( fgets((char*)iobuf, sizeof(iobuf), fp) )
                            {
                                    size_t len = strlen((char *)iobuf);
                                    /* strip the trailing newline, except on an empty line */
                                    if(len > 1 &&  iobuf[len-1] == '\n')
                                        iobuf[len-1] = 0;
                                    len = strlen((char *)iobuf);
                                    /* len counts bytes, not characters */
                                    printf("(%lu) \"%s\"  ", (unsigned long)len, iobuf);
                                    if( iobuf[0] == '\n' )
                                        printf("Yes\n");
                                    else
                                        printf("No\n");
                            }
                        }
                        
                        void ReadUTF16BE(FILE* fp)
                        {
                        }
                        
                        void ReadUTF16LE(FILE* fp)
                        {
                        }
                        
                        int main()
                        {
                            FILE* fp = fopen("test_utf8.txt", "r");
                            if( fp != NULL)
                            {
                                // see http://en.wikipedia.org/wiki/Byte-order_mark for an explanation
                                // of the BOM encoding
                                unsigned char b[3] = {0};
                                fread(b,1,2, fp);
                                if( b[0] == 0xEF && b[1] == 0xBB)
                                {
                                    fread(b,1,1,fp); // 0xBF
                                    ReadUTF8(fp);
                                }
                                else if( b[0] == 0xFE && b[1] == 0xFF)
                                {
                                    ReadUTF16BE(fp);
                                }
                                else if( b[0] == 0xFF && b[1] == 0xFE)
                                {
                                    // the UTF-16 little-endian BOM is the bytes FF FE
                                    ReadUTF16LE(fp);
                                }
                                else
                                {
                                    // we don't know what kind of file it is, so assume it is
                                    // ASCII or UTF-8 with no BOM
                                    rewind(fp);
                                    ReadUTF8(fp);
                                }
                                fclose(fp); // only close a file that was actually opened
                            }
                        
                            return 0;
                        }
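
The BOM check can also be written against a byte buffer, which keeps it separate from the file handling. The sketch below is one possible way to do that; the `bom_kind` enum and `detect_bom` function are hypothetical names. It relies on the standard BOM byte sequences: EF BB BF for UTF-8, FE FF for UTF-16 big-endian, FF FE for UTF-16 little-endian, and 00 00 FE FF or FF FE 00 00 for UTF-32.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef enum {
    BOM_NONE, BOM_UTF8, BOM_UTF16_BE, BOM_UTF16_LE, BOM_UTF32_BE, BOM_UTF32_LE
} bom_kind;

/* Looks at the first n bytes of a file and reports which BOM, if any,
   they start with. *bom_len receives the number of bytes to skip. */
bom_kind detect_bom(const unsigned char *buf, size_t n, size_t *bom_len)
{
    assert(buf != NULL && bom_len != NULL);
    *bom_len = 0;

    /* UTF-32 first: its little-endian BOM begins with the UTF-16LE one */
    if (n >= 4 && memcmp(buf, "\x00\x00\xFE\xFF", 4) == 0) {
        *bom_len = 4; return BOM_UTF32_BE;
    }
    if (n >= 4 && memcmp(buf, "\xFF\xFE\x00\x00", 4) == 0) {
        *bom_len = 4; return BOM_UTF32_LE;
    }
    if (n >= 3 && memcmp(buf, "\xEF\xBB\xBF", 3) == 0) {
        *bom_len = 3; return BOM_UTF8;
    }
    if (n >= 2 && memcmp(buf, "\xFE\xFF", 2) == 0) {
        *bom_len = 2; return BOM_UTF16_BE;
    }
    if (n >= 2 && memcmp(buf, "\xFF\xFE", 2) == 0) {
        *bom_len = 2; return BOM_UTF16_LE;
    }
    return BOM_NONE; /* no BOM: commonly plain ASCII or UTF-8 without a BOM */
}
```

A caller would read the first four bytes of the file opened in binary mode, pass them to `detect_bom`, and then seek to offset `bom_len` before reading the text. A file with no BOM is not necessarily UTF-8, so treating `BOM_NONE` as UTF-8 remains an assumption.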

                        • RedSon
                          Recognized Expert Expert
                          • Jan 2007
                          • 4980

                          #13
                          I would not recommend that anyone use this code:

                          The ISO/ANSI C standard contains, in an amendment which was added in 1995, a "wide character" type `wchar_t', a set of functions like those found in <string.h> and <ctype.h> (declared in <wchar.h> and <wctype.h>, respectively), and a set of conversion functions between `char *' and `wchar_t *' (declared in <stdlib.h>).

                          READ THIS!!: http://www.faqs.org/docs/Linux-HOWTO...OWTO.html#toc6
