Working with binary files in C++

  • knapak

    Working with binary files in C++

    Hello

    I'm a self-taught amateur attempting to read a huge file from disk... so
    bear with me please... I just learned that reading a file in binary is
    faster than text. So I wrote the following code that compiles OK. It runs and
    shows the requested output. However, after execution, it pops up one of those
    windows to send error reports online to the program creator. I have managed
    to find where the error is but can't see what's wrong. I'm posting the whole
    code for context. I'm also marking where the problem is.

    I appreciate your assistance. Thanks

    #include <fstream>
    #include <iostream>
    #include <map>
    using namespace std;

    int main()
    {
        typedef map<int, double> IMAP;
        IMAP Grid, NewGrid;

        int IntValue1, rows = 3;
        double DouValue2;

        for(int i=0; i < rows; i++)
        {
            IntValue1 = i + 1;
            DouValue2 = i * 2;
            Grid.insert(IMAP::value_type(IntValue1, DouValue2));
        }

        IMAP::const_iterator IteratorG = Grid.begin();

        cout << "Original Map" << endl;
        while (IteratorG != Grid.end() )
        {
            cout << IteratorG->first << " " << IteratorG->second << endl;
            IteratorG ++;
        }

        ofstream FileOut("C:/MyBinary.bin", ios::binary);
        FileOut.write((char*) &Grid, sizeof Grid);
        FileOut.close();

        // ******** PROBLEM IN HERE ******************
        ifstream FileIn("C:/MyBinary.bin", ios::binary);
        FileIn.read((char*) &NewGrid, sizeof NewGrid);
        FileIn.close();
        // ******************************************

        IMAP::const_iterator NewIteratorG = NewGrid.begin();

        cout << " " << endl;
        cout << "New Map" << endl;
        while (NewIteratorG != NewGrid.end() )
        {
            cout << NewIteratorG->first << " " << NewIteratorG->second << endl;
            NewIteratorG ++;
        }

        return 0;
    }

  • Tom Widmer

    #2
    Re: Working with binary files in C++

    knapak wrote:
    > Hello
    >
    > I'm a self-taught amateur attempting to read a huge file from disk... so
    > bear with me please... I just learned that reading a file in binary is
    > faster than text.

    However, writing in binary has a lot of potential problems, the main one
    being that you can't write pointers, references or any non-POD types as
    binary.

    > So I wrote the following code that compiles OK. It runs and
    > shows the requested output. However, after execution, it pops up one of those
    > windows to send error reports online to the program creator. I have managed
    > to find where the error is but can't see what's wrong. I'm posting the whole
    > code for context. I'm also marking where the problem is.
    >
    > I appreciate your assistance. Thanks
    >
    > #include <fstream>
    > #include <iostream>
    > #include <map>
    > using namespace std;
    >
    > int main()
    > {
    >     typedef map<int, double> IMAP;
    >     IMAP Grid, NewGrid;
    >
    >     int IntValue1, rows = 3;
    >     double DouValue2;
    >
    >     for(int i=0; i < rows; i++)
    >     {
    >         IntValue1 = i + 1;
    >         DouValue2 = i * 2;
    >         Grid.insert(IMAP::value_type(IntValue1, DouValue2));
    >     }
    >
    >     IMAP::const_iterator IteratorG = Grid.begin();
    >
    >     cout << "Original Map" << endl;
    >     while (IteratorG != Grid.end() )
    >     {
    >         cout << IteratorG->first << " " << IteratorG->second << endl;
    >         IteratorG ++;

    Prefer pre-increment where possible, since it can be faster:
    ++IteratorG;

    >     }

    The above would normally be written as a for loop.
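
    As a rough sketch of that for-loop form: the helper names dumpMap and
    makeSampleGrid are made up for illustration and are not part of the
    original program. The loop header declares, tests and pre-increments the
    iterator in one place.

```cpp
#include <cassert>
#include <map>
#include <sstream>
#include <string>

typedef std::map<int, double> IMAP;

// Formats every entry as "key value\n". The iterator is declared, tested
// and advanced (with pre-increment) in the for-loop header.
std::string dumpMap(const IMAP& m)
{
    std::ostringstream out;
    for (IMAP::const_iterator it = m.begin(); it != m.end(); ++it)
        out << it->first << " " << it->second << "\n";
    return out.str();
}

// Hypothetical helper: builds the same three entries as the original post.
IMAP makeSampleGrid()
{
    IMAP grid;
    for (int i = 0; i < 3; ++i)
        grid.insert(IMAP::value_type(i + 1, i * 2));
    assert(grid.size() == 3);
    return grid;
}
```

    Calling std::cout << dumpMap(makeSampleGrid()); prints the same three
    lines as the while loop in the original post.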

    >
    >     ofstream FileOut("C:/MyBinary.bin", ios::binary);
    >     FileOut.write((char*) &Grid, sizeof Grid);

    Ok, the above just wrote out the internal structure of a map object.
    This structure probably consists of pointers out to various nodes of the
    map, such as the root, begin and end nodes, and probably a variable
    holding the size of the map. So, as a guess, the above code is writing
    out the values of three pointers to structures internal to the map, and
    not one entry stored in the map is actually written out.

    >     FileOut.close();
    >
    >     // ******** PROBLEM IN HERE ******************
    >     ifstream FileIn("C:/MyBinary.bin", ios::binary);
    >     FileIn.read((char*) &NewGrid, sizeof NewGrid);

    The above is writing over the internal pointers and size stored in the
    NewGrid object, which has undefined behaviour. You now have two
    different map objects (Grid and NewGrid) that are sharing the same
    internal data structures! This means that both Grid and NewGrid will
    attempt to destroy the same structures when they go out of scope, which
    will crash at best, and corrupt the heap in some more subtle way at worst.
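
    One way to see that the entries never reach the file is to look at what
    sizeof reports. The sketch below uses two made-up helpers (rawObjectBytes
    and makeGrid); the exact number of bytes differs between standard library
    implementations, but it does not change with the number of entries.

```cpp
#include <cassert>
#include <cstddef>
#include <map>

typedef std::map<int, double> IMAP;

// Number of bytes that write((char*) &m, sizeof m) would copy. sizeof is a
// compile-time constant of the type, so it cannot grow with the contents.
std::size_t rawObjectBytes(const IMAP& m)
{
    return sizeof m;
}

// Hypothetical helper: a map holding n entries.
IMAP makeGrid(int n)
{
    IMAP m;
    for (int i = 0; i < n; ++i)
        m[i] = i * 2.0;
    assert(m.size() == static_cast<IMAP::size_type>(n));
    return m;
}
```

    rawObjectBytes(makeGrid(0)) and rawObjectBytes(makeGrid(100000)) return
    the same value, so a file written that way has the same length whether
    the map is empty or huge.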

    In order to write out a map in either text or binary, you have to
    iterate over the elements in the map and write them out one by one. You
    are only allowed to binary read/write built in types, like int and
    double, and C style structs that have no constructor/destructor or
    private data. e.g. this is ok:

    struct A
    {
    int a;
    double b;
    };

    but std::pair (for example) is not. So to do the binary writing you need
    to get down to the level of individual keys and values. e.g.

    //write out the size, so we know how much to read in:
    IMAP::size_type size = Grid.size();
    //use reinterpret_cast to show we're doing something strange
    FileOut.write(reinterpret_cast<char const*>(&size), sizeof size);
    for (IMAP::const_iterator i = Grid.begin();
         i != Grid.end();
         ++i)
    {
        FileOut.write(
            reinterpret_cast<char const*>(&i->first),
            sizeof i->first);
        FileOut.write(
            reinterpret_cast<char const*>(&i->second),
            sizeof i->second);
    }

    Finally, read them into a new map like this:
    IMAP NewGrid;
    IMAP::size_type size;
    FileIn.read(reinterpret_cast<char*>(&size), sizeof size);
    //now we know how many entries to read
    for (IMAP::size_type i = 0; i < size; ++i)
    {
        int key;
        double value;
        FileIn.read(
            reinterpret_cast<char*>(&key),
            sizeof key);
        FileIn.read(
            reinterpret_cast<char*>(&value),
            sizeof value);
        //finally add it to the map
        NewGrid.insert(IMAP::value_type(key, value));
    }

    Hopefully, that should do it, but note that I haven't compiled or tested
    the code. As a final point, you should check the stream state after every
    call to read and write to make sure IO hasn't failed. You should also
    note that binary files written as above generally aren't portable - you
    won't be able to load the file on a PowerPC-based Mac, for example.
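
    Putting the two loops together with those state checks might look like
    the sketch below. The function names writeMap and readMap are invented
    for this example, and it works on any ostream/istream, so an ofstream
    or ifstream opened with ios::binary can be passed in directly.

```cpp
#include <cassert>
#include <istream>
#include <map>
#include <ostream>
#include <sstream>

typedef std::map<int, double> IMAP;

// Writes the entry count, then each key and value as raw bytes.
// Returns false if the stream reports a failure.
bool writeMap(const IMAP& m, std::ostream& os)
{
    IMAP::size_type size = m.size();
    os.write(reinterpret_cast<const char*>(&size), sizeof size);
    for (IMAP::const_iterator it = m.begin(); it != m.end(); ++it)
    {
        os.write(reinterpret_cast<const char*>(&it->first), sizeof it->first);
        os.write(reinterpret_cast<const char*>(&it->second), sizeof it->second);
    }
    return os.good();
}

// Reads a map written by writeMap. Returns false on a short or failed read,
// in which case m holds only the entries read before the failure.
bool readMap(IMAP& m, std::istream& is)
{
    IMAP::size_type size = 0;
    if (!is.read(reinterpret_cast<char*>(&size), sizeof size))
        return false;
    for (IMAP::size_type i = 0; i < size; ++i)
    {
        int key;
        double value;
        if (!is.read(reinterpret_cast<char*>(&key), sizeof key) ||
            !is.read(reinterpret_cast<char*>(&value), sizeof value))
            return false;
        // Keys come back in sorted order, so hint the insert at the end.
        m.insert(m.end(), IMAP::value_type(key, value));
    }
    return true;
}
```

    The file produced this way is still tied to the platform that wrote it,
    as noted above.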

    Tom

    • knapak

      #3
      Re: Working with binary files in C++

      Tom

      Thank you so much for your help, this problem was driving me nuts!!!

      A couple of things. The whole purpose of this code is to reduce the time to
      read a big data file and load it into a map or multimap to be able to quickly
      find a record in the maze of data. When I did it with text files it took a
      grueling 40 minutes to read the file... yup, only to read the file. Using
      binaries was suggested to me to reduce the reading time by loading the data
      in "one big chunk". I don't know if this is correct or not, but it certainly
      reduced the time. Now your suggestion goes back to reading one record at a
      time... mind you, your suggested code does work and takes only a few seconds
      to read the data. Still, I wonder if those few seconds could be reduced
      further, say from 8 to 4... I know I'm being ambitious, but I'd like to
      optimize this part of the program as much as possible. If not, I'll be happy
      with this solution.

      The second question is related to your comment about portability. A file
      saved as binary with this code in Windows cannot be read in UNIX? I thought
      binary files could be read anywhere... Can this problem be solved? For
      example, should I leave the data file as text (ASCII) and load it as binary
      in the same amount of time? Can then the same file be read both in Windows
      and UNIX?

      Again thanks a million for your kind assistance!

      Carlos

      • Tom Widmer

        #4
        Re: Working with binary files in C++

        knapak wrote:
        > Tom
        >
        > Thank you so much for your help, this problem was driving me nuts!!!
        >
        > A couple of things. The whole purpose of this code is to reduce the time to
        > read a big data file and load it into a map or multimap to be able to quickly
        > find a record in the maze of data. When I did it with text files it took a
        > grueling 40 minutes to read the file... yup, only to read the file. Using
        > binaries was suggested to me to reduce the reading time by loading the data
        > in "one big chunk". I don't know if this is correct or not, but it certainly
        > reduced the time.

        Unfortunately, std::map doesn't sit in memory in one large chunk - there
        is one chunk for each entry in the map, so there is no way to write out
        the map without iterating over the entries.

        > Now your suggestion goes back to reading one record at a time...
        > mind you, your suggested code does work and takes only a few seconds to read
        > the data. Still, I wonder if those few seconds could be reduced further,
        > say from 8 to 4... I know I'm being ambitious, but I'd like to optimize this
        > part of the program as much as possible. If not, I'll be happy with this
        > solution.

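        I'm sure it is possible to reduce the time further. One approach is to
        remove the calls to "read" and "write" and replace them with calls like
        this:

        FileOut.rdbuf()->sputn(same params as for write);

        FileIn.rdbuf()->sgetn(same params as for read);

        sputn/sgetn can be faster than write/read because they go straight to
        the stream buffer and skip the per-call setup and state checks that the
        stream itself does. How much you gain depends on the implementation, so
        measure it. They also skip the stream's error reporting: compare the
        count they return with the count you asked for instead of testing the
        stream state.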
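
        A minimal sketch of those two calls, wrapped in helpers whose names
        (putInt, getInt) are invented here. Because the stream buffer is used
        directly, failbit and eofbit on the stream are not updated, so each
        helper compares the returned character count with the requested one.

```cpp
#include <cassert>
#include <ios>
#include <istream>
#include <ostream>
#include <sstream>
#include <streambuf>

// Writes one int through the stream buffer. sputn returns the number of
// characters actually written, so a short count signals failure.
bool putInt(std::ostream& os, int value)
{
    const std::streamsize n = sizeof value;
    return os.rdbuf()->sputn(reinterpret_cast<const char*>(&value), n) == n;
}

// Reads one int through the stream buffer. On a short read the function
// returns false and value may be partly overwritten.
bool getInt(std::istream& is, int& value)
{
    const std::streamsize n = sizeof value;
    return is.rdbuf()->sgetn(reinterpret_cast<char*>(&value), n) == n;
}
```

        The same pattern applies to the double values and to the size field;
        only the pointer type and the byte count change.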

        Another approach is to take the map and transfer its contents to a
        vector, which can be written out in one chunk. I've posted two different
        approaches, one legal but a bit slower, the other illegal, but likely to
        work on most platforms:

        #include <algorithm>
        #include <istream>
        #include <iterator>
        #include <map>
        #include <ostream>
        #include <utility>
        #include <vector>
        using namespace std;

        typedef map<int, double> IMAP;

        //plain struct with the same members as a map entry
        struct IMAP_POD
        {
            int key;
            double value;
        };

        //converts between map entries and IMAP_POD in both directions
        struct IMAPConverter
        {
            IMAP_POD operator()(IMAP::const_reference val) const
            {
                IMAP_POD p = {val.first, val.second};
                return p;
            }

            std::pair<int, double> operator()(IMAP_POD const& val) const
            {
                return std::pair<int, double>(val.key, val.value);
            }
        };

        void writeIMAP(IMAP const& m, ostream& os)
        {
            vector<IMAP_POD> v(m.size());
            transform(m.begin(), m.end(), v.begin(), IMAPConverter());
            //write the size:
            vector<IMAP_POD>::size_type size = v.size();
            os.write(reinterpret_cast<char const*>(&size), sizeof size);
            //write the map as a single vector (&v[0] is invalid when empty):
            if (!v.empty())
                os.write(reinterpret_cast<char const*>(&v[0]), v.size() * sizeof v[0]);
        }

        void readIMAP(IMAP& m, istream& is)
        {
            vector<IMAP_POD>::size_type size = 0;
            //read the size:
            is.read(reinterpret_cast<char*>(&size), sizeof size);
            vector<IMAP_POD> v(size);
            //read the map as a single vector:
            if (!v.empty())
                is.read(reinterpret_cast<char*>(&v[0]), v.size() * sizeof v[0]);
            vector<std::pair<int, double> > typedV;
            typedV.reserve(size);
            transform(v.begin(), v.end(), back_inserter(typedV), IMAPConverter());
            //range insert for a sorted range
            //is much faster than inserting one by one
            m.insert(typedV.begin(), typedV.end());
        }


        Illegal approach:

        typedef map<int, double> IMAP;

        void writeIMAP(IMAP const& m, ostream& os)
        {
            typedef std::pair<int, double> non_const_value_type;
            vector<non_const_value_type> v;
            v.reserve(m.size());
            v.insert(v.begin(), m.begin(), m.end());
            //write the size:
            vector<non_const_value_type>::size_type size = v.size();
            os.write(reinterpret_cast<char*>(&size), sizeof size);
            //write the map as a single vector:
            os.write(reinterpret_cast<char*>(&v[0]), v.size() * sizeof v[0]);
        }

        void readIMAP(IMAP& m, istream& is)
        {
            typedef std::pair<int, double> non_const_value_type;
            vector<non_const_value_type>::size_type size;
            //read the size:
            is.read(reinterpret_cast<char*>(&size), sizeof size);
            vector<non_const_value_type> v(size);
            //read the map as a single vector:
            is.read(reinterpret_cast<char*>(&v[0]), v.size() * sizeof v[0]);
            m.insert(v.begin(), v.end());
        }

        The reason that is illegal is that you can only copy the bytes into and
        out of POD types, and std::pair<int, double> is not a POD type. However,
        pair<int, double> is close to being a POD type (it doesn't have any base
        classes or virtual functions, and the destructor is basically a no-op),
        so the above is very likely to work on every platform.
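
        With a C++11 or later compiler, the distinction can be checked at
        compile time instead of taken on trust. The sketch below assumes
        <type_traits> is available; pairIsTriviallyCopyable is a made-up
        helper, and its result is deliberately not asserted because it can
        differ between standard library implementations.

```cpp
#include <cassert>
#include <type_traits>
#include <utility>

struct IMAP_POD
{
    int key;
    double value;
};

// Compile-time guarantee that raw byte copies of IMAP_POD are well defined.
static_assert(std::is_trivially_copyable<IMAP_POD>::value,
              "IMAP_POD must be trivially copyable for raw read/write");

// Whether std::pair<int, double> passes the same check depends on the
// standard library in use, so query it rather than assume the answer.
bool pairIsTriviallyCopyable()
{
    return std::is_trivially_copyable<std::pair<int, double> >::value;
}
```

        If pairIsTriviallyCopyable() returns false on your platform, the
        IMAP_POD version is the one the language actually guarantees.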

        > The second question is related to your comment about portability. A file
        > saved as binary with this code in Windows cannot be read in UNIX?

        The problem here is the format used by the CPU and compiler to hold ints
        and doubles, and the sizes of those types. For example, some platforms
        use a big-endian 64-bit two's-complement format for "int", while Windows
        (and UNIX) compilers for x86 use a little-endian 32-bit two's-complement
        format. So the bytes written for a particular int value (such as
        1234567) can be quite different from one platform to another.

        There's a bit more on it here:


        If you want portable binary, you need to decide on exactly the binary
        format you want, and then make sure that your code writes out bytes in
        the right format (byte-order swapping and padding as necessary).
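
        As an illustration of fixing the format in code, the sketch below
        picks one possible convention (32-bit integers stored least
        significant byte first). The convention and the names putU32LE and
        getU32LE are assumptions made for this example, not an established
        file format. Doubles would need similar treatment, for instance by
        agreeing on IEEE 754 and writing the 64-bit pattern the same way.

```cpp
#include <cassert>
#include <cstdint>
#include <istream>
#include <ostream>
#include <sstream>
#include <string>

// Writes a 32-bit value as four bytes, least significant first, whatever
// the byte order of the machine running the code.
void putU32LE(std::ostream& os, std::uint32_t v)
{
    unsigned char bytes[4];
    for (int i = 0; i < 4; ++i)
        bytes[i] = static_cast<unsigned char>((v >> (8 * i)) & 0xFFu);
    os.write(reinterpret_cast<const char*>(bytes), sizeof bytes);
}

// Reads four bytes and reassembles them in the same fixed order.
// Returns false if fewer than four bytes could be read.
bool getU32LE(std::istream& is, std::uint32_t& v)
{
    unsigned char bytes[4];
    if (!is.read(reinterpret_cast<char*>(bytes), sizeof bytes))
        return false;
    v = 0;
    for (int i = 0; i < 4; ++i)
        v |= static_cast<std::uint32_t>(bytes[i]) << (8 * i);
    return true;
}
```

        Since the shifts operate on values rather than on memory, the same
        source produces the same file on big-endian and little-endian
        machines.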

        > I thought
        > binary files could be read anywhere... Can this problem be solved? For
        > example, should I leave the data file as text (ASCII) and load it as binary
        > in the same amount of time? Can then the same file be read both in Windows
        > and UNIX?

        It should be possible to optimize the code you use to read it as text so
        that it operates much faster. If you need portability, this may be the
        best option. If you want to do this, I'd suggest posting the code you
        have (in a new thread) and asking for help in speeding it up.

        Tom

        • knapak

          #5
          Re: Working with binary files in C++

          Tom

          Thanks again for your invaluable help. As for the alternative to write and
          read, yes it improved the loading time... by about 0.4 of a sec (4.2 to
          3.8)... which to me is quite good. I have to admit that your methods were
          completely unknown to me (remember I'm an amateur). I guess my only question
          would be whether there is any room for problems in using reinterpret_cast.

          As for the portability problem, you actually suggested exploring some
          alternatives among standardized binary formats, including netCDF. Actually
          I've tried using netCDF but didn't quite follow the procedures, and it is
          very difficult to find people with the expertise to provide assistance. For
          now I'll try to work with your solution, and eventually when my files get
          bigger and do require switching between Windows and UNIX I'll come back and
          ask directly if anyone knows how to work with netCDF.

          I very much appreciate the time you took to help me.

          Carlos
