Assumptions
I am assuming that you know, or are capable of looking up, the functions I describe here and have some rudimentary understanding of C++ programming.
FYI
Although I have called this article “How to Parse a File in C++”, we are actually mostly lexing a file, which means breaking a stream down into its component parts while disregarding the syntax that the stream contains. Parsing goes further and takes the syntax into account in order to make sense of the stream.
Think of lexing as reading in a bunch of words, and parsing as reading in a sentence. Each word means something, but without the context of the sentence, it doesn’t mean anything very useful.
I didn’t use the title “How to do Lexical Analysis in C++” because most of you probably don’t know what that means. If you do, then I apologise.
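To make the difference concrete, here is a minimal sketch. The function names (lexWords, isNumber, parseAssignment) and the tiny "name = number" grammar are my own inventions for illustration, not part of any library: the first function only splits the input into words, while the last one checks that the words form a sensible sentence.

```cpp
#include <cctype>
#include <sstream>
#include <string>
#include <vector>

// Lexing: split the input into whitespace-delimited tokens, ignoring syntax.
std::vector<std::string> lexWords(const std::string& text)
{
    std::istringstream in(text);
    std::vector<std::string> tokens;
    std::string token;
    while (in >> token) {
        tokens.push_back(token);
    }
    return tokens;
}

// Helper: true if the token is non-empty and consists only of decimal digits.
bool isNumber(const std::string& token)
{
    if (token.empty()) {
        return false;
    }
    for (std::string::size_type i = 0; i < token.size(); ++i) {
        if (!std::isdigit(static_cast<unsigned char>(token[i]))) {
            return false;
        }
    }
    return true;
}

// Parsing: check that the tokens form the (hypothetical) sentence
// "<name> = <number>".
bool parseAssignment(const std::vector<std::string>& tokens)
{
    return tokens.size() == 3
        && !isNumber(tokens[0])
        && tokens[1] == "="
        && isNumber(tokens[2]);
}
```

Both "x = 42" and "42 = x" lex into three tokens, but only the first one parses as an assignment under this made-up grammar.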
Introduction
Hi, last time I showed you all how to parse a file in C. In this article, I will now address how to parse a file in C++.
For those who haven’t read that article, please read its Streams and Files section, as the concepts are the same for C++ as they are for C. However, when using the C++ streams, you use cin, cout and cerr instead of stdin, stdout and stderr respectively.
Buffering and Double Buffering
Double buffering means copying data from one buffer into another prior to processing or displaying it. In C++, all of the standard stream classes work through a buffer.
Parsing a File
Parsing a file can be done quite simply using the described buffering techniques.
Parsing Without Double Buffering
Parsing a file without double buffering is not always possible. The only way to do it is to read and store only numbers.
For example, here is a sample file:
[code]
1, 2, 3, 4, 5
6, 7, 8, 9, 10
[/code]
To read that in without double buffering, you could loop around the following:
[code=cpp]
// CODE FRAGMENT 1
int itemsParsed = 0;
int items[5];
// The extraction is part of the loop test, so itemsParsed only counts
// the reads that actually succeeded.
for (itemsParsed = 0; itemsParsed < 5 && (cin >> items[itemsParsed]); ++itemsParsed) {
    if (itemsParsed != 4 && cin.peek() == ',') {
        cin.ignore(1); // clear out comma
    }
}
if (!cin.good()) {
    // check what flag was set and act appropriately
    //...
    if (!cin.eof()) {
        cin.clear(); // Clear the error flag (unless it is eof)
    }
}
[/code]
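Note that a comma is expected in the input stream directly after every number except the last, with no whitespace before it; the fragment skips the comma when it finds one but does not report an error when it is missing. There can be zero or more whitespace characters after the comma, because the extraction operator skips leading whitespace. A whitespace character can be a regular space, a tab, a vertical tab (rarely ever used), a form feed, a carriage return or a line feed.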
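The whitespace behaviour is easy to try out without a real file. The following is only a sketch under my own assumptions: readItems and readItemsFromString are hypothetical helper names, and a std::vector replaces the fixed array so the number of items read is simply its size.

```cpp
#include <cstddef>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Reads up to maxItems comma-separated integers from any input stream.
// A comma, when present, must directly follow the number; whitespace
// after the comma is skipped by the extraction operator itself.
std::vector<int> readItems(std::istream& in, std::size_t maxItems)
{
    std::vector<int> items;
    int value;
    while (items.size() < maxItems && (in >> value)) {
        items.push_back(value);
        if (in.peek() == ',') {
            in.ignore(1); // clear out comma
        }
    }
    return items;
}

// Hypothetical convenience wrapper so the loop can be tried on a string.
std::vector<int> readItemsFromString(const std::string& text, std::size_t maxItems)
{
    std::istringstream in(text);
    return readItems(in, maxItems);
}
```

Calling readItemsFromString("1,2, 3,\t4,\n5", 5) yields all five numbers, whereas "1, 2, x" stops after two because the third extraction fails.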
Code Fragment 1A is a bit simpler, as it separates the normal code flow from the exceptional one using C++ exception handling.
[code=cpp]
// CODE FRAGMENT 1A
int itemsParsed = 0;
int items[5];
cin.exceptions(ios::badbit | ios::failbit | ios::eofbit); // turn on exceptions
try {
    for (itemsParsed = 0; itemsParsed < 5; ++itemsParsed) {
        cin >> items[itemsParsed];
        if (itemsParsed != 4 && cin.peek() == ',') {
            cin.ignore(1); // clear out comma
        }
    }
}
catch (const ios_base::failure&) {
    assert(!cin.good()); // assert() requires <cassert>
    // check what flag was set and act appropriately
    //...
    if (!cin.eof()) {
        cin.clear(); // Clear the error flag (unless it is eof)
    }
}
[/code]
Both Code Fragment 1 and Code Fragment 1A are patterned after Code Fragment 1 in How to Parse a File in C. Some may argue that the C code is more readable. This may be true in some cases, but the C code lacks one thing: it is meant only for base types.
In C++, the extraction operator (‘>>’) allows you to do something different. You can overload that operator and make it read in anything you want, just as if it were part of the language. In fact, it is only a call to a function. One could do something similar in C, but it would look like a function call. Operator overloading is just syntactic sugar that makes an operator a callable function. Some say it isn’t necessary; others say that it makes the code cleaner. My opinion is that I have none. It is just another way of doing the same thing. I think the saying goes “same s**t, different shovel”. ;)
The following code fragment shows just how to use this sweetened syntax to your advantage.
[code=cpp]
#include <iostream>
#include <cassert>
using namespace std;
// CODE 1
class Point2D
{
    int x, y;
public:
    Point2D() : x(0), y(0) {}
    int getX() const { return x; }
    int getY() const { return y; }
    void setX(int x) { this->x = x; }
    void setY(int y) { this->y = y; }
};
istream& operator>>(istream& is, Point2D& point)
{
    ios_base::iostate oldIOState = is.exceptions();
    try {
        // turn on exceptions for the stream we were given (not just cin)
        is.exceptions(ios::badbit | ios::failbit | ios::eofbit);
        int val;
        is >> val;
        point.setX(val);
        if (is.peek() == ',') {
            is.ignore(1);
        }
        else {
            // With failbit in the exception mask, setstate() already throws;
            // the explicit throw after it is only a safeguard.
            is.setstate(ios_base::failbit);
            throw ios_base::failure("Missing comma separator");
        }
        is >> val;
        point.setY(val);
    }
    catch (const ios_base::failure&) {
        assert(!is.good());
        // check what flag was set and act appropriately
        //...
        is.exceptions(oldIOState); // restoring old IO exception handling
        throw; // there is no way to recover the stream without more info
    }
    is.exceptions(oldIOState); // restoring old IO exception handling
    return is;
}
int main()
{
    Point2D point;
    try {
        cin >> point;
        cout << "(" << point.getX() << ", " << point.getY() << ")" << endl;
    }
    catch (const ios_base::failure&) {
        if (cin.bad()) {
            cout << "cin bad" << endl;
        }
        if (cin.fail()) {
            cout << "cin failed" << endl;
        }
        if (cin.eof()) {
            cout << "cin hit eof" << endl;
        }
        if (!cin.eof()) {
            cin.clear(); // Clear the error flag (unless it is eof)
        }
    }
}
[/code]
What CODE 1 does is create a class and overload the extraction operator, thus allowing you to extract data from a stream and have it placed into the class. You never have to write this code again. Additionally, you can overload the insertion operator (‘<<’) so that it outputs the point the way I did, without having to write it out explicitly every time. I leave that as an exercise for the reader.
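If you want something to compare your answer against, here is one possible sketch. It is not the only way to do it. The Point2D below is a cut-down stand-in for the class in CODE 1 with a two-argument constructor that I added for convenience, and toText is a hypothetical helper used only to show the result.

```cpp
#include <ostream>
#include <sstream>
#include <string>

// Minimal stand-in for the Point2D class of CODE 1; the two-argument
// constructor is an assumption made for this sketch.
class Point2D
{
    int x, y;
public:
    Point2D(int x, int y) : x(x), y(y) {}
    int getX() const { return x; }
    int getY() const { return y; }
};

// Writes the point in the same "(x, y)" form that main() in CODE 1 prints.
std::ostream& operator<<(std::ostream& os, const Point2D& point)
{
    return os << "(" << point.getX() << ", " << point.getY() << ")";
}

// Hypothetical helper: formats a point through the overloaded operator.
std::string toText(const Point2D& point)
{
    std::ostringstream out;
    out << point;
    return out.str();
}
```

Because the operator only uses the public getters, it does not need to be a friend of the class.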
You should note that CODE 1 uses the class’s interface functions (setX and setY) to write to the class. If I wanted to couple the extraction operator closely with the class, then I may not want to read the data into a temporary variable and then copy it over to the class. That would require making the extraction operator a friend of the class. The modifications follow:
[code=cpp]
// CODE 1A
// SEE NOTE BELOW regarding next two lines
class Point2D;
istream& operator>>(istream& is, Point2D& point);
class Point2D
{
    int x, y;
    friend istream& operator>>(istream& is, Point2D& point);
public:
    Point2D() : x(0), y(0) {}
    int getX() const { return x; }
    int getY() const { return y; }
    void setX(int x) { this->x = x; }
    void setY(int y) { this->y = y; }
};
istream& operator>>(istream& is, Point2D& point)
{
    ios_base::iostate oldIOState = is.exceptions();
    try {
        // turn on exceptions for the stream we were given (not just cin)
        is.exceptions(ios::badbit | ios::failbit | ios::eofbit);
        is >> point.x; // read straight into the private members
        if (is.peek() == ',') {
            is.ignore(1);
        }
        else {
            // With failbit in the exception mask, setstate() already throws;
            // the explicit throw after it is only a safeguard.
            is.setstate(ios_base::failbit);
            throw ios_base::failure("Missing comma separator");
        }
        is >> point.y;
    }
    catch (const ios_base::failure&) {
        assert(!is.good());
        // check what flag was set and act appropriately
        //...
        is.exceptions(oldIOState); // restoring old IO exception handling
        throw; // there is no way to recover the stream without more info
    }
    is.exceptions(oldIOState); // restoring old IO exception handling
    return is;
}
[/code]
NOTE: The two declarations that follow the SEE NOTE BELOW comment at the top of CODE 1A are very important. When declaring a friend, it is safest to declare or define the function before the class in which it is declared as a friend. If you don’t, some compilers may complain about friend injection, which is a deprecated feature, or that the function was not declared.
To do regular expression parsing of a stream, you would need a regular expression library, which is beyond the scope of this document. The standard C++ streams have no equivalent of scanf’s character classes. You would have to implement them, or something like them, yourself, or download a non-standard library that someone else has created. For this, I would recommend Boost (boost.org) as a good resource. Its libraries are often presented to the C++ committee for possible inclusion in the next revision of the standard.
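As a rough idea of what implementing something like a character class yourself could look like, here is a minimal sketch built only on peek() and get(). The names readWhileIn and leadingDigits are hypothetical, and the set of allowed characters is passed as a plain string rather than a scanf-style range.

```cpp
#include <istream>
#include <sstream>
#include <string>

// Reads characters while they belong to the given set, much like scanf's
// "%[...]" conversion. Stops, without consuming, at the first character
// outside the set or at end of input.
std::string readWhileIn(std::istream& in, const std::string& allowed)
{
    std::string result;
    for (;;) {
        const int next = in.peek();
        if (next == std::char_traits<char>::eof()) {
            break; // nothing left to read
        }
        if (allowed.find(static_cast<char>(next)) == std::string::npos) {
            break; // character is not in the class; leave it in the stream
        }
        result += static_cast<char>(in.get());
    }
    return result;
}

// Hypothetical helper: returns the run of digits at the start of a string.
std::string leadingDigits(const std::string& text)
{
    std::istringstream in(text);
    return readWhileIn(in, "0123456789");
}
```

Since the first non-matching character is left in the stream, a following extraction picks up exactly where the helper stopped.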
Parsing Using Double Buffering
In the previous section, I showed how to avoid double buffering the data. There are times, however, when this is not possible.
Reading in a string or series of characters intrinsically requires the use of double buffering. The data is read into an internal buffer and then copied to your programme’s data space. You can then do with it whatever you wish, by either processing it further or displaying it.
To read in a whitespace-delimited string, you can still use the extraction operator, but use it on a string or a char array. A string takes care of allocation for you.
NOTE: Use the width() function on the input stream when using the extraction operator on a char array, or you may overrun your buffer. Its parameter includes the terminating null character. Alternatively, you can use setw(), but you must include the <iomanip> header file.
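Here is a small sketch of both options, reading from a string so it can be tried without a file. The function names and the buffer size of 8 are arbitrary choices of mine for illustration.

```cpp
#include <iomanip>
#include <sstream>
#include <string>

// Extracts one whitespace-delimited word into a std::string,
// which grows as needed, so no size has to be given.
std::string firstWord(const std::string& text)
{
    std::istringstream in(text);
    std::string word;
    in >> word;
    return word;
}

// Extracts one word into a fixed char array. setw() limits the read to
// 7 characters plus the terminating null, so longer words are cut off
// instead of overrunning the buffer.
std::string firstWordTruncated(const std::string& text)
{
    std::istringstream in(text);
    char buffer[8] = {};
    in >> std::setw(static_cast<int>(sizeof buffer)) >> buffer;
    return std::string(buffer);
}
```

With the input "extraction operator", the string version returns the whole first word, while the char array version stops after "extract". The rest of the word stays in the stream for the next read.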
To read in a line, use the getline() function from the string library (#include <string>). It too will take care of allocation for you. Alternatively, use the native istream::getline(), but you must specify the size of the buffer.
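The two ways of reading a line can be sketched as follows. The helper names and the 16-byte buffer are assumptions made for this example only.

```cpp
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Reads every line of the stream with std::getline from <string>;
// each std::string is sized automatically to fit its line.
std::vector<std::string> readLines(std::istream& in)
{
    std::vector<std::string> lines;
    std::string line;
    while (std::getline(in, line)) {
        lines.push_back(line);
    }
    return lines;
}

// Reads one line with the member istream::getline into a fixed buffer.
// At most 15 characters are stored; if the line is longer, failbit is set
// on the stream and this sketch simply returns what fitted.
std::string readLineFixed(std::istream& in)
{
    char buffer[16] = {};
    in.getline(buffer, sizeof buffer);
    return std::string(buffer);
}
```

After readLineFixed() has met a line that does not fit, the caller has to check in.fail() and call in.clear() before reading on, which is one reason the std::string version is usually the more comfortable choice.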
Parsing Using Triple Buffering
Yes, you can buffer the buffer’s buffer. Why would you want to do this? One reason I can think of is to decouple parts of your code from a stream. However, unlike in C, where you need to define your functions to take C-strings, you can call functions that already accept istreams and ostreams. To do this, you use a stringstream (bidirectional), an istringstream (input only) or an ostringstream (output only). This can simplify your design and allows you to easily debug existing systems that use C++ streams.
The following is a simple example of using a bidirectional stringstream.
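[code=cpp]
#include <sstream>
#include <iostream>
#include <iomanip>
using namespace std;
int main()
{
    stringstream ss;
    char buffer1[20] = {}, buffer2[20] = {};
    ss << "hello there";
    // setw() guards each char array against overrun; it resets after every extraction
    ss >> setw(sizeof buffer1) >> buffer1 >> setw(sizeof buffer2) >> buffer2;
    cout << buffer1 << endl;
    cout << buffer2 << endl;
}
[/code]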
Since stringstream inherits from istream and ostream, you can use it just like you would one of those classes. Further, istringstream inherits from istream and ostringstream inherits from ostream.
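That inheritance is what makes the decoupling work. As a minimal sketch (sumIntegers and sumIntegersInText are hypothetical names of my own), a function written against istream can be fed from a string just as easily as from cin or a file:

```cpp
#include <istream>
#include <sstream>
#include <string>

// Sums whitespace-separated integers from any input stream. The function
// knows nothing about where the characters come from.
int sumIntegers(std::istream& in)
{
    int total = 0;
    int value;
    while (in >> value) {
        total += value;
    }
    return total;
}

// Hypothetical helper: feeds a string to the same function that could
// equally be called with std::cin or a std::ifstream.
int sumIntegersInText(const std::string& text)
{
    std::istringstream in(text);
    return sumIntegers(in);
}
```

This is handy for debugging: you can exercise sumIntegers() with fixed strings instead of preparing input files.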
Binary Files
As in the “How to Parse a File in C” document, binary files are beyond this document’s scope. If there is enough interest, I will write about them in another document.
Conclusion
Parsing a file is not very difficult, but it is certainly different than in C. Although you can read into a char array as in C, it is not recommended unless you have good reason to do so and take appropriate precautions.
If you have any questions or find anything unclear, feel free to post a message and I will get back to you and/or update the document when I can.
Adrian
This document is protected under the Creative Commons Attribution-Noncommercial-Share Alike 3.0 License.
Revision History:
25/05/2007 11:06 ADT
- Initial Post
26/05/2007 08:19 ADT
- Used wrong tag to close code block near end. Fixed
- Bolding not working in code block, needed to use comment and reference lines instead.
- Forgot to set state and throw exception when comma not found in CODE 1 & CODE 1A. Fixed.
- Title reference to “How to Parse a File in C” was wrong, Fixed.
- Made reference to BOOST.org as a good resource for libraries.
- Tried to make Parsing Using Triple Buffering clearer and highlight the differences compared to C.
- Updated Conclusion.
29/05/2007 12:23 ADT
- Removed reference to stdio buffer and replaced with just buffer.
- Added FYI at beginning of document.