Tokenizer Function (plus rant on strtok documentation)

**jmoy** · Jul 11 '06, 05:05 AM

Re: Tokenizer Function (plus rant on strtok documentation)

Robbie Hatley wrote:

A couple of days ago I decided to force myself to really learn
exactly what "strtok" does, and how to use it.
This is how this function REALLY
works:

strtok

http://www.opengroup.org/onlinepubs/007908799/xsh/strtok_r.html

I wish more authors would cover this useful function in their
books. After all, it IS a part of both the C and C++ standard
libraries. Ok, I'm done ranting now.

strtok is one of the weird functions that maintain internal state, so
that you cannot tokenize two strings in an interleaved manner or use it
in a multithreaded program. POSIX offers a strtok_r which is somewhat
saner.
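
To make the interleaving point concrete, here is a minimal sketch (an illustration, not code from the thread) that walks two strings in alternation with strtok_r. The name interleave_tokens is made up for the example, and strtok_r is assumed to be available, which holds on POSIX systems but is not guaranteed elsewhere.

```cpp
#include <cassert>
#include <cstddef>
#include <string.h>   // strtok_r is POSIX, not standard C or C++
#include <string>
#include <vector>

// Hypothetical helper: alternate between the tokens of two strings.
// Each scan keeps its position in its own save pointer, so the two
// tokenizations do not disturb each other the way two strtok scans would.
std::vector<std::string> interleave_tokens(const std::string& a,
                                           const std::string& b,
                                           const char* delims)
{
    assert(delims != NULL);

    // strtok_r writes '\0' into its input, so work on mutable copies.
    std::vector<char> bufA(a.begin(), a.end());
    std::vector<char> bufB(b.begin(), b.end());
    bufA.push_back('\0');
    bufB.push_back('\0');

    char* saveA = NULL;
    char* saveB = NULL;
    char* tokA = strtok_r(&bufA[0], delims, &saveA);
    char* tokB = strtok_r(&bufB[0], delims, &saveB);

    std::vector<std::string> out;
    while (tokA != NULL || tokB != NULL) {
        if (tokA != NULL) {
            out.push_back(tokA);
            tokA = strtok_r(NULL, delims, &saveA);
        }
        if (tokB != NULL) {
            out.push_back(tokB);
            tokB = strtok_r(NULL, delims, &saveB);
        }
    }
    return out;
}
```

With the inputs "a b" and "1,2,3" and the delimiter set " ," this yields a, 1, b, 2, 3. The same alternation written with plain strtok would lose its place in the first string as soon as the second scan started.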

>
For your amusement, here is a function I wrote to break a string
into tokens, given a string of "separator" characters, and put
the tokens in a std::vector<std::string>. I'm sure there are
various ways this could be improved. Comments? Slings? Arrows?

void
Tokenize
(
    std::string const & RawText,
    std::string const & Delimiters,
    std::vector<std::string> & Tokens
)
{
    // Load raw text into an appropriately-sized dynamic char array:
    size_t StrSize = RawText.size();
    size_t ArraySize = StrSize + 5;
    char* Ptr = new char[ArraySize];
    memset(Ptr, 0, ArraySize);
    strncpy(Ptr, RawText.c_str(), StrSize);

    // Clear the Tokens vector:
    Tokens.clear();

    // Get the tokens from the array and put them in the vector:
    char* TokenPtr = NULL;
    char* TempPtr = Ptr;
    while (NULL != (TokenPtr = strtok(TempPtr, Delimiters.c_str())))
    {
        Tokens.push_back(std::string(TokenPtr));
        TempPtr = NULL;
    }

    // Free memory and scram:
    delete[] Ptr;
    return;
}

I guess tying the tokenizer to vector<string> is not a good idea. If it
took an output iterator it could be used with any container or even
with things like ostream_iterators. Here is my attempt, which also gets
rid of strtok:

#include <string>
using namespace std;
template <class OIter>
void tokenize( const string &str,
               const string &delim,
               OIter oi)
{
    typedef string::size_type Sz;

    Sz begin=0;
    while(begin<str.size()){
        Sz end=str.find_first_of(delim,begin);
        *oi++=str.substr(begin,end-begin);
        begin=str.find_first_not_of(delim,end);
    }
}

I use find_first_not_of in order to be compatible with strtok's
behaviour of treating multiple adjacent delimiters as a single
delimiter. I have not measured the performance of this version against
the strtok version.

**sonison.james@gmail.com** · Jul 11 '06, 05:55 AM

Check out http://www.boost.org/libs/tokenizer/index.html
One cool thing about the Boost tokenizer is that you can get empty
tokens if you have adjacent separators, which I believe can't be
handled by strtok.
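
The difference is easy to see side by side. The sketch below is not Boost's API: split_keep_empty is a hypothetical, standard-library-only helper that reports an empty token between adjacent separators, and count_strtok_tokens (also made up for this example) shows strtok silently skipping them.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>
#include <vector>

// Hypothetical helper: every separator ends a token, so adjacent
// separators produce an empty token between them.
std::vector<std::string> split_keep_empty(const std::string& text,
                                          const std::string& delims)
{
    std::vector<std::string> out;
    std::string::size_type begin = 0;
    for (;;) {
        std::string::size_type end = text.find_first_of(delims, begin);
        if (end == std::string::npos) {
            out.push_back(text.substr(begin));   // last (possibly empty) token
            break;
        }
        out.push_back(text.substr(begin, end - begin));
        begin = end + 1;                          // step over one separator only
    }
    return out;
}

// For comparison: count the tokens strtok reports for the same input.
std::size_t count_strtok_tokens(const std::string& text, const char* delims)
{
    // strtok modifies its input, so give it a writable, terminated copy.
    std::vector<char> buf(text.begin(), text.end());
    buf.push_back('\0');

    std::size_t n = 0;
    for (char* t = std::strtok(&buf[0], delims); t != NULL;
         t = std::strtok(NULL, delims))
        ++n;
    return n;
}

void demo_empty_tokens()
{
    // "a,,b": three fields when empties are kept, two tokens from strtok.
    assert(split_keep_empty("a,,b", ",").size() == 3);
    assert(split_keep_empty("a,,b", ",")[1].empty());
    assert(count_strtok_tokens("a,,b", ",") == 2);
}
```

Keeping empty tokens matters for formats such as comma-separated records, where the position of a field carries meaning.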

Thanks and regards
SJ

**Mehturt@gmail.com** · Jul 11 '06, 05:55 AM

jmoy wrote:

#include <string>
using namespace std;
template <class OIter>
void tokenize( const string &str,
               const string &delim,
               OIter oi)
{
    typedef string::size_type Sz;

    Sz begin=0;
    while(begin<str.size()){
        Sz end=str.find_first_of(delim,begin);
        *oi++=str.substr(begin,end-begin);
        begin=str.find_first_not_of(delim,end);
    }
}

I like this implementation, but don't you assume the space for data
(tokens) is already pre-allocated?
If I use your function with something like this, I get a segmentation fault:

std::vector<std::string> v;
std::vector<std::string>::iterator it = v.begin();
tokenize<std::vector<std::string>::iterator>("a b c", " ", it);


**Robbie Hatley** · Jul 11 '06, 06:25 AM

"jmoy" <jmoy.matecon@gmail.com> wrote:

strtok is one of the weird functions that maintain internal state, so
that you cannot tokenize two strings in an interleaved manner or use it
in a multithreaded program. POSIX offers a strtok_r which is somewhat
saner.

Ah, sort of like the code my ex-boss left me to maintain after he
got fired. Hundreds of global variables, which he uses to pass
data from function to function, like a dumbass. Of course, since
the program is a complex windows app with timers and interrupts,
the data often gets over-written on its way from one place to
another. ::sigh:: Global variables are the work of Sauron.

I guess tying the tokenizer to vector<string> is not a good idea.

It does limit the user to a std::vector<std::string>, yes. However,
that construct is pretty good for this app. I find it hard to
think of cases which couldn't use that to hold a bunch of tokens.

If it took an output iterator it could be used with any container
or even with things like ostream_iterators.

Provided that the output container was big enough. If you start
with an empty container and try writing to it using output
iterators, you'll get an "illegal memory access" or "general
protection fault" or some such thing. So you'd have to make sure
that the container was huge. I don't like that approach.

#include <string>
using namespace std;
template <class OIter>
void tokenize( const string &str,
               const string &delim,
               OIter oi)
{
    typedef string::size_type Sz;

    Sz begin=0;
    while(begin<str.size()){
        Sz end=str.find_first_of(delim,begin);
        *oi++=str.substr(begin,end-begin);
        begin=str.find_first_not_of(delim,end);
    }
}

I use find_first_not_of in order to be compatible with strtok's
behaviour of treating multiple adjacent delimiters as a single
delimiter. I have not measured the performance of this version against
the strtok version.

Alluring in its simplicity, yes. But it has two major bugs:

1. Memory corruption danger if used to write to a small container.
2. You don't take into account the fact that the string might START
with one or more delimiters.

Maybe something like THIS might be better:

#include <string>
// using namespace std; // Ewww.
template <class Container>
void
tokenize
(
    const std::string & str,
    const std::string & delim,
    Container & C
)
{
    typedef std::string::size_type Sz;
    Sz begin = 0;
    Sz end = 0;
    while (begin < str.size())
    {
        begin = str.find_first_not_of (delim, begin);
        end = str.find_first_of (delim, begin);
        C.push_back(str.substr (begin, end-begin));
    }
}

I haven't tested that, but I think something like that would work
better. It does require that the container for the tokens have
the push_back() method defined. Other than that, it's pretty
generic.

Note that to take care of the "starts with delimiters" case,
I simply moved your "first_not_of" up to the top of the loop.
That should work nicely.

--
Cheers,
Robbie Hatley
East Tustin, CA, USA
lone wolf intj at pac bell dot net
(put "[usenet]" in subject to bypass spam filter)

http://home.pacbell.net/earnur/

**Jerry Coffin** · Jul 11 '06, 06:35 AM

In article <zBHsg.129441$dW3.67625@newssvr21.news.prodigy.com>,
bogus.address@no.spam says...

[ ... ]

Maybe something like THIS might be better:

#include <string>
// using namespace std; // Ewww.
template <class Container>
void
tokenize
(
    const std::string & str,
    const std::string & delim,
    Container & C
)
{
    typedef std::string::size_type Sz;
    Sz begin = 0;
    Sz end = 0;
    while (begin < str.size())
    {
        begin = str.find_first_not_of (delim, begin);
        end = str.find_first_of (delim, begin);
        C.push_back(str.substr (begin, end-begin));
    }
}

IMO, this is a poor idea. Take an iterator for the output. If the
user wants the data pushed onto the back, they can use back_inserter
to get that. If they want it inserted into something like a set, they
can use inserter to get that.
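
As a rough illustration of that point, the sketch below feeds one output-iterator tokenizer into two different containers. tokenize_to is a hypothetical stand-in written only for this example (it is not any poster's exact code); std::back_inserter and std::inserter are the standard adaptors from <iterator>.

```cpp
#include <cassert>
#include <iterator>
#include <set>
#include <string>
#include <vector>

// Hypothetical output-iterator tokenizer, used only for this illustration.
template <class OIter>
void tokenize_to(const std::string& str, const std::string& delim, OIter out)
{
    std::string::size_type end = 0;
    for (;;) {
        std::string::size_type begin = str.find_first_not_of(delim, end);
        if (begin == std::string::npos)
            break;                                // only delimiters remain
        end = str.find_first_of(delim, begin);
        *out++ = str.substr(begin, end - begin);  // the adaptor decides where it goes
    }
}

void demo_inserters()
{
    // back_inserter turns every write into v.push_back(...), so an
    // initially empty vector simply grows.
    std::vector<std::string> v;
    tokenize_to("b a b", " ", std::back_inserter(v));
    assert(v.size() == 3 && v[0] == "b" && v[1] == "a" && v[2] == "b");

    // inserter turns every write into s.insert(...); the set keeps
    // one copy of each distinct token.
    std::set<std::string> s;
    tokenize_to("b a b", " ", std::inserter(s, s.end()));
    assert(s.size() == 2);
}
```

The tokenizer itself never mentions a container; the caller picks the destination by choosing the adaptor.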

--
Later,
Jerry.

The universe is a figment of its own imagination.

**Robbie Hatley** · Jul 11 '06, 07:05 AM

"Jerry Coffin" <jcoffin@taeus.com> wrote:

IMO, this is a poor idea. Take an iterator for the output.

Puts extreme burden on the user to provide the right kind of
container and iterator. Such a function would often get mis-used
and cause memory corruption and program crashes.

If the user wants the data pushed onto the back, they can
use back_inserter to get that.

If they know any better.

If they want it inserted into something like a set, they
can use inserter to get that.

If they know that they should, and if they know how.

So it really depends on which kind of function one wants to write:

1. Something efficient but dangerous, that requires having and
reading and understanding some external documentation to use it
correctly.

or

2. Something safe and easy and self-documenting, but a bit limited.

I can see use for both, actually. But the iterator version will
always be the more dangerous one.

--
Cheers,
Robbie Hatley
East Tustin, CA, USA
lone wolf intj at pac bell dot net
(put "[usenet]" in subject to bypass spam filter)

http://home.pacbell.net/earnur/

**Jerry Coffin** · Jul 11 '06, 07:25 AM

In article <I5Isg.129443$dW3.1302@newssvr21.news.prodigy.com>,
bogus.address@no.spam says...

"Jerry Coffin" <jcoffin@taeus.com> wrote:

IMO, this is a poor idea. Take an iterator for the output.

Puts extreme burden on the user to provide the right kind of
container and iterator.

IMO, it's not extreme at all. They're going to have to provide the
right kind of container in any case -- but the code you provided will
often _prevent_ them from using the right container. Just for
example, putting the output into a set might well make sense -- but
your code simply won't work with it at all.

[ ... ]

I can see use for both, actually. But the iterator version will
always be the more dangerous one.

The iterator version is the only one that really works. In any case,
for a programmer to become at all proficient in using C++, they need
to learn how to do this anyway -- look through most of the algorithms
in the standard library, and note that they also take an iterator to
tell them where to put the results -- with precisely the same result.
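
For readers who have not met the pattern, here is a small sketch using std::copy, one of the standard algorithms in question. The function name demo_output_iterators and the sample words are invented for the example.

```cpp
#include <algorithm>
#include <cassert>
#include <iterator>
#include <sstream>
#include <string>
#include <vector>

void demo_output_iterators()
{
    const char* words[] = { "alpha", "beta", "gamma" };

    // std::copy writes wherever the output iterator points. back_inserter
    // turns each write into a push_back, so the empty vector grows.
    std::vector<std::string> v;
    std::copy(words, words + 3, std::back_inserter(v));
    assert(v.size() == 3 && v[1] == "beta");

    // Same algorithm, different destination: a stream, with "," written
    // after every element.
    std::ostringstream os;
    std::copy(v.begin(), v.end(),
              std::ostream_iterator<std::string>(os, ","));
    assert(os.str() == "alpha,beta,gamma,");
}
```

Passing v.begin() of an empty vector as the destination instead would be undefined behaviour for std::copy as well, which is the "same result" referred to above.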

--
Later,
Jerry.

The universe is a figment of its own imagination.

**Alex Vinokur** · Jul 11 '06, 08:25 AM

[snip]
sonison.james@gmail.com wrote:

check out http://www.boost.org/libs/tokenizer/index.html

[snip]

Also "Splitting string into vector of vectors":

http://groups.google.com/group/sources/msg/77993fb8841382c8

http://groups.google.com/group/perfo/msg/9d49a1be3a5c6335

http://groups.google.com/group/perfo/msg/8273f4d1a05cfbd1

Alex Vinokur
email: alex DOT vinokur AT gmail DOT com

http://mathforum.org/library/view/10978.html

alexvn / Profile

http://sourceforge.net/users/alexvn

**Alex Vinokur** · Jul 11 '06, 08:45 AM

Alex Vinokur wrote:
[snip]

Also "Splitting string into vector of vectors":

http://groups.google.com/group/sources/msg/77993fb8841382c8

http://groups.google.com/group/perfo...49a1be3a5c6335

--------------------------------------------------
Instead of

http://groups.google.com/group/perfo...73f4d1a05cfbd1

should be

http://groups.google.com/group/perfo/msg/f3c775cf7e3cdcf0

Sorry
--------------------------------------------------
[snip]

Alex Vinokur
email: alex DOT vinokur AT gmail DOT com

http://mathforum.org/library/view/10978.html

alexvn / Profile

http://sourceforge.net/users/alexvn

**jmoy** · Jul 11 '06, 12:45 PM

Robbie Hatley wrote:

"jmoy" <jmoy.matecon@gmail.com> wrote:

#include <string>
using namespace std;

template <class OIter>
void tokenize( const string &str,
               const string &delim,
               OIter oi)
{
    typedef string::size_type Sz;

    Sz begin=0;
    while(begin<str.size()){
        Sz end=str.find_first_of(delim,begin);
        *oi++=str.substr(begin,end-begin);
        begin=str.find_first_not_of(delim,end);
    }
}

...

Alluring in its simplicity, yes. But it has two major bugs:

1. Memory corruption danger if used to write to a small container.

No. As mentioned by other posters, if you are adding tokens to a
container the right thing is to call the function with something like a
back_inserter, in which case there is no memory corruption.

2. You don't take into account the fact that the string might START
with one or more delimiters.

You are right. My mistake.

Maybe something like THIS might be better:

#include <string>
// using namespace std; // Ewww.
template <class Container>
void
tokenize
(
    const std::string & str,
    const std::string & delim,
    Container & C
)
{
    typedef std::string::size_type Sz;
    Sz begin = 0;
    Sz end = 0;
    while (begin < str.size())
    {
        begin = str.find_first_not_of (delim, begin);
        end = str.find_first_of (delim, begin);
        C.push_back(str.substr (begin, end-begin));
    }
}

The problem with this is that it fails for the reverse case of a string
ending with a delimiter. Also, I don't like the idea of tying
algorithms with containers---iterators are a much more general concept.
Here is a corrected version of my function:

template <class OIter>
void tokenize( const string &str,
               const string &delim,
               OIter oi)
{
    typedef string::size_type Sz;

    Sz end=0;
    for(;;){
        Sz begin=str.find_first_not_of(delim,end);
        if (begin==string::npos)
            break;
        end=str.find_first_of(delim,begin);
        *oi++=str.substr(begin,end-begin);
    }
}
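
A quick way to check the edge cases discussed in this thread is to run the corrected loop over strings with leading, trailing and repeated delimiters. The sketch below restates the loop with explicit std:: qualification so that it stands alone; the split wrapper and demo_edge_cases are hypothetical helpers added only for the check.

```cpp
#include <cassert>
#include <iterator>
#include <string>
#include <vector>

template <class OIter>
void tokenize(const std::string& str, const std::string& delim, OIter oi)
{
    typedef std::string::size_type Sz;

    Sz end = 0;
    for (;;) {
        // Skip any run of delimiters; npos means nothing but delimiters is left.
        Sz begin = str.find_first_not_of(delim, end);
        if (begin == std::string::npos)
            break;
        // end may be npos for the last token; substr clamps the length.
        end = str.find_first_of(delim, begin);
        *oi++ = str.substr(begin, end - begin);
    }
}

// Hypothetical convenience wrapper for the checks below.
std::vector<std::string> split(const std::string& s, const std::string& d)
{
    std::vector<std::string> out;
    tokenize(s, d, std::back_inserter(out));
    return out;
}

void demo_edge_cases()
{
    assert(split("  a b  ", " ").size() == 2);   // leading and trailing delimiters
    assert(split("a,,b", ",").size() == 2);      // adjacent delimiters collapse
    assert(split("   ", " ").empty());           // delimiters only
    assert(split("", " ").empty());              // empty input
}
```

The early break on npos is what handles both the leading-delimiter and the trailing-delimiter cases.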

**Jerry Coffin** · Jul 11 '06, 03:25 PM

In article <1152622377.108600.157680@m79g2000cwm.googlegroups.com>,
jmoy.matecon@gmail.com says...

[ ... ]

template <class OIter>
void tokenize( const string &str,
               const string &delim,
               OIter oi)
{
    typedef string::size_type Sz;

    Sz end=0;
    for(;;){
        Sz begin=str.find_first_not_of(delim,end);
        if (begin==string::npos)
            break;
        end=str.find_first_of(delim,begin);
        *oi++=str.substr(begin,end-begin);
    }
}

I think I'd also make the character type a template parameter:

template <class charT, class OIter>
void tokenize( basic_string<charT> input,
               basic_string<charT> delim,
               OIter oi)
{
    typedef basic_string<charT> str;
    typedef typename str::size_type Sz;

    Sz end = 0;
    for (;;) {
        Sz begin = input.find_first_not_of(delim, end);
        if (str::npos == begin)
            break;
        end = input.find_first_of(delim, begin);
        *oi++ = input.substr(begin, end-begin);
    }
}

This way, the code can work with strings of either narrow or wide
characters.
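
A short usage sketch of the character-type-generic idea follows. It assumes a variant that takes its strings by const reference; tokenize_chars and demo_wide_and_narrow are hypothetical names for this example. Note that string literals have to be wrapped in std::string or std::wstring at the call site, because charT cannot be deduced from a raw literal.

```cpp
#include <cassert>
#include <iterator>
#include <string>
#include <vector>

// Hypothetical variant of the character-type-generic tokenizer.
template <class charT, class OIter>
void tokenize_chars(const std::basic_string<charT>& input,
                    const std::basic_string<charT>& delim,
                    OIter oi)
{
    typedef std::basic_string<charT> str;
    typedef typename str::size_type Sz;

    Sz end = 0;
    for (;;) {
        Sz begin = input.find_first_not_of(delim, end);
        if (str::npos == begin)
            break;
        end = input.find_first_of(delim, begin);
        *oi++ = input.substr(begin, end - begin);
    }
}

void demo_wide_and_narrow()
{
    // Wide characters: charT is deduced as wchar_t.
    std::vector<std::wstring> wide;
    tokenize_chars(std::wstring(L"uno dos"), std::wstring(L" "),
                   std::back_inserter(wide));
    assert(wide.size() == 2 && wide[1] == L"dos");

    // Narrow characters: the same template, charT deduced as char.
    std::vector<std::string> narrow;
    tokenize_chars(std::string("a:b"), std::string(":"),
                   std::back_inserter(narrow));
    assert(narrow.size() == 2 && narrow[0] == "a");
}
```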

--
Later,
Jerry.

The universe is a figment of its own imagination.

**Noah Roberts** · Jul 11 '06, 04:25 PM

Robbie Hatley wrote:

A couple of days ago I decided to force myself to really learn
exactly what "strtok" does, and how to use it.

Every once in a while I also have this masochistic urge to torture
myself for no reason. Eventually though I tire of it and move on.

The strtok function is useless. It is insecure, works on insecure
types, and is a general PITA to use. There are better options that are
much easier to use and have type safety. Look into string streams as a
much better alternative to strtok.
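
The post does not show what a string-stream tokenizer would look like, so the following is only a guess at the kind of thing meant. split_whitespace and split_on are hypothetical helper names; std::istringstream, operator>> and std::getline are the standard facilities they rely on.

```cpp
#include <cassert>
#include <sstream>
#include <string>
#include <vector>

// Whitespace-separated words: operator>> skips any run of whitespace,
// which matches strtok's habit of collapsing adjacent delimiters.
std::vector<std::string> split_whitespace(const std::string& text)
{
    std::istringstream in(text);
    std::vector<std::string> out;
    std::string word;
    while (in >> word)
        out.push_back(word);
    return out;
}

// Fields separated by one specific character: std::getline stops at each
// delimiter, so adjacent delimiters give an empty field.
std::vector<std::string> split_on(const std::string& text, char delim)
{
    std::istringstream in(text);
    std::vector<std::string> out;
    std::string field;
    while (std::getline(in, field, delim))
        out.push_back(field);
    return out;
}

void demo_stringstreams()
{
    std::vector<std::string> words = split_whitespace("  the quick\tfox ");
    assert(words.size() == 3 && words[2] == "fox");

    std::vector<std::string> fields = split_on("a,,b", ',');
    assert(fields.size() == 3 && fields[1].empty());
}
```

Neither helper modifies its input or keeps hidden state between calls. The limitation is that operator>> only splits on whitespace and std::getline only on a single character, not on an arbitrary delimiter set.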

**Roland Pibinger** · Jul 11 '06, 06:55 PM

On Tue, 11 Jul 2006 07:06:16 GMT, "Robbie Hatley"
<bogus.address@no.spam> wrote:

So it really depends on which kind of function one wants to write:

1. Something efficient but dangerous, that requires having and
reading and understanding some external documentation to use it
correctly.

or

2. Something safe and easy and self-documenting, but a bit limited.

The following tokenize function, derived from John Potter's
implementation
(http://groups.google.com/group/comp....daafacd01ce26),
is IMO safe (no output iterators) and efficient (no dynamic
allocation, no substr). Usable for all STL-like containers (except
map) with bidirectional iterators:

#include <algorithm>

template <typename StringT, typename ContainerT>
size_t tokenize (const StringT& text, const StringT& delim,
                 ContainerT& result) {
    size_t num = 0;
    typename StringT::size_type b = text.find_first_not_of(delim);
    while (b != StringT::npos) {
        typename StringT::size_type e(text.find_first_of(delim, b));
        StringT s (text.c_str() + b, e - b);
        result.insert (result.end(), StringT());
        typename ContainerT::iterator iter = result.end();
        (*--iter).swap (s);
        ++num;
        b = text.find_first_not_of(delim, std::min(e, text.size()));
    }
    return num;
}

With std::vector as the result container, efficiency can be increased
with reserve().

Best wishes,
Roland Pibinger

**Mehturt@gmail.com** · Jul 12 '06, 05:55 AM

Roland Pibinger wrote:

On Tue, 11 Jul 2006 07:06:16 GMT, "Robbie Hatley"
<bogus.address@no.spam> wrote:

So it really depends on which kind of function one wants to write:

1. Something efficient but dangerous, that requires having and
reading and understanding some external documentation to use it
correctly.

or

2. Something safe and easy and self-documenting, but a bit limited.

The following tokenize function, derived from John Potter's
implementation
(http://groups.google.com/group/comp....daafacd01ce26),
is IMO safe (no output iterators) and efficient (no dynamic
allocation, no substr). Usable for all STL-like containers (except
map) with bidirectional iterators:

#include <algorithm>

template <typename StringT, typename ContainerT>
size_t tokenize (const StringT& text, const StringT& delim,
                 ContainerT& result) {
    size_t num = 0;
    typename StringT::size_type b = text.find_first_not_of(delim);
    while (b != StringT::npos) {
        typename StringT::size_type e(text.find_first_of(delim, b));

I think e can be StringT::npos here and that would cause problems in
the line below.

        StringT s (text.c_str() + b, e - b);
        result.insert (result.end(), StringT());
        typename ContainerT::iterator iter = result.end();
        (*--iter).swap (s);
        ++num;
        b = text.find_first_not_of(delim, std::min(e, text.size()));
    }
    return num;
}

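
The concern about npos can be made concrete. When the last token runs to the end of the text, find_first_of returns npos, so e - b is far larger than the number of characters actually available, and the (pointer, count) string constructor must not be asked to read that many. One possible repair, sketched below, is to clamp e to text.size() before using it. This sketch is an assumption about how the fix could look, not the original author's code: it builds each token with the substring constructor instead of the swap trick, and tokenize_clamped and demo_clamped are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <list>
#include <string>
#include <vector>

template <typename StringT, typename ContainerT>
std::size_t tokenize_clamped(const StringT& text, const StringT& delim,
                             ContainerT& result)
{
    std::size_t num = 0;
    typename StringT::size_type b = text.find_first_not_of(delim);
    while (b != StringT::npos) {
        // find_first_of returns npos when the token runs to the end of the
        // text; clamp so that (e - b) is the real token length.
        typename StringT::size_type e =
            std::min(text.find_first_of(delim, b), text.size());
        result.insert(result.end(), StringT(text, b, e - b));
        ++num;
        // With e == text.size() this search returns npos and the loop ends.
        b = text.find_first_not_of(delim, e);
    }
    return num;
}

void demo_clamped()
{
    // The last token "c" is not followed by a delimiter.
    std::vector<std::string> v;
    std::size_t n = tokenize_clamped(std::string("a b c"), std::string(" "), v);
    assert(n == 3 && v.back() == "c");

    // Works with other containers that support insert(end(), value).
    std::list<std::string> l;
    assert(tokenize_clamped(std::string(" x "), std::string(" "), l) == 1);
    assert(l.front() == "x");
}
```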