Performance of hash_set vs. Java

  • Alex Gerdemann

    Performance of hash_set vs. Java

    Hello,

    I have spent a bunch of time converting a Java program I wrote to C++ in
    order to improve performance, and have found that it is not necessarily
    faster.

    Specifically, I'm writing a circuit simulator, so I must first parse the
    netlist file. The program goes through an input file, and makes a hash_set
    of unique nodes (std::strings). The words are then sorted and numbered by
    copying the hash_set into a vector and sorting. Then, I run a series of
    binary searches on the vector to allocate lines in a matrix for each
    element.

    I've modified STL's lower_bound() to my getIndex() which functions like
    Java's equivalent by returning negative indices when the element is not
    found.

    Also, Java provides a HashSet class that can hash any kind of element it may
    contain. Since the STL only hashes selected types, I had to convert my
    std::strings to const char*'s for hashing.

    Other than that, my code in the Java and C++ is quite similar. However, the
    parse of a 100k line file took around a second for Java, but more like 17s
    for C++. In fact, using std::set instead of hash_set actually improved
    performance to about 15s in C++.

    Any idea why there would be such a difference? In particular, I suspect
    that my hash function may be slow from the call to c_str(). Does this have
    to allocate new memory or just point to somewhere inside the object? I'm
    kind of curious what Java actually uses for its hash. Any general
    performance/other criticisms on my scheme are certainly welcome.

    On the plus side there is a great direct sparse linear solver for C. I've
    only been able to find an iterative sparse solver for Java. This makes the
    C++ code overall much faster than Java, but I suspect that the parse stage
    should run at least as fast or faster.

    Thanks,

    Alex Gerdemann
    University of Illinois at Urbana-Champaign

    For reference, I've included a sketch of the code

    So I did something like the following:

    struct hashString {
        __gnu_cxx::hash<const char*> h;

        // Delegate to the library's C-string hash.
        size_t operator()(const std::string& s) const {
            return h(s.c_str());
        }
    };

    inline int getIndex(const std::vector<std::string>& list,
                        const std::string& item) {
        std::vector<std::string>::const_iterator i =
            std::lower_bound(list.begin(), list.end(), item);
        // Not found: return -(insertion point + 1).
        if (i == list.end() || *i != item) {
            return -(static_cast<int>(i - list.begin()) + 1);
        }
        return static_cast<int>(i - list.begin());
    }

    __gnu_cxx::hash_set<std::string, hashString> nodeSet;
    std::vector<std::string> nodeVector;

    while(!done) {
        // read a line of file
        // split the line into words and push into a vector
        // check for syntax errors
        nextNode = some word in line;
        nodeSet.insert(nextNode);
    }
    std::copy(nodeSet.begin(), nodeSet.end(), std::back_inserter(nodeVector));
    std::sort(nodeVector.begin(), nodeVector.end());
    while(!done) {
        // read line from previously stored words
        // more error checking
        thisNode = some word in line;
        getIndex(nodeVector, thisNode);
        // allocate an object representing element containing proper indices
        // representing nodes
        // push pointer to element into a vector
    }


  • Nils O. Selåsdal

    #2
    Re: Performance of hash_set vs. Java

    Alex Gerdemann wrote:[color=blue]
    > Hello,
    > struct hashSet {
    > __gnu_cxx::hash<const char*> h;
    >
    > bool operator()(std::string& s) const {
    > return h(s.c_str());
    > }
    > };
    >
    > inline int getIndex(const std::vector<std::string>& list, const std::string&
    > item) {
    > std::vector<std::string>::const_iterator i =
    > std::lower_bound(list.begin(), list.end(), item);
    > if(i==list.end()) {
    > return -(static_cast<int>(i-list.begin())+1);
    > } else if(*i != item) {
    > return -(static_cast<int>(i-list.begin())+1);
    > }
    > return static_cast<int>(i-list.begin());
    > }
    >
    > __gnu_cxx::hash_set<std::string, hashString> nodeSet;
    > std::vector<std::string> nodeVector;[/color]
    When you use std::string, you will have *a lot* of string copying
    going on. Java will just pass its "reference"/"pointer" along.
    Try converting lists, etc. to hold std::string* (pointers) rather


    • Tom Widmer

      #3
      Re: Performance of hash_set vs. Java

      On Mon, 11 Oct 2004 11:56:38 GMT, "Alex Gerdemann"
      <null_soup@hotmail.com> wrote:
      [color=blue]
      >Hello,
      >
      >I have spent a bunch of time converting a Java program I wrote to C++ in
      >order to improve performance, and have found that it is not necessarily
      >faster.[/color]

      No, it probably won't be unless you use a better class than
      std::string to hold the strings. java.lang.String is in several ways
      more efficient than typical implementations of std::string.
      [color=blue]
      >Specifically, I'm writing a circuit simulator, so I must first parse the
      >netlist file. The program goes through an input file, and makes a hash_set
      >of unique nodes (std::string's) . The words are then sorted and numbered by
      >copying the hash_set into a vector and sorting. Then, I run a series of
      >binary searches on the vector to allocate lines in a matrix for each
      >element.
      >
      >I've modified STL's lower_bound() to my getIndex() which functions like
      >Java's equivalent by returning negative indices when the element is not
      >found.
      >
      >Also, Java provides a HashSet class that can hash any kind of element it may
      >contain. Since, the STL only hashes selected items I had to convert my
      >std::strings to const char*'s for hashing.[/color]

      The main benefit Java has in hashing is that Strings cache their
      hashcodes, so they need only be calculated once. std::string knows
      nothing about hashing, and therefore doesn't perform this
      optimization.
      [color=blue]
      >Other than that, my code in the Java and C++ is quite similar. However, the
      >parse of a 100k line file took around a second for Java, but more like 17s
      >for C++ . In fact, using std::set instead of hash_set actually improved
      >performance to about 15s in C++.[/color]

      How long did it take just to read in the file? I don't know much about
      GCC's hash_set, but I assume it is configurable for bucket size, etc.
      You might want to tweak this. What version of GCC are you using?
      [color=blue]
      >Any idea why there would be such a difference? In particular, I suspect
      >that my hash function may be slow from the call to c_str(). Does this have
      >to allocate new memory or just point to somewhere inside the object?[/color]

      Sometimes it has to do the former, due to reference counting problems,
      but usually it just returns a pointer to the storage. Tracing in using
      the debugger would tell you what's going on in the hash calls.

      I'm[color=blue]
      >kind of curious what Java actually uses for its hash. Any general
      >performace/other criticisms on my scheme are certainly welcome.[/color]

      Java just iterates over the whole String, generating the hash value,
      and then caches it for future calls.
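
To make the caching idea concrete, here is a minimal C++ sketch of a string wrapper that computes a Java-style polynomial hash (h = 31*h + c) on first use and then remembers it. `CachedHashString` and its members are hypothetical names invented for this illustration, not part of any library. Java hashes 16-bit chars and keeps a 32-bit int; this sketch hashes bytes with 32-bit unsigned arithmetic, so it only mirrors the scheme for plain ASCII text.

```cpp
#include <cstdint>
#include <string>

// Hypothetical wrapper: a string plus a lazily computed, cached hash value.
class CachedHashString {
public:
    explicit CachedHashString(const std::string& s)
        : text_(s), hash_(0), hashed_(false) {}

    const std::string& str() const { return text_; }

    // Java-style polynomial hash, computed on the first call and reused afterwards.
    std::uint32_t hash() const {
        if (!hashed_) {
            std::uint32_t h = 0;
            for (std::string::size_type i = 0; i < text_.size(); ++i) {
                h = 31u * h + static_cast<unsigned char>(text_[i]);
            }
            hash_ = h;
            hashed_ = true;
        }
        return hash_;
    }

private:
    std::string text_;
    mutable std::uint32_t hash_;  // cached value, valid only when hashed_ is true
    mutable bool hashed_;
};
```

A hash functor for a set could then simply return `s.hash()`, so repeated lookups of the same object would not rescan the characters.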
      [color=blue]
      >On the plus side there is a great direct sparse linear solver for C. I've
      >only been able to find a iterative sparse solver for Java. This makes the
      >C++ code overall much faster than Java, but I suspect that the parse stage
      >should run at least as fast or faster.
      >
      >Thanks,
      >
      >Alex Gerdemann
      >University of Illinois at Urbana-Champaign
      >
      >For reference, I've included a sketch of the code
      >
      >So I did something like the following:
      >
      >struct hashSet {
      > __gnu_cxx::hash<const char*> h;
      >
      > bool operator()(std::string& s) const {
      > return h(s.c_str());
      > }[/color]

      I doubt that is your bottleneck, but if the strings are long, then
      Java's hashcode caching might be giving it a major advantage.
      [color=blue]
      >};
      >
      >inline int getIndex(const std::vector<std::string>& list, const std::string&
      >item) {
      > std::vector<std::string>::const_iterator i =
      >std::lower_bound(list.begin(), list.end(), item);
      > if(i==list.end()) {
      > return -(static_cast<int>(i-list.begin())+1);
      > } else if(*i != item) {
      > return -(static_cast<int>(i-list.begin())+1);
      > }
      > return static_cast<int>(i-list.begin());
      >}
      >
      >__gnu_cxx::hash_set<std::string, hashString> nodeSet;
      >std::vector<std::string> nodeVector;
      >
      >while(!done) {
      > //read a line of file
      > //split the line into words and push into a vector
      > //check for syntax errors
      > nextNode = some word in line;
      > nodeSet.insert(nextNode);[/color]

      The above code may be where your main bottleneck is. How do you read
      the line and then split it into words? If you are creating a number of
      temporary strings, that will be slowing you down. You might consider
      using a fixed vector or two and reusing them, to avoid temporaries.
      [color=blue]
      >}[/color]

      Here you definitely want:
      nodeVector.reserve(nodeSet.size());
      [color=blue]
      >std::copy(nodeSet.begin(),nodeSet.end(),std::back_inserter(nodeVector));
      >std::sort(nodeVector.begin(),nodeVector.end());[/color]

      Both of those operations should be fast if you are using a reference
      counted string class. I believe GCC 3+ uses such a beast.
      [color=blue]
      >while(!done) {
      > //read line from previously stored words
      > //more error checking
      > thisNode = some word in line
      > getIndex(nodeVector, thisNode);
      > //allocate an object representing element containing proper indices
      >representing nodes
      > //push pointer to element into a vector
      >}[/color]

      Overall, I think your best bet would be to benchmark smaller bits of
      the code or put it through a profiler (-gprof IIRC, which I probably
      don't), to find out where the bottleneck actually lies.
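
As one way to benchmark smaller pieces without a profiler, the sketch below times a single callable with a wall clock. It assumes a C++11 compiler (for `<chrono>` and lambdas), which postdates this thread, and `elapsedMs` is a hypothetical helper name, not a library function.

```cpp
#include <chrono>

// Hypothetical helper: runs f() once and returns the elapsed wall-clock time
// in milliseconds.
template <typename F>
double elapsedMs(F f) {
    const std::chrono::steady_clock::time_point start =
        std::chrono::steady_clock::now();
    f();
    const std::chrono::steady_clock::time_point stop =
        std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}
```

Wrapping the read loop, the set insertion, and the sort in separate calls would show which stage dominates the 15 to 17 seconds reported above.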

      Tom


      • Ivan Vecerina

        #4
        Re: Performance of hash_set vs. Java

        "Alex Gerdemann" <null_soup@hotmail.com> wrote in message
        news:Wxuad.369933$Fg5.53609@attbi_s53...[color=blue]
        > I have spent a bunch of time converting a Java program I wrote to C++ in
        > order to improve performance, and have found that it is not necessarily
        > faster.[/color]
        This can often be the case. Some native Java features can be more efficient
        than the C++ equivalents (which sometimes are designed for more
        flexibility).
        This is especially true for C++ streams.
        [color=blue]
        > Specifically, I'm writing a circuit simulator, so I must first parse the
        > netlist file. The program goes through an input file, and makes a
        > hash_set of unique nodes (std::string's) . The words are then sorted and
        > numbered by copying the hash_set into a vector and sorting. Then, I run a
        > series of binary searches on the vector to allocate lines in a matrix for
        > each element.[/color]
        As someone already pointed out, copying strings can be expensive in C++.
        An alternative would be to keep the originally read std::string,
        then use only a "const char*" pointer obtained using the string::c_str()
        member function (I'd do this rather than use a std::string*).
        [color=blue]
        > I've modified STL's lower_bound() to my getIndex() which functions like
        > Java's equivalent by returning negative indices when the element is not
        > found.[/color]
        This wrapper is an unnecessary Java-ism, but ok,
        it should be irrelevant to performance.
        [color=blue]
        > Also, Java provides a HashSet class that can hash any kind of element it
        > may contain. Since, the STL only hashes selected items I had to convert
        > my std::strings to const char*'s for hashing.[/color]
        All C++ hash_set implementations I've seen allow users to provide
        a custom hash function as an additional parameter. But this shouldn't
        be a big problem.
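
As a rough illustration of supplying a hash function as a template parameter, the sketch below uses `std::unordered_set` from C++11 as a stand-in for the non-standard `hash_set` discussed here; that substitution is an assumption, since the thread predates it. `StringHash`, `NodeSet`, and `collectNodes` are hypothetical names, and the mixing step is FNV-1a-style rather than whatever GCC's `hash<const char*>` does internally.

```cpp
#include <cstddef>
#include <string>
#include <unordered_set>

// Hypothetical functor: hashes the string's bytes directly, with no c_str()
// round trip (FNV-1a-style mixing using the 32-bit constants).
struct StringHash {
    std::size_t operator()(const std::string& s) const {
        std::size_t h = 2166136261u;
        for (std::string::size_type i = 0; i < s.size(); ++i) {
            h ^= static_cast<unsigned char>(s[i]);
            h *= 16777619u;
        }
        return h;
    }
};

// The hash functor is passed as the second template parameter.
typedef std::unordered_set<std::string, StringHash> NodeSet;

// Inserts every name; the set silently ignores duplicates.
inline NodeSet collectNodes(const char* const* names, std::size_t count) {
    NodeSet nodes;
    nodes.reserve(count);  // pre-size the bucket array to limit rehashing
    for (std::size_t i = 0; i < count; ++i) {
        nodes.insert(std::string(names[i]));
    }
    return nodes;
}
```

The `reserve` call corresponds to the pre-allocation advice given further down in this reply.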
        [color=blue]
        > Other than that, my code in the Java and C++ is quite similar. However,
        > the parse of a 100k line file took around a second for Java, but more like
        > 17s for C++ . In fact, using std::set instead of hash_set actually
        > improved performance to about 15s in C++.[/color]
        Unfortunately, hash_set is not yet in the C++ standard, so the explanation
        has to be implementation-specific.
        However, you may want to see if you can pre-allocate or predefine the size
        of your hash table. Also, as previously pointed out, beware of std::string
        instances being copied.
        [color=blue]
        > Any idea why there would be such a difference? In particular, I suspect
        > that my hash function may be slow from the call to c_str(). Does this
        > have to allocate new memory or just point to somewhere inside the object?
        > I'm kind of curious what Java actually uses for its hash. Any general
        > performace/other criticisms on my scheme are certainly welcome.[/color]
        Are you sure that the hashing is the performance bottleneck?
        C++ i/o streams could also be a likely cause for the weak performance.

        All in all, I am very confident that better performance can be achieved
        in C++ than in Java. But in some cases, especially string and file i/o,
        this can require extra work :(

        Unfortunately, the code you posted shows very little about the file
        reading code and hash table, so we can't really help there.
        Regarding the vector and sorting only, I can say that using const char*
        instead of std::string is likely to improve execution speed.

        Cheers -Ivan
        --
        http://ivan.vecerina.com/contact/?subject=NG_POST <- email contact form



        • Alex Gerdemann

          #5
          Re: Performance of hash_set vs. Java

          Tom wrote:[color=blue][color=green]
          >>Other than that, my code in the Java and C++ is quite similar. However,
          >>the
          >>parse of a 100k line file took around a second for Java, but more like 17s
          >>for C++ . In fact, using std::set instead of hash_set actually improved
          >>performance to about 15s in C++.[/color]
          >
          > How long did it take just to read in the file? I don't know much about
          > GCC's hash_set, but I assume it is configurable for bucket size, etc.
          > You might want to tweak this. What version of GCC are you using?
          >[/color]

          I'm using Cygwin's special version of GCC 3.3.3. I can't really separate
          how much time is spent just reading the file, as I process it line by line,
          doing some work between each I/O call.
          [color=blue][color=green]
          >>while(!done) {
          >> //read a line of file
          >> //split the line into words and push into a vector
          >> //check for syntax errors
          >> nextNode = some word in line;
          >> nodeSet.insert(nextNode);[/color]
          >
          > The above code may be where your main bottleneck is. How do you read
          > the line and then split it into words? If you are creating a number of
          > temporary strings, that will be slowing you down. You might consider
          > using a fixed vector or two and reusing them, to avoid temporaries.
          >[/color]

          Since you asked here's exactly what I did:

          std::vector< std::vector<std::string> > lines;
          std::string line;
          std::vector<std::string> thisLine;

          do {
              std::getline(file, line);
              if (line.length() != 0) {
                  split(line, thisLine);
                  lines.push_back(thisLine);
                  // process the line
              }
          } while (!file.eof());

          inline void split(const std::string& line, std::vector<std::string>& words)
          {
              // size_type (not unsigned int) so comparisons with npos are reliable
              std::string::size_type firstMark = 0, lastMark = 0;
              words.clear();
              while (lastMark != std::string::npos) {
                  firstMark = line.find_first_not_of(" ", lastMark);
                  if (firstMark == std::string::npos) break;
                  lastMark = line.find_first_of(" ", firstMark);
                  words.push_back(line.substr(firstMark, lastMark - firstMark));
              }
          }

          [color=blue][color=green]
          >>}[/color]
          >
          > Here you definitely want:
          > nodeVector.reserve(nodeSet.size());
          >[color=green]
          >>std::copy(nodeSet.begin(),nodeSet.end(),std::back_inserter(nodeVector));
          >>std::sort(nodeVector.begin(),nodeVector.end());[/color]
          >
          > Both of those operations should be fast if you are using a reference
          > counted string class. I believe GCC 3+ uses such a beast.
          >[/color]

          I added the reserve call which saved me a couple of tenths of a second.
          Nice, but must not be the major bottleneck. What is a "reference counted"
          class?
          [color=blue]
          > Overall, I think your best bet would be to benchmark smaller bits of
          > the code or put it through a profiler (-gprof IIRC, which I probably
          > don't), to find out where the bottleneck actually lies.[/color]

          I went ahead and tried this, but can't quite understand the results. The
          total time it computes is way off. (It says it's 8s vs. the actual 14s).
          Also, it thinks that only 30% of run time was spent in the main() function.
          This can't possibly be right.

          -Alex Gerdemann
          University of Illinois Urbana-Champaign



          • Alex Gerdemann

            #6
            Re: Performance of hash_set vs. Java

            Ivan wrote:[color=blue][color=green]
            >> I've modified STL's lower_bound() to my getIndex() which functions like
            >> Java's equivalent by returning negative indices when the element is not
            >> found.[/color]
            > This wrapper is an unnecessary Java-ism, but ok,
            > it should be irrelevant to performance.[/color]

            Well the specific negative value returned is unnecessary, but I do need to
            know if the string isn't found. It's kind of annoying that the function
            doesn't return this information because to find it myself I have to:

            1) check that the returned iterator does not point to set.end()
            2) check that the iterator points to what it claims.

            Certainly, the search algorithm knows internally if it found what it was
            looking for. Having to do that job again will certainly cost at least some
            time.

            The reason the node may not be in the list has to do with the details of
            circuit analysis. Specifically, the matrix describing the system should not
            include a row or column representing the ground node. So, my program
            collects all the nodes the user defines, searches for the ground node, and
            erases it from the list. That way, I can sort the list, and use the index
            of each node to allocate a line in the matrix. Since I've collected all the
            nodes, I know that if a particular node isn't found in the list, it must be
            the ground node. It's kind of an ugly mechanism, I guess, but I currently
            don't have a better idea.
            [color=blue][color=green]
            >> Also, Java provides a HashSet class that can hash any kind of element it
            >> Other than that, my code in the Java and C++ is quite similar. However,
            >> the parse of a 100k line file took around a second for Java, but more
            >> like 17s for C++ . In fact, using std::set instead of hash_set actually
            >> improved performance to about 15s in C++.[/color]
            > Unfortunately, hash_set is not yet in the C++ standard, so the explanation
            > has to be implementation-specific.
            > However, you may want to see if you can pre-allocate or predefine the size
            > of your hash table. Also, as previously pointed out, beware of std::string
            > instances being copied.[/color]

            I would like to convert the vector to store pointers to strings, rather than
            the strings themselves, but then I cannot use the built-in sort and
            binary searches to find a particular string efficiently.
            [color=blue]
            > Unfortunately, the code you posted shows very little about the file
            > reading code and hash table, so we can't really help there.
            > Regarding the vector and sorting only, I can say that using const char*
            > instead of std::string is likely to improve execution speed.[/color]

            I didn't write my own hash table. I thought hash_set handled this problem on
            its own. Not coming from a CS background, I don't know how to write a good
            hash scheme, and it seems like this is the sort of thing that should be
            provided by the built in libraries. Since it works faster, for now, I've
            switched back to the regular (tree?) set. I want to use a set so repeat
            copies of nodes will not be duplicated when added to the set. I actually
            don't do any of the searches until after the set is copied into a vector.
            Given that, should I actually expect the hash_set to have a faster insertion
            time? Not having a CS background, I just read Java's documentation which
            says that a hash set is faster in most cases.

            On the I/O code, I posted this in my other reply, but here's another copy:

            std::vector< std::vector<std::string> > lines;
            std::string line;
            std::vector<std::string> thisLine;

            do {
                std::getline(file, line);
                if (line.length() != 0) {
                    split(line, thisLine);
                    lines.push_back(thisLine);
                    // process the line
                }
            } while (!file.eof());

            inline void split(const std::string& line, std::vector<std::string>& words)
            {
                // size_type (not unsigned int) so comparisons with npos are reliable
                std::string::size_type firstMark = 0, lastMark = 0;
                words.clear();
                while (lastMark != std::string::npos) {
                    firstMark = line.find_first_not_of(" ", lastMark);
                    if (firstMark == std::string::npos) break;
                    lastMark = line.find_first_of(" ", firstMark);
                    words.push_back(line.substr(firstMark, lastMark - firstMark));
                }
            }

            Thanks for the tips,

            -Alex Gerdemann
            University of Illinois Urbana-Champaign



            • Ivan Vecerina

              #7
              Re: Performance of hash_set vs. Java

              "Alex Gerdemann" <null_soup@hotmail.com> wrote in message
              news:%WTad.234454$D%.43700@attbi_s51...[color=blue]
              > Ivan wrote:[color=green][color=darkred]
              >>> I've modified STL's lower_bound() to my getIndex() which functions like
              >>> Java's equivalent by returning negative indices when the element is not
              >>> found.[/color]
              >> This wrapper is an unnecessary Java-ism, but ok,
              >> it should be irrelevant to performance.[/color]
              >
              > Well the specific negative value returned is unnecessary, but I do need to
              > know if the string isn't found. It's kind of annoying that the function
              > doesn't return this information because to find it myself I have to:
              >
              > 1) check that the returned iterator does not point to set.end()
              > 2) check that the iterator points to what it claims.
              >
              > Certainly, the search algorithm knows internally if it found what it was
              > looking for. Having to do that job again will certainly cost at least
              > some
              > time.[/color]
              Actually, lower_bound will not test that the hit value is actually
              equal to the one being looked for (because it only uses '<' for comparison),
              but I agree the interface is cumbersome.
              std::equal_range could be an alternative...
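
For illustration, a found-or-not lookup built on `std::equal_range` might look like the sketch below. `findIndex` is a hypothetical name, and returning -1 for a missing item is a simplification; it does not reproduce the negative insertion-point encoding of the original getIndex().

```cpp
#include <algorithm>
#include <string>
#include <utility>
#include <vector>

// Returns the index of item in the sorted vector, or -1 if it is absent.
inline int findIndex(const std::vector<std::string>& sorted,
                     const std::string& item) {
    typedef std::vector<std::string>::const_iterator Iter;
    const std::pair<Iter, Iter> range =
        std::equal_range(sorted.begin(), sorted.end(), item);
    if (range.first == range.second) {
        return -1;  // empty range: no element equivalent to item
    }
    return static_cast<int>(range.first - sorted.begin());
}
```

An empty range replaces the two manual checks (end iterator, then equality), at the cost of a second binary search inside `equal_range`.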
              [color=blue][color=green][color=darkred]
              >>> Also, Java provides a HashSet class that can hash any kind of element it
              >>> Other than that, my code in the Java and C++ is quite similar. However,
              >>> the parse of a 100k line file took around a second for Java, but more
              >>> like 17s for C++ . In fact, using std::set instead of hash_set actually
              >>> improved performance to about 15s in C++.[/color]
              >> Unfortunately, hash_set is not yet in the C++ standard, so the explanation
              >> has to be implementation-specific.
              >> However, you may want to see if you can pre-allocate or predefine the
              >> size
              >> of your hash table. Also, as previously pointed out, beware of
              >> std::string
              >> instances being copied.[/color]
              >
              > I would like to convert the vector to store pointers to strings, rather
              > than
              > the strings themselves, but then I cannot search use the built in sort,
              > and
              > binary searches to find a particular string efficiently.[/color]

              Actually you can, but you will need to provide these algorithms with an
              additional parameter, which is a comparison function:

              // add this declaration (needs <cstring> for std::strcmp)
              struct StrPtrCompare
              {
                  bool operator()(char const* a, char const* b) const
                  { return std::strcmp(a, b) < 0; }
              };

              // and from your function, call:
              std::sort(vect.begin(), vect.end(), StrPtrCompare());

              [color=blue][color=green]
              >> Unfortunately, the code you posted shows very little about the file
              >> reading code and hash table, so we can't really help there.
              >> Regarding the vector and sorting only, I can say that using const char*
              >> instead of std::string is likely to improve execution speed.[/color]
              >
              > I didn't write my own hash table.[/color]

              I did not suggest you should. Just some implementations of hash_set
              have an equivalent of vector::reserve() to preallocate a larger
              table (and avoid later reallocations). But no big deal...
              [color=blue]
              > copies of nodes will not be duplicated when added to the set. I actually
              > don't do any of the searches until after the set is copied into a vector.
              > Given that, should I actually expect the hash_set to have a faster
              > insertion
              > time? Not having a CS background, I just read Java's documentation which
              > says that a hash set is faster in most cases.[/color]
              What may be faster is to first copy and sort everything into a vector,
              then look for (contiguous) duplicate items and remove them.
              [ since you need a sorted vector anyway, the hash may be redundant ]
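
The sort-then-remove-duplicates approach can be sketched with the standard algorithms as follows. `sortAndDeduplicate` is a hypothetical name; the sketch assumes all node names have already been pushed into the vector, duplicates included.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Sorts the names, then erases adjacent duplicates so each name appears once.
inline void sortAndDeduplicate(std::vector<std::string>& names) {
    std::sort(names.begin(), names.end());
    // std::unique moves the distinct elements to the front and returns the new
    // logical end; erase drops the leftover tail.
    names.erase(std::unique(names.begin(), names.end()), names.end());
}
```

After the call the vector is both sorted and duplicate-free, so it can be binary-searched directly without an intermediate set.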
              [color=blue]
              > On the I/O code, I posted this in my other reply, but here's another copy:
              >
              > std::vector< std::vector< std::string > > lines;[/color]

              Unfortunately, such a data structure can be inefficient in C++,
              and involve many object copies and memory allocations.
              But there are a few tricks that can help....
              [color=blue]
              > std::string line;
              > std::vector<std::string> thisLine;
              >
              > do {
              > std::getline(file,line);
              > if (line.length() != 0) {
              > split(line,thisLine);
              > lines.push_back(thisLine);[/color]

              Instead of the last two lines, the following will avoid
              an intermediate copy of the objects and sub-objects:
              lines.push_back( std::vector<std::string>() );
              split( line, lines.back() );

              Also, it will help if you call
              lines.reserve( someGuessOfTheNumberOfInputLines );
              prior to reading the file.
              Alternatively, you could use a different type:
              std::vector< std::vector< std::string > > lines;
              // will be faster on some platforms.
              [color=blue]
              > //process the line
              > } while(!netlist.eof());
              >
              > inline void split(const std::string& line, std::vector<std::string>&
              > words)
              > {
              > unsigned int firstMark = 0, lastMark = 0;
              > words.clear();
              > while(lastMark!=std::string::npos) {
              > firstMark=line.find_first_not_of(" ",lastMark);
              > if(firstMark==std::string::npos) break;
              > lastMark=line.find_first_of(" ",firstMark);
              > words.push_back(line.substr(firstMark,lastMark-firstMark));
              > }
              > }[/color]

              Now not talking about efficiency, all the input code above probably
              could be simplified as follows (including <sstream> and <iterator>):

              while( std::getline(file, line) )
              {
                  lines.push_back( std::vector<std::string>() ); // add empty object
                  std::istringstream words(line); // named stream: the iterator needs an lvalue
                  lines.back().assign(
                      std::istream_iterator<std::string>(words),
                      std::istream_iterator<std::string>() );
              }

              This is the C++ input code I would start with.
              [ for the rest, the tips above - use char* after this input code
              and skip the hash if possible - should improve execution speed ]


              If the reading code remains a critical bottleneck, the
              ultimate solution would be to:
              1) skip iostreams, and use memory mapping or a single fread
              to bring the whole file into memory
              2) parse the file in-place, and add null chars at the end of
              each 'word', so I can use a simple char* to access all
              strings in-place, without any memory allocation.
              That would take me an extra day of programming and be less
              portable/maintainable, but be blazingly fast...



              Well... I hope some of this stuff will be useful.
              [I'm a bit in a rush]

              Cheers,
              Ivan
              --
              http://ivan.vecerina.com/contact/?subject=NG_POST <- email contact form



              • Ivan Vecerina

                #8
                Re: Performance of hash_set vs. Java

                I (Ivan Vecerina) wrote in message news:ckh97b$sna$1@newshispeed.ch...[color=blue][color=green]
                > > copies of nodes will not be duplicated when added to the set. I[/color][/color]
                actually[color=blue][color=green]
                > > don't do any of the searches until after the set is copied into a[/color][/color]
                vector.[color=blue][color=green]
                > > Given that, should I actually expect the hash_set to have a faster
                > > insertion
                > > time? Not having a CS background, I just read Java's documentation[/color][/color]
                which[color=blue][color=green]
                > > says that a hash set is faster in most cases.[/color]
                > What may be faster is to first copy and sort everything into a vector,
                > then look for (contiguous) duplicate items and remove them.
                > [ since you need a sorted vector anyway, the hash may be redundant ][/color]
                NB: however this only makes sense if there are few duplicate strings to
                detect.
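
A minimal sketch of the sort-then-remove-duplicates idea, assuming the node names have simply been appended to a std::vector<std::string> while parsing. sortAndDeduplicate is a hypothetical helper name, not something from the original program.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Hypothetical helper: sorts 'nodes' and removes duplicate names.
inline void sortAndDeduplicate(std::vector<std::string>& nodes)
{
    std::sort(nodes.begin(), nodes.end());
    // After sorting, equal names are adjacent. std::unique keeps the first of
    // each run and returns the new logical end; erase() trims what is left over.
    nodes.erase(std::unique(nodes.begin(), nodes.end()), nodes.end());
}
```

The result is the same sorted, duplicate-free vector that copying a set would give, so std::lower_bound can be used on it unchanged. The sort runs over every inserted name, duplicates included, which is why this pays off only when duplicates are rare.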

                > Well... I hope some of this stuff will be useful.
                And of course it is difficult to suggest solutions
                without seeing the big picture...

                Regards
                -Ivan
                --
                http://ivan.vecerina.com/contact/?subject=NG_POST <- e-mail contact form



                • Tom Widmer

                  #9
                  Re: Performance of hash_set vs. Java

                  On Tue, 12 Oct 2004 16:28:16 GMT, "Alex Gerdemann"
                  <null_soup@hotmail.com> wrote:

                  >I'm using Cygwin's special version of GCC 3.3.3. I can't really separate
                  >how much time is spent just reading the file, as I process it line by line,
                  >doing some work between each I/O call.

                  Ok, I'm 95% sure that gcc 3.3 uses a copy-on-write (COW) string
                  implementation, with reference counting (gcc 3.4 certainly does). This
                  means that copying a string doesn't require that any memory is
                  allocated - the copy shares representation with the original and only
                  creates its own unique copy when it might modify it. This also means
                  that copying strings is cheap, as long as those strings are only
                  accessed through const member functions after the copy has occurred.

                  >>>while(!done) {
                  >>> //read a line of file
                  >>> //split the line into words and push into a vector
                  >>> //check for syntax errors
                  >>> nextNode = some word in line;
                  >>> nodeSet.insert(nextNode);
                  >>
                  >> The above code may be where your main bottleneck is. How do you read
                  >> the line and then split it into words? If you are creating a number of
                  >> temporary strings, that will be slowing you down. You might consider
                  >> using a fixed vector or two and reusing them, to avoid temporaries.
                  >>
                  >
                  >Since you asked here's exactly what I did:
                  >
                  >std::vector<std::vector<std::string> > lines;

                  You definitely want:
                  lines.reserve(estimateOfNumberOfLines);

                  >std::string line;
                  >std::vector<std::string> thisLine;
                  >
                  >do {

                  line.reserve(enoughForALine); //may well help

                  > std::getline(file,line);
                  > if (line.length() != 0) {
                  > split(line,thisLine);
                  > lines.push_back(thisLine);

                  The above line is going to be slow, since it involves copying the
                  whole vector. Instead, you might do:
                  //avoid copying the vector:
                  lines.resize(lines.size() + 1); //add extra default element
                  lines.back().swap(thisLine); //swap it with the current element
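
To show the swap idiom in isolation, here is a small sketch. appendBySwap is a hypothetical helper name, not part of the original program.

```cpp
#include <string>
#include <vector>

// Hypothetical helper: moves the contents of 'thisLine' to the end of 'lines'
// without copying any of the strings. Afterwards 'thisLine' is empty.
inline void appendBySwap(std::vector<std::vector<std::string> >& lines,
                         std::vector<std::string>& thisLine)
{
    lines.resize(lines.size() + 1); // append one empty vector
    lines.back().swap(thisLine);    // exchange internals in constant time
}
```

One side effect worth knowing: after the swap, thisLine holds the empty vector that was just created, so it typically has no reserved capacity left. A reserve() call inside split() would then allocate again for each line.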

                  > //process the line
                  >} while(!netlist.eof());
                  >
                  >inline void split(const std::string& line, std::vector<std::string>& words)
                  >{
                  > unsigned int firstMark = 0, lastMark = 0;
                  > words.clear();

                  words.reserve(50); //say

                  > while(lastMark!=std::string::npos) {
                  > firstMark=line.find_first_not_of(" ",lastMark);
                  > if(firstMark==std::string::npos) break;
                  > lastMark=line.find_first_of(" ",firstMark);
                  > words.push_back(line.substr(firstMark,lastMark-firstMark));

                  That might be slightly more efficient as:

                  words.push_back(std::string(line, firstMark, lastMark - firstMark));

                  but I doubt it will make much difference with a COW string
                  implementation.

                  > }
                  >}
                  >
                  >
                  >>>}
                  >>
                  >> Here you definitely want:
                  >> nodeVector.reserve(nodeSet.size());
                  >>
                  >>>std::copy(nodeSet.begin(),nodeSet.end(),std::back_inserter(nodeVector));
                  >>>std::sort(nodeVector.begin(),nodeVector.end());
                  >>
                  >> Both of those operations should be fast if you are using a reference
                  >> counted string class. I believe GCC 3+ uses such a beast.
                  >>
                  >
                  >I added the reserve call which saved me a couple of tenths of a second.
                  >Nice, but must not be the major bottleneck. What is a "reference counted"
                  >class?

                  See above.

                  >
                  >> Overall, I think your best bet would be to benchmark smaller bits of
                  >> the code or put it through a profiler (-gprof IIRC, which I probably
                  >> don't), to find out where the bottleneck actually lies.
                  >
                  >I went ahead and tried this, but can't quite understand the results. The
                  >total time it computes is way off. (It says it's 8s vs. the actual 14s).
                  >Also, it thinks that only 30% of run time was spent in the main() function.
                  >This can't possibly be right.

                  It might not be counting time spent in system functions, such as IO
                  (system vs user time?).

                  If you want to optimize the above, I'd do this:

                  Write a simple immutable string type that is initialized with a
                  pointer and a length, but allocates no memory and does nothing in the
                  destructor. Include in the string a cached hashcode value, so that the
                  hashcode need only be calculated once, e.g.

                  class mystring
                  {
                      char const* m_ptr;
                      std::size_t m_length;
                      mutable unsigned long m_hashCode;
                  public:
                      mystring(char const* ptr, std::size_t length)
                          :m_ptr(ptr), m_length(length),
                          m_hashCode(static_cast<unsigned long>(-1))
                      {
                      }

                      unsigned long hashCode() const
                      {
                          if (m_hashCode == static_cast<unsigned long>(-1))
                          {
                              //calculate hashCode (copy java.lang.String code?)
                          }
                          return m_hashCode;
                      }

                      char operator[](std::size_t index) const
                      {
                          return m_ptr[index];
                      }

                      //operator==, <, etc.
                      //compiler generated destructor, copy, assignment are fine.
                  };
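
One possible way to fill in the missing pieces of the class above, offered as a sketch rather than the intended implementation. It is written as a separate class with the hypothetical name WordRef, and it uses a bool flag in place of the sentinel hash value. The hash follows the recurrence that java.lang.String.hashCode() is documented to use (h = 31*h + c), computed here in unsigned long.

```cpp
#include <cstddef>
#include <cstring>

// Sketch only: a non-owning reference to 'length' characters starting at 'ptr'.
// 'WordRef' is a made-up name; the characters must outlive the object.
class WordRef
{
    char const* m_ptr;
    std::size_t m_length;
    mutable unsigned long m_hashCode;
    mutable bool m_hashed;  // assumption: a flag instead of a sentinel value
public:
    WordRef(char const* ptr, std::size_t length)
        : m_ptr(ptr), m_length(length), m_hashCode(0), m_hashed(false) {}

    std::size_t length() const { return m_length; }

    // Computes h = 31*h + c over all characters, once, then returns the cache.
    unsigned long hashCode() const
    {
        if (!m_hashed) {
            unsigned long h = 0;
            for (std::size_t i = 0; i < m_length; ++i)
                h = 31 * h + static_cast<unsigned char>(m_ptr[i]);
            m_hashCode = h;
            m_hashed = true;
        }
        return m_hashCode;
    }

    // Equal when lengths match and all characters match.
    bool operator==(WordRef const& other) const
    {
        return m_length == other.m_length
            && std::memcmp(m_ptr, other.m_ptr, m_length) == 0;
    }

    // Byte-wise lexicographic order; a proper prefix sorts first.
    bool operator<(WordRef const& other) const
    {
        std::size_t n = m_length < other.m_length ? m_length : other.m_length;
        int cmp = std::memcmp(m_ptr, other.m_ptr, n);
        return cmp < 0 || (cmp == 0 && m_length < other.m_length);
    }
};
```

Java computes the same recurrence in 32-bit int arithmetic, so values agree with Java only while the sum stays small. For hashing inside a single C++ program that difference does not matter.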

                  Read the entire file into a vector<char>.

                  Iterate over the vector, creating "mystring"s pointing into the
                  vector for each word. You might also consider replacing ' ' characters
                  with '\0's, so that you can add a c_str() method to mystring that simply
                  looks like this:
                  char const* c_str() const
                  {
                      return m_ptr;
                  }

                  Operate on this new vector<vector<mystring> >.

                  That should be much much faster, since the memory allocation overhead
                  will be vastly decreased. If you don't need a vector of lines, just
                  have an overall vector of words for another speed up. Essentially,
                  optimization in non-numerical C++ is often about reducing the number
                  of calls to "new" and "delete", which are often even slower than the
                  Java versions (new + gc).

                  Tom


                  • Karl Heinz Buchegger

                    #10
                    Re: Performance of hash_set vs. Java

                    Alex Gerdemann wrote:
                    >
                    [snip]
                    > Other than that, my code in the Java and C++ is quite similar. However, the
                    > parse of a 100k line file took around a second for Java, but more like 17s
                    > for C++ . In fact, using std::set instead of hash_set actually improved
                    > performance to about 15s in C++.

                    Hmm. Just a quick question:
                    Are you sure the C++ optimizer ran over your code and that you are not
                    timing a debug version?

                    Especially with container templates the optimizer can often dramatically
                    reduce execution time.

                    --
                    Karl Heinz Buchegger
                    kbuchegg@gascad.at


                    • Tom Widmer

                      #11
                      Re: Performance of hash_set vs. Java

                      On Wed, 13 Oct 2004 15:59:23 +0200, Karl Heinz Buchegger
                      <kbuchegg@gascad.at> wrote:

                      >Alex Gerdemann wrote:
                      >>
                      >[snip]
                      >> Other than that, my code in the Java and C++ is quite similar. However, the
                      >> parse of a 100k line file took around a second for Java, but more like 17s
                      >> for C++ . In fact, using std::set instead of hash_set actually improved
                      >> performance to about 15s in C++.
                      >
                      >Hmm. Just a quick question:
                      >Are you sure the C++ optimizer ran over your code and that you are not
                      >timing a debug version?
                      >
                      >Especially with container templates the optimizer can often dramatically
                      >reduce execution time.

                      Good point. For GCC, use -O3 as a minimum (you can add architecture
                      specific optimizations too if you like).

                      Tom
