Hash table in C++? STL?

Collapse
This topic is closed.
X
X
 
  • Time
  • Show
Clear All
new posts
  • Kevin Goodsell

    #16
    Re: Hash table in C++? STL?

    Claudio Puviani wrote:
    [color=blue]
    > "Kevin Goodsell" <usenet2.spamfr ee.fusion@never box.com> wrote
    >[color=green]
    >>So why the prime table size? I can't see an obvious reason.
    >>If hash values are approximately uniformly distributed between
    >>0 and N, I'd think that ideal table sizes would be numbers that
    >>evenly divide N, because any remainder would split the table
    >>into two areas with different probabilities of any particular
    >>value hashing into them.[/color]
    >
    >
    > If the hash values are uniformly distributed (which is one characteristic of a
    > good hashing function), then any table size would be equivalent; i.e., there
    > would be no "ideal" size. The reason that prime numbers are used for table
    > sizes is that many hashing functions are NOT good and sometimes generate hash
    > values that are multiples of one or more numbers. Those values, modulo a
    > relatively non-prime number will cluster. Those same values modulo a relatively
    > prime number will uniformly span the entire set of 0..N-1. In other words,
    > primality is used as insurance against bad hashing functions. You can persuade
    > yourself of this by modeling it in a spreadsheet.
    >[/color]

    I thought it might be something like that. Although, I disagree that
    there's no "ideal" size with uniformly distributed hash values. I
    definitely think some sizes are better than others (though it may be a
    small difference). As a trivial example, suppose the range of hash
    values is 0 to 75, and the table size is 50. Hash values from 0 to 25,
    and also from 51 to 75, end up in the lower 25 slots, while only those
    hash values in the range 26 to 50 end up in the upper 25 slots. As a
    result, the probability of a key hashing to the first 25 slots is twice
    as high as the probability of hashing to the last 25.

    I doubt this makes any difference in practice, but it is what I had in
    mind in my previous post.

    -Kevin
    --
    My email address is valid, but changes periodically.
    To contact me please use the address from a recent posting.

    Comment

    • pembed2003

      #17
      Re: Hash table in C++? STL?

      Michiel.Salters @logicacmg.com (Michiel Salters) wrote in message news:<fcaee77e. 0404150102.1f08 d3dd@posting.go ogle.com>...[color=blue]
      > pembed2003@yaho o.com (pembed2003) wrote in message news:<db593abd. 0404141727.4137 8bf2@posting.go ogle.com>...
      >[color=green]
      > > Hi John,
      > > Can you point me to the documentation where I can find the map and
      > > hash_map data structure? Thanks for the help![/color]
      >
      > std::map is documented in quite a number of books, e.g. Accelerated C++
      > (Koenig&Moo) or Generic Programming and the STL(Austern)
      >
      > http://www.amazon.com/exec/obidos/ASIN/020170353X/
      > http://www.amazon.com/exec/obidos/ASIN/0201309564/
      >
      > hash_map isn't in the standard, so there are multiple different
      > versions around. Get the documentation from the same source as
      > your code. Your compiler vendor may have included something.
      >[/color]

      Thanks for everyone's help so it looks like map is the standard and
      should be available for any platform? But hash_map is not standard.

      The reason I am asking for hash table is because I need to write a
      function that finds the overlap of two integer array. I am currently
      doing:

      #include<stdlib .h>
      #include<string .h>
      #include<sys/int_limits.h>

      #define SIZE ((sizeof(int)) * (INT16_MAX) * (2))

      void find(int* one,int c1,int* two,int c2){
      int i;
      int* buffer = malloc(SIZE);
      memset((void*)b uffer,0,SIZE);
      for(i = 0; i < c1; i++)
      buffer[one[i] + INT16_MAX] = 1;
      for(i = 0; i < c2; i++)
      if(buffer[two[i] + INT16_MAX])
      printf("%d\n",t wo[i]);
      free(buffer);
      }

      int main(int argc,char** argv){
      int a[] = {1,3,45,6,4,-1,8};
      int b[] = {4,2,4,222,8,45 ,-1};
      find(a,7,b,7);
      }

      this gives me a O(n) algr. but seems to be wasting a lot of memory so
      I am thinking using a hashtable instead. am i in the right track?
      Thanks!

      Comment

      • Siemel Naran

        #18
        Re: Hash table in C++? STL?

        "Claudio Puviani" <puviani@hotmai l.com> wrote in message news:Hexfc.6607 9[color=blue]
        > "Kevin Goodsell" <usenet2.spamfr ee.fusion@never box.com> wrote[/color]
        [color=blue][color=green]
        > > So why the prime table size? I can't see an obvious reason.
        > > If hash values are approximately uniformly distributed between
        > > 0 and N, I'd think that ideal table sizes would be numbers that
        > > evenly divide N, because any remainder would split the table
        > > into two areas with different probabilities of any particular
        > > value hashing into them.[/color]
        >
        > If the hash values are uniformly distributed (which is one characteristic[/color]
        of a[color=blue]
        > good hashing function), then any table size would be equivalent; i.e.,[/color]
        there[color=blue]
        > would be no "ideal" size. The reason that prime numbers are used for table
        > sizes is that many hashing functions are NOT good and sometimes generate[/color]
        hash[color=blue]
        > values that are multiples of one or more numbers. Those values, modulo a
        > relatively non-prime number will cluster. Those same values modulo a[/color]
        relatively[color=blue]
        > prime number will uniformly span the entire set of 0..N-1. In other words,
        > primality is used as insurance against bad hashing functions. You can[/color]
        persuade[color=blue]
        > yourself of this by modeling it in a spreadsheet.[/color]

        This makes sense. Can you write the barebones of a C++ program that can
        illustrate this? I guess you could write a spreadshet too, but the
        newsgroup only wants free text. I can test it if you want.

        We could call call a function hash_function, and we could use my simple
        function in the other post, maybe multiplying by the constant Kevin
        suggested.


        Comment

        • Siemel Naran

          #19
          Re: Hash table in C++? STL?

          "Kevin Goodsell" <usenet2.spamfr ee.fusion@never box.com> wrote in message
          news:Gnqfc.1129 9
          [color=blue][color=green]
          > > size_t hash(const char * s, int buckets) {
          > > size_t h = 0;
          > > for ( ; *s; ++s) h += *s;
          > > return h%bucket.
          > > }[/color]
          >
          > Kernighan & Pike's _The Practice of Programming_ mentioned a similar
          > hash function, but it included (if I recall correctly) multiplying the
          > total by a constant before adding in the next character value. This
          > constant has been experimentally determined to be effective for ASCII
          > text. I'm not sure if anyone really knows why it works. I don't recall
          > what the value was, but I'm sure someone can tell us.[/color]

          I did some research, and they said multiplying by the alphabet size (ie.
          number of valid chars).

          [color=blue][color=green]
          > > But we need hash only the first maxchars characters.[/color]
          >
          > Actually, I think K&P also talked about this, and suggested it was a bad
          > idea. This is from memory again, so I may be wrong, but I believe they
          > mentioned Java using a scheme like this originally. It turned out to not
          > work well, and was changed to use the entire string.[/color]

          Suppose our company coding standard is to prefix namespace and class names
          with ABC_. Thus ABC_Contact, ABC_Thing. Imagine if our hash function used
          just the first 3 chars!

          So I think the number of chars to use depends on the scenario, but it does
          intuitively make sense to me. We could perhaps use the last N chars, middle
          N chars, etc. A templated hashing function makes all this possible.


          Comment

          • Siemel Naran

            #20
            Re: Hash table in C++? STL?

            "pembed2003 " <pembed2003@yah oo.com> wrote in message
            [color=blue]
            > #define SIZE ((sizeof(int)) * (INT16_MAX) * (2))
            >
            > void find(int* one,int c1,int* two,int c2){
            > int i;
            > int* buffer = malloc(SIZE);
            > memset((void*)b uffer,0,SIZE);
            > for(i = 0; i < c1; i++)
            > buffer[one[i] + INT16_MAX] = 1;
            > for(i = 0; i < c2; i++)
            > if(buffer[two[i] + INT16_MAX])
            > printf("%d\n",t wo[i]);
            > free(buffer);
            > }[/color]

            It seems you don't use the first INT16_MAX chars of buffer, or maybe I'm
            reading something wrong.
            [color=blue]
            > int main(int argc,char** argv){
            > int a[] = {1,3,45,6,4,-1,8};
            > int b[] = {4,2,4,222,8,45 ,-1};
            > find(a,7,b,7);
            > }
            >
            > this gives me a O(n) algr. but seems to be wasting a lot of memory so
            > I am thinking using a hashtable instead. am i in the right track?
            > Thanks![/color]

            If the array size 'a' or 'one' is small then the O(N^2) algorithm is fine.
            Either loop over the chars in 'one' or 'two', then over the chars in the
            other loop. If you want to use STL, you can use std::count_if and supply a
            binary predicate that employs std::find (the binary predicate will have a
            constructor that takes the array 'one' as an argument).

            Note you can also sort the elements in 'one' and 'two', then loop over both
            containers simultaneously. Running time O(lg(N)) for the sort if using
            std::sort which normally uses quicksort, and O(N) for the scan. I don't
            think the standard provides such an algorithm though.

            You could sort the array in O(N) time using radix sort (but it will use so
            much memory, which is what you want to avoid), or some other specialized
            algorithm. You could also sort only 'one' and loop over the chars in 'two',
            but use binary_search to see if a char in 'one' exists in 'two'.

            If the array size is big, what difference will a hashtable make? You're
            using the hashtable to store elements visited, right? If the hashtable
            bucket size is small then you'll have lots of collisions (lots of elements
            mapping to the same element) and looking one element could be linear time
            (if the collided elements are stored in a vector). If the hashtable bucket
            size is big then you won't have lots of collisions, but then might as well
            use your original solution.


            Comment

            • John Harrison

              #21
              Re: Hash table in C++? STL?

              >[color=blue]
              > Thanks for everyone's help so it looks like map is the standard and
              > should be available for any platform? But hash_map is not standard.
              >[/color]

              Right.
              [color=blue]
              > The reason I am asking for hash table is because I need to write a
              > function that finds the overlap of two integer array. I am currently
              > doing:
              >
              > #include<stdlib .h>
              > #include<string .h>
              > #include<sys/int_limits.h>
              >
              > #define SIZE ((sizeof(int)) * (INT16_MAX) * (2))
              >
              > void find(int* one,int c1,int* two,int c2){
              > int i;
              > int* buffer = malloc(SIZE);
              > memset((void*)b uffer,0,SIZE);
              > for(i = 0; i < c1; i++)
              > buffer[one[i] + INT16_MAX] = 1;
              > for(i = 0; i < c2; i++)
              > if(buffer[two[i] + INT16_MAX])
              > printf("%d\n",t wo[i]);
              > free(buffer);
              > }
              >
              > int main(int argc,char** argv){
              > int a[] = {1,3,45,6,4,-1,8};
              > int b[] = {4,2,4,222,8,45 ,-1};
              > find(a,7,b,7);
              > }
              >
              > this gives me a O(n) algr. but seems to be wasting a lot of memory so
              > I am thinking using a hashtable instead. am i in the right track?
              > Thanks![/color]

              I think the standard answer would be to use a set. This is untested code.

              #include <set>
              #include <algorithm>
              #include <iterator>
              #include <iostream>

              void find(int* one,int c1,int* two,int c2)
              {
              std::set<int> common;
              common.insert(o ne, one + c1);
              common.insert(t wo, two + c2);
              // output common elements to cout
              std::copy(commo n.begin(), common.end(),
              std::ostream_it erator<int>(std ::cout, "\n"));
              }

              That is an O(N lg N) algorithm.

              john


              Comment

              • Kevin Goodsell

                #22
                Re: Hash table in C++? STL?

                Siemel Naran wrote:[color=blue]
                > "Kevin Goodsell" <usenet2.spamfr ee.fusion@never box.com> wrote in message
                > news:Gnqfc.1129 9
                >[color=green]
                >>
                >>Kernighan & Pike's _The Practice of Programming_ mentioned a similar
                >>hash function, but it included (if I recall correctly) multiplying the
                >>total by a constant before adding in the next character value. This
                >>constant has been experimentally determined to be effective for ASCII
                >>text. I'm not sure if anyone really knows why it works. I don't recall
                >>what the value was, but I'm sure someone can tell us.[/color]
                >
                >
                > I did some research, and they said multiplying by the alphabet size (ie.
                > number of valid chars).[/color]

                I don't recall that. The particular version I was talking about has been
                discussed from time to time in the newsgroups, and I found one such
                discussion here:



                If I did that right, the link should take you to Richard Heathfield's
                reply to the thread's original poster.

                They mention a few points. First, K&R2 contains an almost identical
                hashing function on page 144 using 31 as the constant. Second, the
                values suggested by K&P were apparently 31 and 37. Finally, Chris Torek
                suggests 33, a value he got from James Gosling. During widespread
                experimentation for the Berkley DB library, it did better on average
                than 31.

                -Kevin
                --
                My email address is valid, but changes periodically.
                To contact me please use the address from a recent posting.

                Comment

                • pembed2003

                  #23
                  Re: Hash table in C++? STL?

                  "John Harrison" <john_andronicu s@hotmail.com> wrote in message news:<c5l5pa$2t 12k$1@ID-196037.news.uni-berlin.de>...[color=blue][color=green]
                  > >
                  > > Hi John,
                  > > Can you point me to the documentation where I can find the map and
                  > > hash_map data structure? Thanks for the help![/color]
                  >
                  > http://www.dinkumware.com/refxcpp.html
                  >
                  > but remember because hasp_map is non standard the description here might
                  > differ from the implementation you have (if you have one at all).
                  >
                  > john[/color]

                  I read the documentation but I don't fully understand it. I don't want
                  to sound like a dump axx but so far, I have tried:

                  #include<map>
                  #include<string >

                  using namespace std;

                  int main(int argc,char** argv){
                  std::map<std::s tring,int>m();
                  m.insert("hello ",1); // doesn't compile? why?
                  return 0;
                  }

                  I am sure I did something wrong maybe even in the constructor. What I
                  mean to do is:

                  * m is a map whoes key is a std::string and values will be of type int
                  * insert the string "hello" into m with value 1.

                  I am trying to read the man page with 'man map.h' but nothing show up.
                  Where is the man page and how can I find map.h if I want to look at
                  it?

                  Can someone tell me what's wrong? Can you post a short working example
                  so I can learn from it? Thanks!

                  Comment

                  • Kevin Goodsell

                    #24
                    Re: Hash table in C++? STL?

                    pembed2003 wrote:
                    [color=blue]
                    >
                    > I read the documentation but I don't fully understand it. I don't want
                    > to sound like a dump axx but so far, I have tried:
                    >
                    > #include<map>
                    > #include<string >
                    >
                    > using namespace std;
                    >
                    > int main(int argc,char** argv){
                    > std::map<std::s tring,int>m();[/color]

                    I don't think you really wanted to declare a function called 'm' that
                    takes no arguments and returns a std::map. Lose the parentheses.
                    [color=blue]
                    > m.insert("hello ",1); // doesn't compile? why?[/color]

                    I don't believe there is a form of map::insert that takes two arguments.
                    Try one of these:

                    m["hello"] = 1;

                    OR

                    m.insert(std::m ap<std::string, int>::value_typ e("hello", 1));

                    Obviously I'd recommend the former for this case. When you do need the
                    insert() function it might help to have a typedef handy (in fact, this
                    is nice in many cases):

                    typedef std::map<std::s tring, int> Map;
                    Map m;
                    m.insert(Map::v alue_type("hell o", 1));
                    [color=blue]
                    > return 0;
                    > }
                    >
                    > I am sure I did something wrong maybe even in the constructor. What I
                    > mean to do is:
                    >
                    > * m is a map whoes key is a std::string and values will be of type int
                    > * insert the string "hello" into m with value 1.
                    >
                    > I am trying to read the man page with 'man map.h' but nothing show up.
                    > Where is the man page and how can I find map.h if I want to look at
                    > it?[/color]

                    1) 'man pages' are not defined by the standard. If you have them, the
                    names, locations, commands, etc. depend on your system.

                    2) Why would it be map.h? The header is <map>. Typically the file name
                    is the same as the name in the brackets, i.e. 'map'. This is not
                    required of course, but it's usually the case.

                    3) Locations of headers (if they exist as files -- which they typically
                    do) depend on your implementation. Consult the documentation or a group
                    that discusses your system and/or compiler.

                    -Kevin
                    --
                    My email address is valid, but changes periodically.
                    To contact me please use the address from a recent posting.

                    Comment

                    • Kevin Goodsell

                      #25
                      Re: Hash table in C++? STL?

                      Kevin Goodsell wrote:
                      [color=blue]
                      >
                      > I don't believe there is a form of map::insert that takes two arguments.[/color]

                      Come to think of it, that's wrong. There is a form that takes a "hint"
                      argument, for example. But there is not one that takes the key and value
                      as separate arguments, I'm fairly sure.

                      -Kevin
                      --
                      My email address is valid, but changes periodically.
                      To contact me please use the address from a recent posting.

                      Comment

                      • Jerry Coffin

                        #26
                        Re: Hash table in C++? STL?

                        "Siemel Naran" <SiemelNaran@RE MOVE.att.net> wrote in message news:<A6pfc.323 37$i74.709706@b gtnsc04-news.ops.worldn et.att.net>...

                        [ ... ]
                        [color=blue]
                        > size_t hash(const char * s, int buckets) {
                        > size_t h = 0;
                        > for ( ; *s; ++s) h += *s;
                        > return h%bucket.
                        > }
                        >
                        > But we need hash only the first maxchars characters.[/color]

                        As hash functions go, this is better than many, but less than ideal
                        for general use. Long strings might produce values larger than a
                        size_t can represent, producing values that vary from one
                        implementation to another, while short strings can only produce quite
                        small values (e.g. with 8-bit characters, a 3-character string can
                        only produce a value up to 765 and typical English text will only
                        produce values in a MUCH more restricted range). As one final caveat,
                        since it uses plain char, which might be signed, it could produce
                        negative hash values, which can often cause problems.

                        One way of dealing with this is something like this:

                        // warning: untested code.

                        size_t hash(unsigned char const *key, size_t buckets) {

                        const int bits = 32; // assume 32 bits is enough for the hash
                        value.
                        size_t h = 0x55555555; // alternating 0's and 1's.
                        size_t carry;
                        size_t shift = bits - CHAR_BIT;
                        size_t mask = ((1 << CHAR_BIT) -1) << shift;

                        while (*key) {
                        h ^= *key++;
                        carry = (h & mask) >> shift;
                        h <<= CHAR_BIT;
                        h |= carry;
                        }
                        return h % buckets;
                        }

                        The left-shift is to propagate the effects of the input characters
                        across the entire hash value more quickly. The right-shift (producing
                        an N-bit rotate) keeps from losing the effects of the characters from
                        early in the string.

                        The result is that every bit in the input string has an effect on the
                        output, and the effects of the characters get distributed throughout
                        the output fairly quickly.

                        Results are undoubtedly open to variation, but most of the
                        "improvemen ts" I've tried to this scheme have taken more time than
                        they saved.
                        [color=blue]
                        > The hash may increase numbuckets dynamically as it grows, so numbuckets is a
                        > parameter to operator(). But the number of chars to hash is a constant, and
                        > should not change as we increase or decrease the number of buckets, so it is
                        > argument to the constructor and gets stored in the class.[/color]

                        In typical use, you normally want to hash the entire key.

                        Comment

                        • pembed2003

                          #27
                          Re: Hash table in C++? STL?

                          "Siemel Naran" <SiemelNaran@RE MOVE.att.net> wrote in message news:<wfCfc.345 40$i74.784430@b gtnsc04-news.ops.worldn et.att.net>...[color=blue]
                          > "pembed2003 " <pembed2003@yah oo.com> wrote in message
                          >[color=green]
                          > > #define SIZE ((sizeof(int)) * (INT16_MAX) * (2))
                          > >
                          > > void find(int* one,int c1,int* two,int c2){
                          > > int i;
                          > > int* buffer = malloc(SIZE);
                          > > memset((void*)b uffer,0,SIZE);
                          > > for(i = 0; i < c1; i++)
                          > > buffer[one[i] + INT16_MAX] = 1;
                          > > for(i = 0; i < c2; i++)
                          > > if(buffer[two[i] + INT16_MAX])
                          > > printf("%d\n",t wo[i]);
                          > > free(buffer);
                          > > }[/color]
                          >
                          > It seems you don't use the first INT16_MAX chars of buffer, or maybe I'm
                          > reading something wrong.
                          >[color=green]
                          > > int main(int argc,char** argv){
                          > > int a[] = {1,3,45,6,4,-1,8};
                          > > int b[] = {4,2,4,222,8,45 ,-1};
                          > > find(a,7,b,7);
                          > > }
                          > >
                          > > this gives me a O(n) algr. but seems to be wasting a lot of memory so
                          > > I am thinking using a hashtable instead. am i in the right track?
                          > > Thanks![/color]
                          >
                          > If the array size 'a' or 'one' is small then the O(N^2) algorithm is fine.
                          >[/color]

                          Is my solution a O(N^2) solution? I think it's a O(N) right?
                          Becasue...

                          Assume array a and b both have N elements, the program loops through
                          both array exactly once, so we have:

                          for(i = 0; i < N; i++) // O(N)
                          array_one[i] = 1;
                          for(i = 0; i < N; i++) // O(N)
                          does array_two[i] in array_one[i]

                          N + N = 2N

                          which is a O(N)
                          [color=blue]
                          > Either loop over the chars in 'one' or 'two', then over the chars in the
                          > other loop. If you want to use STL, you can use std::count_if and supply a
                          > binary predicate that employs std::find (the binary predicate will have a
                          > constructor that takes the array 'one' as an argument).
                          >[/color]

                          Will a binary search beat a hash lookup? I know this has no exact
                          answer but generally speaking will a binary search be faster than a
                          hash lookup?
                          [color=blue]
                          > Note you can also sort the elements in 'one' and 'two', then loop over both
                          > containers simultaneously. Running time O(lg(N)) for the sort if using
                          > std::sort which normally uses quicksort, and O(N) for the scan. I don't
                          > think the standard provides such an algorithm though.[/color]

                          But this solution requires:

                          1. Sort one array with best case O(N*log N) using a quicksort
                          2. Loop through another array lookup for dups which is a O(N)

                          I think it goes like:

                          sort(array_one) ; // O(N*log N)
                          for(i = 0; i < N; i++) // O(N)
                          look for array_two[i] in array_one

                          which is a O(N*log N) algr. It doesn't look like it will beat my O(N)
                          algr. Am I right?

                          After I discovered map, I am actually doing:

                          void find(int* a,int c1,int* b,int c2){
                          std::map<int,in t> m;
                          for(int i = 0; i < c1; i++)
                          m[a[i]] = 1;
                          for(int j = 0; j < c2; j++)
                          if(m[b[j]])
                          std::cout<<b[j]<<std::endl;
                          }

                          What do you think about that? Thanks!

                          Comment

                          • Siemel Naran

                            #28
                            Re: Hash table in C++? STL?

                            "Kevin Goodsell" <usenet2.spamfr ee.fusion@never box.com> wrote in message
                            news:C%
                            [color=blue]
                            > I don't believe there is a form of map::insert that takes two arguments.
                            >
                            > Come to think of it, that's wrong. There is a form that takes a "hint"
                            > argument, for example. But there is not one that takes the key and value
                            > as separate arguments, I'm fairly sure.[/color]

                            Right. There's also another 2 argument insert, namely map::insert(Ite r
                            begin, Iter end) where Iter is a template argument. It's basically batch
                            insert. If the range is sorted then the running time of the algorithm is
                            O(N). If the range is not sorted then the running time is O(N*lg(N)). So
                            this is a really smart function that detects if your range is sorted! I
                            think it probably calls the 2 argument insert where the first argument is a
                            hint.


                            Comment

                            • Siemel Naran

                              #29
                              Re: Hash table in C++? STL?

                              "pembed2003 " <pembed2003@yah oo.com> wrote in message[color=blue]
                              > "Siemel Naran" <SiemelNaran@RE MOVE.att.net> wrote in message[/color]
                              news:<wfCfc.345 40
                              [color=blue][color=green][color=darkred]
                              > > > #define SIZE ((sizeof(int)) * (INT16_MAX) * (2))[/color][/color][/color]

                              Prefer to use const size_t instead of #define.
                              [color=blue][color=green][color=darkred]
                              > > > void find(int* one,int c1,int* two,int c2){
                              > > > int i;
                              > > > int* buffer = malloc(SIZE);
                              > > > memset((void*)b uffer,0,SIZE);
                              > > > for(i = 0; i < c1; i++)
                              > > > buffer[one[i] + INT16_MAX] = 1;
                              > > > for(i = 0; i < c2; i++)
                              > > > if(buffer[two[i] + INT16_MAX])
                              > > > printf("%d\n",t wo[i]);
                              > > > free(buffer);
                              > > > }[/color][/color][/color]
                              [color=blue][color=green][color=darkred]
                              > > > this gives me a O(n) algr. but seems to be wasting a lot of memory so
                              > > > I am thinking using a hashtable instead. am i in the right track?
                              > > > Thanks![/color]
                              > >
                              > > If the array size 'a' or 'one' is small then the O(N^2) algorithm is[/color][/color]
                              fine.[color=blue]
                              >
                              > Is my solution a O(N^2) solution? I think it's a O(N) right?[/color]

                              It is O(N). Sorry for confusing. The O(N^2) algorithm is when you scan all
                              the chars in 'one', and for each scan all the chars in 'two'.
                              [color=blue]
                              > Becasue...
                              >
                              > Assume array a and b both have N elements, the program loops through
                              > both array exactly once, so we have:
                              >
                              > for(i = 0; i < N; i++) // O(N)
                              > array_one[i] = 1;
                              > for(i = 0; i < N; i++) // O(N)
                              > does array_two[i] in array_one[i]
                              >
                              > N + N = 2N
                              >
                              > which is a O(N)[/color]

                              Right.
                              [color=blue][color=green]
                              > > Either loop over the chars in 'one' or 'two', then over the chars in the
                              > > other loop. If you want to use STL, you can use std::count_if and[/color][/color]
                              supply a[color=blue][color=green]
                              > > binary predicate that employs std::find (the binary predicate will have[/color][/color]
                              a[color=blue][color=green]
                              > > constructor that takes the array 'one' as an argument).[/color]
                              >
                              > Will a binary search beat a hash lookup? I know this has no exact
                              > answer but generally speaking will a binary search be faster than a
                              > hash lookup?[/color]

                              Guessing hash lookup is faster in general. It's worth running an experiment
                              to see for your case. Who knows, maybe the hash function is too complicated
                              to compute and the sorted array is faster. It's pretty easy to all methods
                              given you have an implementation of hash_map available. If you can, post
                              your results if you do the experiment, because it's nice to know. (State
                              the size of N, result using: original algorithm, hash_map, std::map,
                              binary_search, N^2 method, double sort).
                              [color=blue][color=green]
                              > > Note you can also sort the elements in 'one' and 'two', then loop over[/color][/color]
                              both[color=blue][color=green]
                              > > containers simultaneously. Running time O(lg(N)) for the sort if using
                              > > std::sort which normally uses quicksort, and O(N) for the scan. I don't
                              > > think the standard provides such an algorithm though.[/color]
                              >
                              > But this solution requires:
                              >
                              > 1. Sort one array with best case O(N*log N) using a quicksort
                              > 2. Loop through another array lookup for dups which is a O(N)
                              >
                              > I think it goes like:
                              >
                              > sort(array_one) ; // O(N*log N)
                              > for(i = 0; i < N; i++) // O(N)
                              > look for array_two[i] in array_one
                              >
                              > which is a O(N*log N) algr. It doesn't look like it will beat my O(N)
                              > algr. Am I right?[/color]

                              Yes, I think your O(N) algorithm is probably best, but it consumes lots of
                              space. Nevertheless, it is possible that the binary search method is faster
                              than hash_table.

                              Also, you can sort both arrays, and walk through each simultaneously in
                              linear time.
                              [color=blue]
                              > After I discovered map, I am actually doing:
                              >
                              > void find(int* a,int c1,int* b,int c2){
                              > std::map<int,in t> m;
                              > for(int i = 0; i < c1; i++)
                              > m[a[i]] = 1;
                              > for(int j = 0; j < c2; j++)
                              > if(m[b[j]])
                              > std::cout<<b[j]<<std::endl;
                              > }
                              >
                              > What do you think about that? Thanks![/color]

                              Looks right. You can use std::set too, which in principle should be a tad
                              faster. How fast does it run? Also note that a node in a map is pretty
                              expensive because you have to allocate space for one parent and two child
                              pointers, and if sizeof(int)==si zeof(node*), this is 4 times the memory per
                              node, plus the cost of maintaining the pointers. I'm willing to bet you a
                              $1 that sorting 'a' with std::sort and using std::binary_sea rch instead is
                              faster than map and uses less memory.

                              You can replace std::map with std::hash_map if you are using STL. Don't
                              know about the other implementations , probably something similar.

                              original algorithm:
                              What you originally wrote
                              Still not sure why array size is double, why the multiply by 2 at the end?
                              const size_t SIZE = ((sizeof(int)) * (INT16_MAX) * (2))

                              std::map:
                              The above algorithm

                              hash_map (probably largest:
                              The above algorithm with std::map replaced by hash_map. Maybe there is a
                              function to set the bucket size that you can try to tweak for optimal
                              performance?

                              binary_search:
                              Similar to the above algorithm.
                              void find(int* a,int c1,int* b,int c2){
                              std::sort(a, a+c1);
                              for(int j = 0; j < c2; j++)
                              if(std::binary_ search(a, a+c1, b[j]))
                              std::cout<<b[j]<<std::endl;
                              }

                              N^2 method:
                              void find(int* a,int c1,int* b,int c2){
                              for(int j = 0; j < c2; j++)
                              if(std::find(a, a+c1, b[j]))
                              std::cout<<b[j]<<std::endl;
                              }

                              double sort:
                              void find(int* a,int c1,int* b,int c2){
                              std::sort(a, a+c1);
                              std::sort(b, b+c1);
                              int i =0, j = 0;
                              while (true) {
                              if (i==c1 || j==c2) break;
                              const int ai = a[i];
                              const int bj = b[j];
                              if (ai < bj) ++i;
                              else if (ai > bj) ++j;
                              else {
                              std::cout<<bj<< std::endl;
                              ++i;
                              ++j;
                              }
                              }
                              }
                              It's 1:30am, so I don't claim the above is error free!



                              Comment

                              • Siemel Naran

                                #30
                                Re: Hash table in C++? STL?

                                "John Harrison" <john_andronicu s@hotmail.com> wrote in message
                                news:c5msbf$3he bc$1@ID-
                                [color=blue]
                                > void find(int* one,int c1,int* two,int c2)
                                > {
                                > std::set<int> common;
                                > common.insert(o ne, one + c1);
                                > common.insert(t wo, two + c2);
                                > // output common elements to cout
                                > std::copy(commo n.begin(), common.end(),
                                > std::ostream_it erator<int>(std ::cout, "\n"));
                                > }
                                >
                                > That is an O(N lg N) algorithm.[/color]

                                This is a good algorithm!


                                Comment

                                Working...