Re: Hash table in C++? STL?
Claudio Puviani wrote:
[color=blue]
> "Kevin Goodsell" <usenet2.spamfr ee.fusion@never box.com> wrote
>[color=green]
>>So why the prime table size? I can't see an obvious reason.
>>If hash values are approximately uniformly distributed between
>>0 and N, I'd think that ideal table sizes would be numbers that
>>evenly divide N, because any remainder would split the table
>>into two areas with different probabilities of any particular
>>value hashing into them.[/color]
>
>
> If the hash values are uniformly distributed (which is one characteristic of a
> good hashing function), then any table size would be equivalent; i.e., there
> would be no "ideal" size. The reason that prime numbers are used for table
> sizes is that many hashing functions are NOT good and sometimes generate hash
> values that are multiples of one or more numbers. Those values, modulo a
> relatively non-prime number will cluster. Those same values modulo a relatively
> prime number will uniformly span the entire set of 0..N-1. In other words,
> primality is used as insurance against bad hashing functions. You can persuade
> yourself of this by modeling it in a spreadsheet.
>[/color]
I thought it might be something like that. Although, I disagree that
there's no "ideal" size with uniformly distributed hash values. I
definitely think some sizes are better than others (though it may be a
small difference). As a trivial example, suppose the range of hash
values is 0 to 75, and the table size is 50. Hash values from 0 to 25,
and also from 51 to 75, end up in the lower 25 slots, while only those
hash values in the range 26 to 50 end up in the upper 25 slots. As a
result, the probability of a key hashing to the first 25 slots is twice
as high as the probability of hashing to the last 25.
I doubt this makes any difference in practice, but it is what I had in
mind in my previous post.
-Kevin
--
My email address is valid, but changes periodically.
To contact me please use the address from a recent posting.
Claudio Puviani wrote:
[color=blue]
> "Kevin Goodsell" <usenet2.spamfr ee.fusion@never box.com> wrote
>[color=green]
>>So why the prime table size? I can't see an obvious reason.
>>If hash values are approximately uniformly distributed between
>>0 and N, I'd think that ideal table sizes would be numbers that
>>evenly divide N, because any remainder would split the table
>>into two areas with different probabilities of any particular
>>value hashing into them.[/color]
>
>
> If the hash values are uniformly distributed (which is one characteristic of a
> good hashing function), then any table size would be equivalent; i.e., there
> would be no "ideal" size. The reason that prime numbers are used for table
> sizes is that many hashing functions are NOT good and sometimes generate hash
> values that are multiples of one or more numbers. Those values, modulo a
> relatively non-prime number will cluster. Those same values modulo a relatively
> prime number will uniformly span the entire set of 0..N-1. In other words,
> primality is used as insurance against bad hashing functions. You can persuade
> yourself of this by modeling it in a spreadsheet.
>[/color]
I thought it might be something like that. Although, I disagree that
there's no "ideal" size with uniformly distributed hash values. I
definitely think some sizes are better than others (though it may be a
small difference). As a trivial example, suppose the range of hash
values is 0 to 75, and the table size is 50. Hash values from 0 to 25,
and also from 51 to 75, end up in the lower 25 slots, while only those
hash values in the range 26 to 50 end up in the upper 25 slots. As a
result, the probability of a key hashing to the first 25 slots is twice
as high as the probability of hashing to the last 25.
I doubt this makes any difference in practice, but it is what I had in
mind in my previous post.
-Kevin
--
My email address is valid, but changes periodically.
To contact me please use the address from a recent posting.
Comment