General Hash Functions In Python

Collapse
This topic is closed.
X
X
 
  • Time
  • Show
Clear All
new posts
  • Arash Partow

    General Hash Functions In Python

    Hi all,

    I've ported various hash functions to python if anyone is interested:



    def RSHash(key):
    a = 378551
    b = 63689
    hash = 0
    for i in range(len(key)) :
    hash = hash * a + ord(key[i])
    a = a * b
    return (hash & 0x7FFFFFFF)


    def JSHash(key):
    hash = 1315423911
    for i in range(len(key)) :
    hash ^= ((hash << 5) + ord(key[i]) + (hash >2))
    return (hash & 0x7FFFFFFF)


    def PJWHash(key):
    BitsInUnsignedI nt = 4 * 8
    ThreeQuarters = long((BitsInUns ignedInt * 3) / 4)
    OneEighth = long(BitsInUnsi gnedInt / 8)
    HighBits = (0xFFFFFFFF) << (BitsInUnsigned Int - OneEighth)
    hash = 0
    test = 0

    for i in range(len(key)) :
    hash = (hash << OneEighth) + ord(key[i])
    test = hash & HighBits
    if test != 0:
    hash = (( hash ^ (test >ThreeQuarters) ) & (~HighBits));
    return (hash & 0x7FFFFFFF)


    def ELFHash(key):
    hash = 0
    x = 0
    for i in range(len(key)) :
    hash = (hash << 4) + ord(key[i])
    x = hash & 0xF0000000
    if x != 0:
    hash ^= (x >24)
    hash &= ~x
    return (hash & 0x7FFFFFFF)


    def BKDRHash(key):
    seed = 131 # 31 131 1313 13131 131313 etc..
    hash = 0
    for i in range(len(key)) :
    hash = (hash * seed) + ord(key[i])
    return (hash & 0x7FFFFFFF)


    def SDBMHash(key):
    hash = 0
    for i in range(len(key)) :
    hash = ord(key[i]) + (hash << 6) + (hash << 16) - hash;
    return (hash & 0x7FFFFFFF)


    def DJBHash(key):
    hash = 5381
    for i in range(len(key)) :
    hash = ((hash << 5) + hash) + ord(key[i])
    return (hash & 0x7FFFFFFF)


    def DEKHash(key):
    hash = len(key);
    for i in range(len(key)) :
    hash = ((hash << 5) ^ (hash >27)) ^ ord(key[i])
    return (hash & 0x7FFFFFFF)


    def APHash(key):
    hash = 0
    for i in range(len(key)) :
    if ((i & 1) == 0):
    hash ^= ((hash << 7) ^ ord(key[i]) ^ (hash >3))
    else:
    hash ^= (~((hash << 11) ^ ord(key[i]) ^ (hash >5)))
    return (hash & 0x7FFFFFFF)


    print RSHash ('abcdefghijklm nopqrstuvwxyz12 34567890')
    print JSHash ('abcdefghijklm nopqrstuvwxyz12 34567890')
    print PJWHash ('abcdefghijklm nopqrstuvwxyz12 34567890')
    print ELFHash ('abcdefghijklm nopqrstuvwxyz12 34567890')
    print BKDRHash('abcde fghijklmnopqrst uvwxyz123456789 0')
    print SDBMHash('abcde fghijklmnopqrst uvwxyz123456789 0')
    print DJBHash ('abcdefghijklm nopqrstuvwxyz12 34567890')
    print DEKHash ('abcdefghijklm nopqrstuvwxyz12 34567890')
    print APHash ('abcdefghijklm nopqrstuvwxyz12 34567890')





    Arash Partow
    _______________ _______________ _______________ ___________
    Be one who knows what they don't know,
    Instead of being one who knows not what they don't know,
    Thinking they know everything about all things.


  • Paul Rubin

    #2
    Re: General Hash Functions In Python

    "Arash Partow" <partow@gmail.c omwrites:
    I've ported various hash functions to python if anyone is interested:
    Are these useful for any particular interoperabilit y purposes?
    Otherwise I'd say just one such hash is plenty, or maybe don't even
    bother, since Python's built-in dicts are sufficient for most
    applications that call for hashing, and the cryptographic hashes are
    available for when you really care about uniformity of the hash
    output.
    for i in range(len(key)) :
    hash = hash * a + ord(key[i])
    a = a * b
    As a stylistic issue, I'd write this:

    for c in key:
    hash = hash * a + ord(c)
    a *= b

    and similarly for the other similar loops.

    Comment

    • Arash Partow

      #3
      Re: General Hash Functions In Python

      Hi Paul,

      For different data types different hash functions work
      better/worse aka fewer or more collisions.

      I believe the more choice people have and also the more
      ways people see how a particular thing can be done, then
      the easier it will be for them to come up with their own
      specific efficient solutions.

      That said, I believe at least one (most likely more) of
      the hash functions in the group above will most always work
      better (ala less collisions) than the standard default hash
      available in the built-in dict for any random set of strings.

      Please feel free to prove me wrong :)




      Arash Partow
      _______________ _______________ _______________ ___________
      Be one who knows what they don't know,
      Instead of being one who knows not what they don't know,
      Thinking they know everything about all things.



      Paul Rubin wrote:
      "Arash Partow" <partow@gmail.c omwrites:
      I've ported various hash functions to python if anyone is interested:
      >
      Are these useful for any particular interoperabilit y purposes?
      Otherwise I'd say just one such hash is plenty, or maybe don't even
      bother, since Python's built-in dicts are sufficient for most
      applications that call for hashing, and the cryptographic hashes are
      available for when you really care about uniformity of the hash
      output.
      >
      for i in range(len(key)) :
      hash = hash * a + ord(key[i])
      a = a * b
      >
      As a stylistic issue, I'd write this:
      >
      for c in key:
      hash = hash * a + ord(c)
      a *= b
      >
      and similarly for the other similar loops.

      Comment

      • Paul Rubin

        #4
        Re: General Hash Functions In Python

        "Arash Partow" <partow@gmail.c omwrites:
        For different data types different hash functions work
        better/worse aka fewer or more collisions.
        But you give no indication of which of those hashes works best for
        what kind of data. How is the user supposed to figure out which one
        to choose?

        Comment

        • Robert Kern

          #5
          Re: General Hash Functions In Python

          Arash Partow wrote:
          That said, I believe at least one (most likely more) of
          the hash functions in the group above will most always work
          better (ala less collisions) than the standard default hash
          available in the built-in dict for any random set of strings.
          >
          Please feel free to prove me wrong :)
          You have it backwards. You are the one making the positive assertion. You get to
          prove yourself correct.

          --
          Robert Kern

          "I have come to believe that the whole world is an enigma, a harmless enigma
          that is made terrible by our own mad attempt to interpret it as though it had
          an underlying truth."
          -- Umberto Eco

          Comment

          • Arash Partow

            #6
            Re: General Hash Functions In Python

            Trial and error - how else? :)

            I believe finding a perfect hash function for
            every kind and combination of data is a very
            very time consuming operation. Also there is
            the common case where the data is online
            (ie: stateless) that said it doesn't mean you
            can't make assumptions about the kind of data.



            Arash Partow
            _______________ _______________ _______________ ___________
            Be one who knows what they don't know,
            Instead of being one who knows not what they don't know,
            Thinking they know everything about all things.



            Paul Rubin wrote:
            "Arash Partow" <partow@gmail.c omwrites:
            For different data types different hash functions work
            better/worse aka fewer or more collisions.
            >
            But you give no indication of which of those hashes works best for
            what kind of data. How is the user supposed to figure out which one
            to choose?

            Comment

            • Arash Partow

              #7
              Re: General Hash Functions In Python

              That is true, but I'm not about to do something
              that might potentially prove my point wrong... :)



              Arash Partow
              _______________ _______________ _______________ ___________
              Be one who knows what they don't know,
              Instead of being one who knows not what they don't know,
              Thinking they know everything about all things.



              Robert Kern wrote:
              >
              You have it backwards. You are the one making the positive assertion. You get to
              prove yourself correct.
              >
              --
              Robert Kern
              >
              "I have come to believe that the whole world is an enigma, a harmless enigma
              that is made terrible by our own mad attempt to interpret it as though it had
              an underlying truth."
              -- Umberto Eco

              Comment

              • Stefan Behnel

                #8
                Re: General Hash Functions In Python

                Arash Partow wrote:
                I've ported various hash functions to python if anyone is interested:
                [snip]
                Ok, so if you think they are useful, what about writing up an article for the
                Python Cookbook that describes their usage and specific advantages/disadvantages?



                Stefan

                Comment

                • John Machin

                  #9
                  Re: General Hash Functions In Python

                  On 17/07/2006 5:52 PM, Arash Partow wrote:
                  Hi Paul,
                  >
                  For different data types different hash functions work
                  better/worse aka fewer or more collisions.
                  >
                  I believe the more choice people have and also the more
                  ways people see how a particular thing can be done, then
                  the easier it will be for them to come up with their own
                  specific efficient solutions.
                  Who is likely to bother? In timbot we trust. Have you read the comments
                  at the start of Objects/dictobject.c?
                  >
                  That said, I believe at least one (most likely more) of
                  the hash functions in the group above will most always work
                  better (ala less collisions) than the standard default hash
                  available in the built-in dict for any random set of strings.
                  [Aside: it's not "in the built-in dict". Any type which wants its
                  instances to be hashable defines its own hash method, one that suits the
                  type.]

                  This belief would be based on:
                  (a) actual testing by you
                  or (b) a refereed academic paper which did such tests on hash functions
                  (including the Python "standard default hash")
                  or (c) ...???

                  What is the Python "standard default hash", anyway? It is written in C.
                  Wouldn't it have been a good idea to provide a Python function for it,
                  so that people who are going to "come up with their own specific
                  efficient solutions" had something to compare with? Or perhaps they are
                  intended to "port" your functions back into C?

                  A few more questions:

                  Why have you truncated the answers to 31 bits?

                  Have you tested that these functions produce the same output (apart from
                  the 31-bit thing) as the originals that you worked from? The reason for
                  asking is that Python unlike C doesn't lose any bits in the <<
                  operation; if this is followed by a >you may be shifting some unwanted
                  1-bits back in again.

                  Talking about not losing bits: For your 36-byte example input, the
                  SDBMHash (with its << 16) is up to about 566 bits before you truncate it
                  to 31. A little over the top, perhaps. Maybe not the fastest way of
                  doing it.

                  What is the purpose of the calls to long() in PJWHash?

                  And the $64K question: What is the quintessential difference between
                  PJWHash and ELFHash?

                  Cheers,
                  John

                  Comment

                  • Arash Partow

                    #10
                    Re: General Hash Functions In Python

                    John Machin wrote:
                    Who is likely to bother? In timbot we trust. Have you read the comments
                    at the start of Objects/dictobject.c?
                    >
                    No I haven't probably wont be anytime soon, as far as time, well
                    people interested, as is how started my original port, would be more
                    than willing to try/assess the routines for sets of strings that they
                    wish to hash etc.. this site may help explain plus has some code
                    snippets that may help you understand what I mean.



                    [Aside: it's not "in the built-in dict". Any type which wants its
                    instances to be hashable defines its own hash method, one that suits the
                    type.]
                    >
                    This belief would be based on:
                    (a) actual testing by you
                    or (b) a refereed academic paper which did such tests on hash functions
                    (including the Python "standard default hash")
                    or (c) ...???
                    >
                    Experience has shown me that the belief than one "default" way of
                    hashing is generally not the optimal way for the overwhelming
                    majority of data, but then again....


                    What is the Python "standard default hash", anyway? It is written in C.
                    Wouldn't it have been a good idea to provide a Python function for it,
                    so that people who are going to "come up with their own specific
                    efficient solutions" had something to compare with? Or perhaps they are
                    intended to "port" your functions back into C?
                    >
                    The above link provides a very simple hash test framework, the
                    "standard default hash" can be placed in there and tested amongst
                    the other algorithms.

                    A few more questions:
                    >
                    Why have you truncated the answers to 31 bits?
                    >
                    algorithms were adapted from unsigned int, i should have removed that
                    final truncation in python.

                    Have you tested that these functions produce the same output (apart from
                    the 31-bit thing) as the originals that you worked from? The reason for
                    asking is that Python unlike C doesn't lose any bits in the <<
                    operation; if this is followed by a >you may be shifting some unwanted
                    1-bits back in again.
                    >
                    Some of them do others don't (not really important unless you are
                    trying to be compatible with other implementations which this is
                    not), I am aware of python not truncating/wrapping of values under
                    various operations, I believe its a nice little side effect from
                    python which gives more bits to play with as long as you don't
                    truncate them as I have.

                    Talking about not losing bits: For your 36-byte example input, the
                    SDBMHash (with its << 16) is up to about 566 bits before you truncate it
                    to 31. A little over the top, perhaps. Maybe not the fastest way of
                    doing it.
                    >
                    Possibly, do you have a better solution I'm very keen to learn...

                    What is the purpose of the calls to long() in PJWHash?
                    >
                    trying to cast to long, looking at it now its rather superfluous.

                    And the $64K question: What is the quintessential difference between
                    PJWHash and ELFHash?
                    >
                    Nothing, elf is essentially pjw, its just optimised for 32-bit systems
                    in that the calculation for th's etc are static where has pjw
                    required sizeof to calc the th's which i couldn't find a way of doing,
                    so i fudged it in the hope that maybe sometime in the future a work
                    around of sorts could be developed.

                    Again this was just a simple posting, I hoped to maybe gets some
                    comments and may present some ideas to people, I don't expect anyone
                    to drop everything and start using these, just thought it might be
                    something interesting for this NG. btw I'm not a very good python
                    programmer ATM.




                    Arash Partow
                    _______________ _______________ _______________ ___________
                    Be one who knows what they don't know,
                    Instead of being one who knows not what they don't know,
                    Thinking they know everything about all things.


                    Comment

                    • Arash Partow

                      #11
                      Re: General Hash Functions In Python

                      I would like to but I think my lack of language and time will
                      be a barrier. Also I don't believe there is much material there
                      to warrant a technical write up.



                      Arash Partow
                      _______________ _______________ _______________ ___________
                      Be one who knows what they don't know,
                      Instead of being one who knows not what they don't know,
                      Thinking they know everything about all things.



                      Stefan Behnel wrote:
                      Arash Partow wrote:
                      I've ported various hash functions to python if anyone is interested:
                      [snip]
                      >
                      Ok, so if you think they are useful, what about writing up an article for the
                      Python Cookbook that describes their usage and specific advantages/disadvantages?
                      >

                      >
                      Stefan

                      Comment

                      • John Machin

                        #12
                        Re: General Hash Functions In Python

                        On 17/07/2006 10:13 PM, Arash Partow wrote:
                        John Machin wrote:
                        >Who is likely to bother? In timbot we trust. Have you read the comments
                        >at the start of Objects/dictobject.c?
                        >>
                        No I haven't probably wont be anytime soon,
                        Perhaps you should, if you profess an interest in hashed lookup -- it
                        gives some interesting commentary on the second aspect: collision
                        handling. What matters is the *total* time to return an answer.
                        as far as time, well
                        people interested, as is how started my original port, would be more
                        than willing to try/assess the routines for sets of strings that they
                        wish to hash etc.
                        this site may help explain plus has some code
                        snippets that may help you understand what I mean.
                        >
                        http://www.partow.net/programming/ha...ons/index.html
                        That JSHAsh allegedly written by "Justin Sobel": by coincidence, there's
                        a Melbourne academic named Justin Zobel who has done something amazingly
                        similar:



                        Searching for "Justin Sobel" did lead me to a Russian website which
                        apart from repeating your typo/reado/whatevero did propose (and test)
                        some more hash functions.



                        In particular look at the "rot13" function which was right up near the
                        front as far as number of collisions goes, and which would appear (my
                        guess based on reading the source) to be very fast (with the right
                        compiler (e.g. gcc 4)).
                        >
                        >
                        >A few more questions:
                        >>
                        >Have you tested that these functions produce the same output (apart from
                        >the 31-bit thing) as the originals that you worked from? The reason for
                        >asking is that Python unlike C doesn't lose any bits in the <<
                        >operation; if this is followed by a >you may be shifting some unwanted
                        > 1-bits back in again.
                        >>
                        >
                        Some of them do others don't (not really important unless you are
                        trying to be compatible with other implementations which this is
                        not),
                        I would have thought it important especially in the case of well-known
                        functions whose properties have been discussed in the literature that
                        you should not publish a version that gives a different answer, without
                        noting that fact prominently.
                        I am aware of python not truncating/wrapping of values under
                        various operations, I believe its a nice little side effect from
                        python which gives more bits to play with as long as you don't
                        truncate them as I have.
                        >
                        >
                        >Talking about not losing bits: For your 36-byte example input, the
                        >SDBMHash (with its << 16) is up to about 566 bits before you truncate it
                        >to 31. A little over the top, perhaps. Maybe not the fastest way of
                        >doing it.
                        >>
                        Possibly, do you have a better solution I'm very keen to learn...
                        You can't avoid using Python longs if you want to simulate unsigned
                        32-bit arithmetic. However judicious truncation can be used to stop the
                        longs becoming longer and slower. General rules:
                        1. Avoid exceeding 32 bits where possible. E.g. instead of
                        hash <<= 8
                        do
                        hash = (hash & 0xFFFFFF) << 8
                        2. Where unavoidable (e.g. hash *= constant), use hash &= 0xFFFFFFFF to
                        chop back to 32 bits, once per iteration.
                        >
                        >
                        >What is the purpose of the calls to long() in PJWHash?
                        >>
                        trying to cast to long, looking at it now its rather superfluous.
                        >
                        >
                        >And the $64K question: What is the quintessential difference between
                        >PJWHash and ELFHash?
                        >>
                        Nothing, elf is essentially pjw, its just optimised for 32-bit systems
                        in that the calculation for th's etc are static where has pjw
                        required sizeof to calc the th's
                        You've found a C compiler where sizeof(unsigned int) is not static i.e.
                        calculated by the compiler at compile time???
                        which i couldn't find a way of doing,
                        so i fudged it in the hope that maybe sometime in the future a work
                        around of sorts could be developed.
                        Google is a wonderful thing:


                        """
                        Thanks to Josh Bloch (jbloch@eng.sun .com) who also informed me about
                        another fault that is found in Aho, Sethi and Ullman's book: The line
                        with h ^= (g >28) now replaces the original h ^= (g >24). According
                        to a correspondence of Josh Bloch with Ravi Sethi this correction will
                        be made in the next edition of the book. Including these two changes
                        this hash function is now comparable to Vo's, Torek's and WAIS's hash
                        functions.
                        """

                        (1) Whoops! (2) Vo? Torek? WAIS? Could these be possible additions to
                        your website?


                        """
                        Peter J. Weinberger hash function; see e.g. 21st Century Compilers, by
                        Alfred V. Aho, Ravi Sethi, Monica Lam, Jeffrey D. Ullman, ISBN 0321131436.

                        Hash unsigned X into H, using the temporary variable G. G and H are
                        unsigned variables; X may be an expression. G is nonzero e.g. 92% of
                        time, so a conditional expression would be slower. As noted by Josh
                        Bloch, 28 is the correct replacement for the frequent misprint 24.

                        #define HASH(G,H,X) ( H <<= 4, H += (X), G = H & 0xf0000000, H ^= G >>
                        28, H ^= G )
                        """

                        Cheers,
                        John

                        Comment

                        • Arash Partow

                          #13
                          Re: General Hash Functions In Python

                          John Machin wrote:
                          Perhaps you should, if you profess an interest in hashed lookup -- it
                          gives some interesting commentary on the second aspect: collision
                          handling. What matters is the *total* time to return an answer.
                          >
                          Time or algorithm complexity is merely one aspect of a hash function
                          design. there are many others.
                          as far as time, well
                          people interested, as is how started my original port, would be more
                          than willing to try/assess the routines for sets of strings that they
                          wish to hash etc.
                          >
                          this site may help explain plus has some code
                          snippets that may help you understand what I mean.

                          http://www.partow.net/programming/ha...ons/index.html
                          >
                          That JSHAsh allegedly written by "Justin Sobel": by coincidence, there's
                          a Melbourne academic named Justin Zobel who has done something amazingly
                          similar:
                          >

                          >
                          Same guy, he was a lecturer during my uni days. As far as his surname
                          that is another issue altogether.

                          Searching for "Justin Sobel" did lead me to a Russian website which
                          apart from repeating your typo/reado/whatevero did propose (and test)
                          some more hash functions.
                          >

                          >
                          In particular look at the "rot13" function which was right up near the
                          front as far as number of collisions goes, and which would appear (my
                          guess based on reading the source) to be very fast (with the right
                          compiler (e.g. gcc 4)).
                          >
                          I've already spoken to the guy that did those measurements, there
                          are some flaws in the way he represents data which could lead one
                          to make inaccurate assumptions about the "quality" of the hash
                          functions. One of the many issues that i took up with him is that
                          he only used 1 set of data instead of having multiple sets and
                          aggregating the results. Whether or not he decides to re-do is
                          analysis is another issue altogether. TAOCP has a nice section
                          on how potential analysis can be done.
                          I would have thought it important especially in the case of well-known
                          functions whose properties have been discussed in the literature that
                          you should not publish a version that gives a different answer, without
                          noting that fact prominently.
                          >
                          True, the c versions work fine, i guess the python versions require
                          a bit more work. feel free to re-post modified versions :)

                          You can't avoid using Python longs if you want to simulate unsigned
                          32-bit arithmetic. However judicious truncation can be used to stop the
                          longs becoming longer and slower. General rules:
                          1. Avoid exceeding 32 bits where possible. E.g. instead of
                          hash <<= 8
                          do
                          hash = (hash & 0xFFFFFF) << 8
                          2. Where unavoidable (e.g. hash *= constant), use hash &= 0xFFFFFFFF to
                          chop back to 32 bits, once per iteration.
                          >
                          >
                          That is something I thought about, but I also considered, as
                          I mentioned before, the extra bits. The more bits you have to
                          avalanche with - (in a very general and oversimplified way)
                          the better your hash function "can" be.
                          Thanks to Josh Bloch (jbloch@eng.sun .com) who also informed me about
                          another fault that is found in Aho, Sethi and Ullman's book: The line
                          with h ^= (g >28) now replaces the original h ^= (g >24). According
                          to a correspondence of Josh Bloch with Ravi Sethi this correction will
                          be made in the next edition of the book. Including these two changes
                          this hash function is now comparable to Vo's, Torek's and WAIS's hash
                          functions.
                          """
                          >
                          (1) Whoops! (2) Vo? Torek? WAIS? Could these be possible additions to
                          your website?
                          >
                          Indeed! I had read about WAIS a long time ago, I'll be putting them up
                          very soon, thanks for the input.


                          """
                          Peter J. Weinberger hash function; see e.g. 21st Century Compilers, by
                          Alfred V. Aho, Ravi Sethi, Monica Lam, Jeffrey D. Ullman, ISBN 0321131436.
                          >
                          Hash unsigned X into H, using the temporary variable G. G and H are
                          unsigned variables; X may be an expression. G is nonzero e.g. 92% of
                          time, so a conditional expression would be slower. As noted by Josh
                          Bloch, 28 is the correct replacement for the frequent misprint 24.
                          >
                          #define HASH(G,H,X) ( H <<= 4, H += (X), G = H & 0xf0000000, H ^= G >>
                          28, H ^= G )
                          """
                          I'm planning on adding various integer hash functions as well, just
                          haven't had the time. the above seems like one so I'll give it a go.





                          Arash Partow
                          _______________ _______________ _______________ ___________
                          Be one who knows what they don't know,
                          Instead of being one who knows not what they don't know,
                          Thinking they know everything about all things.


                          Comment

                          Working...