Calculate sha1 hash of a binary file

This topic is closed.
  • LaundroMat

    Calculate sha1 hash of a binary file

    Hi -

    I'm trying to calculate unique hash values for binary files,
    independent of their location and filename, and I was wondering
    whether I'm going in the right direction.

    Basically, the hash values are calculated thusly:

    f = open('binaryfile.bin')
    import hashlib
    h = hashlib.sha1()
    h.update(f.read())
    hash = h.hexdigest()
    f.close()

    A quick try-out shows that effectively, after renaming a file, its
    hash remains the same as it was before.

    I have my doubts however as to the usefulness of this. As f.read()
    does not seem to read until the end of the file (for a 3.3MB file only
    a string of 639 bytes is being returned, perhaps a 00-byte counts as
    EOF?), is there a high danger of collision?

    Are there better ways of calculating hash values of binary files?

    Thanks in advance,

    Mathieu
  • Tim Golden

    #2
    Re: Calculate sha1 hash of a binary file

    LaundroMat wrote:
    > I have my doubts however as to the usefulness of this. As f.read()
    > does not seem to read until the end of the file (for a 3.3MB file only
    > a string of 639 bytes is being returned, perhaps a 00-byte counts as
    > EOF?), is there a high danger of collision?

    Guess: you're running on Windows?

    You need to open binary files by using open("filename", "rb")
    to indicate that Windows shouldn't treat certain characters --
    specifically character 26 (Ctrl-Z), which text mode takes as
    end-of-file -- as special.
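
For reference, a minimal sketch of that fix, written for Python 3 (the thread predates it; in Python 3, text mode additionally tries to decode the bytes as text, so binary mode is needed there as well). The helper name `sha1_of_file` is made up for illustration and is not part of hashlib:

```python
import hashlib

def sha1_of_file(path):
    """Return the SHA-1 hex digest of the file's raw bytes."""
    # "rb" turns off newline and end-of-file translation, so every byte
    # (including 0x00 and 0x1A) reaches the hash unchanged.
    with open(path, "rb") as f:
        return hashlib.sha1(f.read()).hexdigest()
```

Because only the contents are hashed, renaming or moving the file does not change the result.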

    TJG


    • John Krukoff

      #3
      Re: Calculate sha1 hash of a binary file


      On Wed, 2008-08-06 at 12:31 -0700, LaundroMat wrote:
      > I have my doubts however as to the usefulness of this. As f.read()
      > does not seem to read until the end of the file (for a 3.3MB file only
      > a string of 639 bytes is being returned, perhaps a 00-byte counts as
      > EOF?), is there a high danger of collision?
      >
      > Are there better ways of calculating hash values of binary files?

      Looks like you're doing the right thing from here. file.read() with no
      size parameter will always return the whole file (for completeness, I'll
      mention that the documentation warns this is not the case if the file is
      in non-blocking mode, which you're not doing).

      Python never treats null bytes as special in strings, so no, you're not
      getting an early EOF due to that.

      I wouldn't worry about your hashing code; that looks fine. If I were you,
      I'd try to figure out what's going wrong with your file handles. I
      would suspect that wherever you saw your short read, you were
      likely not opening the file in the correct mode, or did not rewind the
      file (with file.seek(0)) after having previously read data from it.

      You'll be fine if you use the code above as is; there are no problems I
      can see with it.
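
The rewind point can be illustrated with a small sketch. It assumes Python 3 and uses an in-memory `io.BytesIO` object as a stand-in for a file opened in binary mode, so the byte counts are those of the made-up test data, not of the 3.3MB file in question:

```python
import hashlib
import io

data = b"\x00\x1a" * 1000      # 2000 bytes of stand-in file contents
f = io.BytesIO(data)           # behaves like a file opened with "rb"

f.read(100)                    # an earlier partial read moves the position
rest = f.read()                # returns only the remaining 1900 bytes
f.seek(0)                      # rewind to the start of the file
whole = f.read()               # now returns all 2000 bytes

short_digest = hashlib.sha1(rest).hexdigest()
full_digest = hashlib.sha1(whole).hexdigest()
```

Here `short_digest` and `full_digest` differ, which is the kind of mismatch a forgotten rewind would produce.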
      --
      John Krukoff <jkrukoff@ltgc.com>
      Land Title Guarantee Company


      • Nikolaus Rath

        #4
        Re: Calculate sha1 hash of a binary file

        LaundroMat <Laundro@gmail.com> writes:
        > Are there better ways of calculating hash values of binary files?

        Apart from opening the file in binary mode, I would consider reading
        the file and updating the hash in chunks of, e.g., 512 KB. The above
        code is probably going to perform horribly for sufficiently large
        files, since you try to read the entire file into memory.
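
A minimal sketch of the chunked approach, assuming Python 3. The function name `sha1_of_file_chunked` and the constant `CHUNK_SIZE` are made up for illustration; 512 KB is just the figure mentioned above, not a tuned value:

```python
import hashlib

CHUNK_SIZE = 512 * 1024  # 512 KB per read

def sha1_of_file_chunked(path, chunk_size=CHUNK_SIZE):
    """Return the SHA-1 hex digest of a file, reading it piece by piece."""
    h = hashlib.sha1()
    with open(path, "rb") as f:
        # f.read() returns b"" at end of file, which stops the iteration.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()
```

Feeding the data to `update()` in pieces gives the same digest as hashing it in one call, but only one chunk is held in memory at a time.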


        Best,

        -Nikolaus

        --
        »It is not worth an intelligent man's time to be in the majority.
        By definition, there are already enough people to do that.«
        -J.H. Hardy

        PGP fingerprint: 5B93 61F8 4EA2 E279 ABF6 02CF A9AD B7F8 AE4E 425C


        • LaundroMat

          #5
          Re: Calculate sha1 hash of a binary file

          Thanks all!


          • LaundroMat

            #6
            Re: Calculate sha1 hash of a binary file

            I did some testing, and calculating the hash value of a 1 GB file does
            take some time using this method.
            Would it be wise to calculate the hash value based on, say, only
            the first MB? Is there a much larger chance of collision this way? (I
            suppose not.) If it's helpful, the files would primarily be media
            (video) files.

            Thanks,

            Mathieu


            • Paul Rubin

              #7
              Re: Calculate sha1 hash of a binary file

              LaundroMat <Laundro@gmail.com> writes:
              > Would it be wise to calculate the hash value based on, say, only
              > the first MB? Is there a much larger chance of collision this way? (I
              > suppose not.) If it's helpful, the files would primarily be media
              > (video) files.

              The usual purpose of using this type of hash is to detect corruption
              and/or tampering. So you want to hash the whole file, not just part
              of it. If you're not worried about intentional tampering, md5 should
              be somewhat faster than sha, but there are some attacks against it
              and you shouldn't use it for high security applications where you
              want security against forgery. It should still have almost no chance
              of accidental collisions.
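
To make the whole-file point concrete, here is a small sketch, assuming Python 3. The helper `digest` and the two in-memory "files" are invented for illustration; they stand in for two videos that happen to share their first MB:

```python
import hashlib

MB = 1024 * 1024

def digest(data, algorithm="sha1", limit=None):
    """Hex digest of data, or of only its first `limit` bytes if given."""
    return hashlib.new(algorithm, data[:limit]).hexdigest()

# Two made-up byte strings: identical first MB, different afterwards.
header = b"\x00" * MB
file_a = header + b"scene one"
file_b = header + b"scene two"

same_prefix_hash = digest(file_a, limit=MB) == digest(file_b, limit=MB)  # True
same_full_sha1 = digest(file_a) == digest(file_b)                        # False
same_full_md5 = digest(file_a, "md5") == digest(file_b, "md5")           # False
```

Hashing only the first MB cannot tell the two apart, while either algorithm applied to the full contents can. How often real media files share a long identical prefix is not something this sketch measures.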
