Efficient MD5 (or similar) hashes

This topic is closed.
  • Kamus of Kadizhar

    Efficient MD5 (or similar) hashes

    Another newbie question:

    I have large files I'm dealing with, some 600 MB to 1.2 GB in size, over a
    slow network. Transferring one of these files can take 40 minutes to an
    hour.

    I want to check the integrity of the files after transfer. I can check
    the obvious - date, file size - quickly, but what if I want an MD5 hash?

    From reading the python docs, md5 reads the entire file as a string.
    That's not practical on a 1 GB file that's network mounted.

    The only thing I can think of is to set up an inetd daemon on the server
    that will spit out the md5 hash if given the file path/name.

    Any other ideas?

    -Kamus

    --
    What am I on?
    I'm on my bike, o__
    6 hours a day, busting my ass. ,>/'_
    What are you on? --Lance Armstrong (_)\(_)

  • Erik Max Francis

    #2
    Re: Efficient MD5 (or similar) hashes

    Kamus of Kadizhar wrote:
    > I want to check the integrity of the files after transfer. I can check
    > the obvious - date, file size - quickly, but what if I want an MD5 hash?
    >
    > From reading the python docs, md5 reads the entire file as a string.
    > That's not practical on a 1 GB file that's network mounted.

    Python's md5 module just accepts updating strings; the driving code
    certainly doesn't have to read the file all in at once. Just read it in
    a chunk at a time:

    hasher = md5.new()
    while True:
        chunk = theFile.read(CHUNK_SIZE)
        if not chunk:
            break
        hasher.update(chunk)
    theHash = hasher.hexdigest()
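
In current Python the standalone md5 module is gone and hashlib provides the same update-style interface, so the chunked loop above might look roughly like the sketch below. The function name md5_of_file and the 1 MiB default chunk size are assumptions made for illustration; they do not come from the original post.

```python
import hashlib


def md5_of_file(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of the file at `path`, reading it in chunks."""
    hasher = hashlib.md5()
    # Binary mode, so the bytes hashed are exactly the bytes stored in the file.
    with open(path, "rb") as the_file:
        while True:
            chunk = the_file.read(chunk_size)
            if not chunk:  # an empty bytes object signals end of file
                break
            hasher.update(chunk)
    return hasher.hexdigest()
```

Only one chunk is held in memory at a time, so memory use stays flat regardless of file size. Every byte still has to be read from wherever the file lives.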

    --
    __ Erik Max Francis && max@alcyone.com && http://www.alcyone.com/max/
    / \ San Jose, CA, USA && 37 20 N 121 53 W && &tSftDotIotE
    \__/ Be able to be alone. Lose not the advantage of solitude.
    -- Sir Thomas Browne


    • Bengt Richter

      #3
      Re: Efficient MD5 (or similar) hashes

      On Sun, 07 Dec 2003 19:49:58 -0500, Kamus of Kadizhar <yan@NsOeSiPnAeMr.com> wrote:
      >Another newbie question:
      >
      >I have large files I'm dealing with. Some 600MB -1.2 GB in size, over a
      >slow network. Transfer of one of these files can take 40 minutes or an
      >hour.
      >
      >I want to check the integrity of the files after transfer. I can check
      >the obvious - date, file size - quickly, but what if I want an MD5 hash?
      >
      > From reading the python docs, md5 reads the entire file as a string.
      I don't know which docs you're reading, but if you read the docs for the
      md5 module, you'll see that you don't have to do that.
      You could also type help('md5') at the interactive prompt,
      or import md5 followed by help(md5).
      >That's not practical on a 1 GB file that's network mounted.
      Well, whatever calculates the md5 will have to read all the bytes from the source
      you want to check. If you have downloaded a file to another machine, then
      the fastest approach is to run the md5 calculation there, but if you have a gigabit LAN
      connection and things aren't busy, I would think it wouldn't make much difference if you
      read it over the network.

      If you have a C/C++ executable utility that will calculate md5, it will probably
      be fastest to run that on the file. You can run it from Python via popen, if that's
      the context you want to control it from.

      I think there are ways to use RPC to accomplish the same remotely, but I haven't played with that.
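
As a rough sketch of driving an external utility from Python, the snippet below uses the subprocess module (the modern successor to popen) to call md5sum. It assumes a GNU coreutils style md5sum on the PATH, whose output line starts with the digest followed by the file name. Other platforms ship different tools with different output formats. The function name md5_via_utility is hypothetical.

```python
import shutil
import subprocess


def md5_via_utility(path, utility="md5sum"):
    """Return the hex digest printed by an external tool, or None if the tool is missing."""
    exe = shutil.which(utility)
    if exe is None:
        return None  # utility not installed on this machine
    # "--" stops option parsing, so a path that starts with "-" is treated as a file name.
    result = subprocess.run(
        [exe, "--", path], capture_output=True, text=True, check=True
    )
    # Assumed output format: "<digest>  <filename>"; the digest is the first field.
    return result.stdout.split()[0]
```

Run this on the machine that holds the file, so that only the short digest crosses the slow network rather than the file contents.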
      >The only thing I can think of is to set up an inetd daemon on the server
      >that will spit out the md5 hash if given the file path/name.
      >
      >Any other ideas?

      Describe your setup in a little more detail. Someone has probably done it before.

      Regards,
      Bengt Richter


      • Bengt Richter

        #4
        Re: Efficient MD5 (or similar) hashes

        On Sun, 07 Dec 2003 17:21:04 -0800, Erik Max Francis <max@alcyone.com> wrote:
        >Kamus of Kadizhar wrote:
        >
        >> I want to check the integrity of the files after transfer. I can check
        >> the obvious - date, file size - quickly, but what if I want an MD5 hash?
        >>
        >> From reading the python docs, md5 reads the entire file as a string.
        >> That's not practical on a 1 GB file that's network mounted.
        >
        >Python's md5 module just accepts updating strings; the driving code
        >certainly doesn't have to read the file all in at once. Just read it in
        >a chunk at a time:
        Pardon my jumping in, but don't forget to open the file in binary mode,
        e.g., theFile = file(thePath, 'rb'), if you're on Windows.
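
To illustrate why the mode matters, here is a small sketch for current Python, where text mode decodes the bytes and translates line endings on every platform, not only on Windows. The sample data and variable names are made up for the demonstration.

```python
import hashlib
import os
import tempfile

# Write a small file that contains Windows-style line endings.
data = b"line one\r\nline two\r\n"
fd, path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:
    f.write(data)

# Binary mode: the bytes come back exactly as written.
with open(path, "rb") as f:
    binary_digest = hashlib.md5(f.read()).hexdigest()

# Text mode: "\r\n" is translated to "\n" while reading, so the hashed data changes.
with open(path, "r", encoding="ascii") as f:
    text_digest = hashlib.md5(f.read().encode("ascii")).hexdigest()

os.remove(path)

print("binary digest matches original bytes:", binary_digest == hashlib.md5(data).hexdigest())
print("text-mode digest matches binary digest:", binary_digest == text_digest)
```

The two digests differ because the text-mode read silently drops the carriage returns, which is the kind of mismatch binary mode avoids.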
        > hasher = md5.new()
        > while True:
        >     chunk = theFile.read(CHUNK_SIZE)
        >     if not chunk:
        >         break
        >     hasher.update(chunk)
        > theHash = hasher.hexdigest()

        Regards,
        Bengt Richter
