"Do this, and come back when you're done"

  • Kamus of Kadizhar

    "Do this, and come back when you're done"

    I have the following function which generates MD5 hashes for files on a
    local and remote server. The remote server has a little applet that
    runs from inetd and generates an MD5 hash given the file name.

    The problem is that it takes 2+ minutes to generate the MD5 hash, so
    this function takes about 5 minutes every time it is called. Since the
    first MD5 hash is generated on a remote machine, the local machine does
    nothing but wait for half that time.

    Is there any way to rewrite each half of the function to run in the
    background, so to speak, and then have a master process that waits on
    the results? This would cut execution time in half more or less.

    # checkMD5
    def checkMD5(fileName, localDir):
        # get remote hash
        Socket = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        Socket.connect((MD5server, 888))
        # throw away ID string
        Socket.recv(256)
        Socket.send(fileName + '\n')
        remoteMD5hash = Socket.recv(256)

        # get local hash
        try:
            file = open(makeMovieName(localDir, fileName), 'r')
        except IOError:
            localMD5hash = '0'
        else:
            hasher = md5.new()
            while True:
                chunk = file.read(1024)
                if not chunk:
                    break
                hasher.update(chunk)
            localMD5hash = hasher.hexdigest()
        if Debug: print "local:", localMD5hash, "remote:", remoteMD5hash
        return localMD5hash.strip() == remoteMD5hash.strip()

    -Kamus

    --
    o__ | If you're old, eat right and ride a decent bike.
    ,>/'_ | Q.
    (_)\(_) | Usenet posting`

  • Paul Rubin

    #2
    Re: "Do this, and come back when you're done"

    Kamus of Kadizhar <yan@NsOeSiPnAeMr.com> writes:
    > Is there any way to rewrite each half of the function to run in the
    > background, so to speak, and then have a master process that waits on
    > the results? This would cut execution time in half more or less.

    Sure, use the threading module. Think about another aspect of what
    you're doing though. You're comparing the md5's of a local and remote
    copy of the same file, to see if they're the same. Are you trying to
    detect malicious tampering? If someone tampered with one of the
    files, how do you know that person can't also intercept your network
    connection and send you the "correct" md5, so you won't detect the
    tampering? Or for that matter, do you know that the remote copy of
    the program itself hasn't been tampered with?
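
To make the threading suggestion above concrete, here is a minimal sketch for current Python, where `hashlib` replaces the old `md5` module. The names `hash_local_file` and `check_md5` are made up for illustration, and `fetch_remote_hash` is a hypothetical callable standing in for whatever talks to the remote applet. It is one possible shape, not the only one.

```python
import hashlib
import threading


def hash_local_file(path, chunk_size=64 * 1024):
    """Return the hex MD5 of a local file, or '0' if it cannot be read."""
    hasher = hashlib.md5()
    try:
        with open(path, "rb") as handle:
            for chunk in iter(lambda: handle.read(chunk_size), b""):
                hasher.update(chunk)
    except OSError:
        return "0"
    return hasher.hexdigest()


def check_md5(path, fetch_remote_hash):
    """Hash locally while a background thread waits for the remote hash.

    fetch_remote_hash is a hypothetical zero-argument callable that blocks
    until the remote machine has answered and returns its hash as a string.
    """
    result = {}

    def worker():
        result["remote"] = fetch_remote_hash()

    thread = threading.Thread(target=worker)
    thread.start()                   # the remote request runs in the background
    local = hash_local_file(path)    # meanwhile, do the local half
    thread.join()                    # wait until the remote half is done
    return local.strip() == result["remote"].strip()
```

The dictionary is just a simple way to hand the thread's result back to the caller. A real version would also want to deal with the worker raising an exception, which this sketch does not.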


    • Kamus of Kadizhar

      #3
      Re: "Do this, and come back when you're done"

      Paul Rubin wrote:
      > Kamus of Kadizhar <yan@NsOeSiPnAeMr.com> writes:
      >
      >> Is there any way to rewrite each half of the function to run in the
      >> background, so to speak, and then have a master process that waits on
      >> the results? This would cut execution time in half more or less.
      >
      >
      > Sure, use the threading module.

      OK, I'll read up on that. I've written gobs of scientific type code,
      but this OS stuff is new to me.

      > Think about another aspect of what
      > you're doing though. You're comparing the md5's of a local and remote
      > copy of the same file, to see if they're the same. Are you trying to
      > detect malicious tampering?

      No, actually, both machines are under my control (and in my house). I'm
      slinging large (1GB MOL) files around on an unreliable, slow wireless
      network. I am trying to detect an incomplete copy across the network.
      The local machine is the video player and the remote machine is the
      archive server. My kids have a habit of just shutting down the video
      server, resulting in incomplete transfers to the archives.



      If it's appropriate for this newsgroup, I'd like to post the entire
      effort for comments (it's my first bit of python code.) So far, python
      has been the easiest language to learn I've ever come across. I tried
      learning perl, and it was a disaster.... Too convoluted. Python is a
      breath of fresh air. Also, the docs and support here are excellent.
      :-) My thanks to all the volunteers who put in time to build python.



      -Kamus


      --
      o__ | If you're old, eat right and ride a decent bike.
      ,>/'_ | Q.
      (_)\(_) | Usenet posting`


      • Paul Rubin

        #4
        Re: "Do this, and come back when you're done"

        Kamus of Kadizhar <yan@NsOeSiPnAeMr.com> writes:
        > No, actually, both machines are under my control (and in my house).
        > I'm slinging large (1GB MOL) files around on an unreliable, slow
        > wireless network. I am trying to detect an incomplete copy across the
        > network. The local machine is the video player and the remote machine
        > is the archive server. My kids have a habit of just shutting down the
        > video server, resulting in incomplete transfers to the archives. If
        > it's appropriate for this newsgroup, I'd like to post the entire
        > effort for comments (it's my first bit of python code.) So far, python
        > has been the easiest language to learn I've ever come across. I tried
        > learning perl, and it was a disaster.... Too convoluted. Python is a
        > breath of fresh air. Also, the docs and support here are
        > excellent. :-) My thanks to all the volunteers who put in time to
        > build python.

        Why don't you look at the rsync program? It brings two machines into
        sync with each other by automatically detecting differences between
        files and sending only the deltas over the network.


        • Kamus of Kadizhar

          #5
          Re: "Do this, and come back when you're done"

          Paul Rubin wrote:

          > Why don't you look at the rsync program. It brings two machines into
          > sync with each other by automatically detecting differences between
          > files and sending only the deltas over the network.

          Well, the purpose of this whole project was to learn python. I did look
          at the pysync modules (rsync written in python), but it's too
          complicated for me at the moment.

          -Kamus

          --
          o__ | If you're old, eat right and ride a decent bike.
          ,>/'_ | Q.
          (_)\(_) | Usenet posting`


          • Roy Smith

            #6
            Re: "Do this, and come back when you're done"

            Kamus of Kadizhar <yan@NsOeSiPnAeMr.com> wrote:
            > Is there any way to rewrite each half of the function to run in the
            > background, so to speak, and then have a master process that waits on
            > the results?

            Yup. Two ways in fact.

            The traditional way would be to fork another process to do the work and
            have the parent process wait for the child to finish. You'll need to
            use the fork() and exec() functions that can be found in the os module.

            The other way would be to do something similar, but with threads instead
            of processes. The basic flow is the same; you create a thread, have
            that thread do the stuff that takes a long time, and then rejoin with
            the primary thread. Of course (just like with child processes), you
            could have multiple of these running at the same time doing different
            parts of a parallelizable job. Take a look at the threading module.

            I'm intentionally not including any sample code here, because the
            possibilities are numerous. Exactly how you do it depends on many
            factors. I'm guessing that doing it with threads is what you really
            want to do, so my suggestion would be to start by reading up on the
            threading module and playing with some examples to get a feel for how
            it works. Working with threads is becoming more and more mainstream
            as more operating systems and languages provide support for it and the
            programming community at large becomes more familiar and comfortable
            with the issues involved.
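
For readers who want a concrete starting point for the process-based route described first, here is a rough sketch. It is Unix-only because it relies on `os.fork()`, and it skips `exec()` because the child runs Python code from the same program (`exec` is only needed when the child should turn into a different program). The function names `md5_in_child` and `collect` are invented for this illustration.

```python
import hashlib
import os


def md5_in_child(path):
    """Fork a child that hashes path and writes the hex digest to a pipe.

    Returns (pid, read_fd) so the parent can carry on and collect later.
    """
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:                          # child process
        os.close(read_fd)
        status = 1                        # assume failure until the digest is written
        try:
            hasher = hashlib.md5()
            with open(path, "rb") as handle:
                for chunk in iter(lambda: handle.read(65536), b""):
                    hasher.update(chunk)
            os.write(write_fd, hasher.hexdigest().encode("ascii"))
            status = 0
        except OSError:
            pass
        finally:
            os._exit(status)              # never fall back into the parent's code
    os.close(write_fd)                    # parent keeps only the read end
    return pid, read_fd


def collect(pid, read_fd):
    """Wait for the child and return its digest, or None if it failed."""
    with os.fdopen(read_fd, "rb") as pipe:
        digest = pipe.read().decode("ascii")   # returns at EOF, i.e. child exit
    _, status = os.waitpid(pid, 0)
    return digest if status == 0 and digest else None
```

The parent would call `md5_in_child(...)`, do its other work (for instance, wait on the remote machine), and then call `collect(...)` to pick up the result.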


            • Valentino Volonghi aka Dialtone

              #7
              Re: "Do this, and come back when you're done"

              Kamus of Kadizhar <yan@NsOeSiPnAeMr.com> writes:

              > Is there any way to rewrite each half of the function to run in the
              > background, so to speak, and then have a master process that waits on
              > the results? This would cut execution time in half more or less.

              Why don't you use twisted? It's a net framework with a lot of
              protocols (and you can define your own ones) and it's based on async
              sockets which let you write programs avoiding threads most of the
              time.

              www.twistedmatrix.com

              I'm sure you will find out that's the best thing ever done for python
              :)

              --
              Valentino Volonghi, Regia SpA, Milan
              Linux User #310274, Gentoo Proud User


              • Paul Rubin

                #8
                Re: "Do this, and come back when you're done"

                "Donn Cave" <donn@drizzle.com> writes:
                > Yes. I may be missing something here, because the followups
                > I have seen strike me as somewhat misguided, if they're not
                > just fooling with you. You already have two independent threads
                > or processes here, one on each machine. All you need to do is
                > take the results from the remote machine AFTER the local computation.
                > Move the line that says "remoteMD5hash = Socket.recv(256)" to
                > after the block that ends with "localMD5hash = hasher.hexdigest()".
                > No?

                Can the remote process time out if the local side takes too long to
                read from the socket? That could happen if the two machines aren't
                the same speed.


                • Donn Cave

                  #9
                  Re: "Do this, and come back when you're done"

                  Quoth Paul Rubin <http://phr.cx@NOSPAM.invalid>:
                  ....
                  | Can the remote process time out if the local side takes too long to
                  | read from the socket? That could happen if the two machines aren't
                  | the same speed.

                  I wouldn't expect so. I'm no expert in such things, but I would
                  expect the remote process to return from send(), and exit; the
                  data would be waiting in a kernel mbuf on the local side.

                  Donn Cave, donn@drizzle.com
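
The reordering discussed here (send the request, do the local hash, and only then read the remote answer) can be sketched roughly as follows in current Python. The function name `check_md5` and its parameters are illustrative; `sock` is assumed to be an already-connected socket whose ID string has been read and discarded.

```python
import hashlib


def check_md5(sock, file_name, local_path):
    """Ask the remote side for its hash, hash locally, then read the reply.

    sock is assumed to be a connected socket to the remote applet, with any
    greeting already consumed by the caller.
    """
    sock.sendall(file_name.encode() + b"\n")    # remote machine starts working

    hasher = hashlib.md5()                       # local work overlaps with remote
    with open(local_path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            hasher.update(chunk)

    # Only now block on the reply. If the remote side finished first, its
    # answer has been waiting in the local receive buffer in the meantime.
    # A single recv() is kept for brevity; it may return a partial reply.
    remote = sock.recv(256).decode()
    return hasher.hexdigest() == remote.strip()
```

No threads or extra processes are involved: the two machines already work in parallel, and the socket buffer holds the remote result until the local side asks for it.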


                  • Alan Kennedy

                    #10
                    Re: "Do this, and come back when you're done"

                    [Kamus of Kadizhar]
                    > So far, python
                    > has been the easiest language to learn I've ever come across. I tried
                    > learning perl, and it was a disaster.... Too convoluted. Python is a
                    > breath of fresh air. Also, the docs and support here are excellent.
                    > :-) My thanks to all the volunteers who put in time to build python.

                    +1 QOTW.

                    regards,

                    --
                    alan kennedy
                    ------------------------------------------------------
                    check http headers here: http://xhaus.com/headers
                    email alan: http://xhaus.com/contact/alan


                    • Peter Hansen

                      #11
                      Re: "Do this, and come back when you're done"

                      Valentino Volonghi aka Dialtone wrote:
                      >
                      > Kamus of Kadizhar <yan@NsOeSiPnAeMr.com> writes:
                      >
                      > > Is there any way to rewrite each half of the function to run in the
                      > > background, so to speak, and then have a master process that waits on
                      > > the results? This would cut execution time in half more or less.
                      >
                      > Why don't you use twisted? It's a net framework with a lot of
                      > protocols (and you can define your own ones) and it's based on async
                      > sockets which let you write programs avoiding threads most of the
                      > time.
                      >
                      > www.twistedmatrix.com
                      >
                      > I'm sure you will find out that's the best thing ever done for python
                      > :)

                      I second that advice, and will also mention that it would avoid the sort
                      of bug that I pointed out in your first post, involving the simplistic
                      .recv(256) calls you are doing. Twisted would make the code much more
                      readable *and* reliable. Well worth learning. If you're doing this
                      just to learn Python, you could do worse than get it working with Twisted,
                      then go poking into the Twisted internals to see how *it* works instead.
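
The earlier post Peter mentions is not part of this thread, so the following is an educated guess at the issue: TCP is a byte stream, and a single `recv(256)` may return only part of what the other side sent. A plain-socket sketch of reading one complete reply is shown below. It assumes the remote applet ends its answer with a newline, which the thread does not confirm, and `recv_line` is a made-up helper name.

```python
def recv_line(sock, limit=4096):
    """Read from sock until a newline, EOF, or limit bytes; return the line.

    Anything received after the first newline is discarded, which is fine
    for a one-question, one-answer exchange like this one.
    """
    buffer = bytearray()
    while b"\n" not in buffer and len(buffer) < limit:
        chunk = sock.recv(256)
        if not chunk:              # empty result: peer closed the connection
            break
        buffer.extend(chunk)
    line, _, _ = bytes(buffer).partition(b"\n")
    return line.decode("ascii", "replace")
```

Calling `recv_line(sock)` in place of a bare `sock.recv(256)` keeps the comparison from failing just because the hash arrived in two pieces.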


                      -Peter


                      • Nick Vargish

                        #12
                        Re: "Do this, and come back when you're done"

                        Kamus of Kadizhar <yan@NsOeSiPnAeMr.com> writes:

                        > No, actually, both machines are under my control (and in my house).
                        > I'm slinging large (1GB MOL) files around on an unreliable, slow
                        > wireless network. I am trying to detect an incomplete copy across the
                        > network.

                        If you're checking for incomplete copies, then md5 is overkill. Just
                        make sure the file sizes match.

                        If you're checking for corruption, then maybe doing an md5 sum would
                        help, but again, you only need to do that if the files are the same
                        size.
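
A small sketch of that "size first, hash only if needed" idea, in current Python. The function name `copy_looks_complete` is invented, and `remote_size` and `remote_md5` are assumed to be obtained from the archive server by some means not shown here.

```python
import hashlib
import os


def copy_looks_complete(local_path, remote_size, remote_md5=None):
    """Compare sizes first; hash only when sizes agree and a hash is given."""
    try:
        local_size = os.path.getsize(local_path)
    except OSError:
        return False                      # missing or unreadable local file
    if local_size != remote_size:
        return False                      # truncated copy: no hashing needed
    if remote_md5 is None:
        return True                       # caller asked for the size check only
    hasher = hashlib.md5()
    with open(local_path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            hasher.update(chunk)
    return hasher.hexdigest() == remote_md5.strip()
```

For a transfer that was simply cut short, the size comparison returns almost immediately, so the multi-minute hash is only paid for files that are at least the right length.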

                        The Python Cookbook site has a recipe that lets you farm out "jobs" to
                        "worker threads", which might help you if you do go with checksumming
                        every file:
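
The recipe itself is not reproduced here, but the general idea of handing "jobs" to "worker threads" can be sketched with the standard library's `queue` and `threading` modules. This is an independent illustration, not the Cookbook recipe, and `run_jobs` is a made-up name.

```python
import queue
import threading


def run_jobs(jobs, worker_count=4):
    """Run zero-argument callables on worker threads; return results in order."""
    todo = queue.Queue()
    results = [None] * len(jobs)
    for index, job in enumerate(jobs):
        todo.put((index, job))

    def worker():
        while True:
            try:
                index, job = todo.get_nowait()
            except queue.Empty:
                return                    # nothing left to do
            results[index] = job()        # each slot is written by one thread only

    threads = [threading.Thread(target=worker) for _ in range(worker_count)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results
```

Each job could be a small function that checks one file, so several slow checks are in flight at once while the main program waits for all of them to finish.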



                        Nick

                        --
                        # sigmask || 0.2 || 20030107 || public domain || feed this to a python
                        print reduce(lambda x,y:x+chr(ord(y)-1),' Ojdl!Wbshjti!=obwAcboefstobudi/psh?')
