os.wait() losing child?

  • Jason Zheng

    os.wait() losing child?

    This may be a silly question, but is it possible for os.wait() to lose
    track of child processes? I'm running Python 2.4.4 on Linux kernel 2.6.20
    (i686), gcc 4.1.1, and glibc 2.5.

    Here's what happened in my situation. I first created a few child
    processes with Popen. Then, in a while (True) loop, I wait for any of the
    child processes to exit and restart it:

    import os
    from subprocess import Popen

    pids = {}

    for i in xrange(3):
        p = Popen('sleep 1', shell=True, cwd='/home/user',
                  stdout=open(os.devnull, 'w'))
        pids[p.pid] = i

    while True:
        pid, status = os.wait()  # os.wait() returns a (pid, status) tuple
        i = pids[pid]
        del pids[pid]
        print "Child Process %d terminated, restarting" % i
        if someCondition:
            break
        p = Popen('sleep 1', shell=True, cwd='/home/user',
                  stdout=open(os.devnull, 'w'))
        pids[p.pid] = i

    Soon after I started running this program, some of the processes stopped
    showing up, and eventually os.wait() raised an error saying that there
    were no more child processes to wait on. Can anyone tell me what I did
    wrong?
  • greg

    #2
    Re: os.wait() losing child?

    Jason Zheng wrote:
    > while (True):
    >     pid, status = os.wait()
    >     ...
    >     if (someCondition):
    >         break
    >     ...

    Are you sure that someCondition() always becomes true
    when the list of pids is empty? If not, you may end
    up making more wait() calls than there are children.

    It might be safer to do

    while pids:
        ...

    --
    Greg
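
A minimal sketch of the `while pids:` guard suggested above, written for Python 3 on a POSIX system rather than the Python 2.4 used in the thread. The helper names `spawn` and `reap_all` are hypothetical, and the children are created with a bare os.fork() so that nothing else in the process competes for their exit status.

```python
import os


def spawn(exit_code):
    """Fork a child that exits at once with exit_code; return its pid."""
    pid = os.fork()
    if pid == 0:
        os._exit(exit_code)  # child: exit without running parent cleanup
    return pid


def reap_all(pids):
    """Block in os.wait() until every pid in `pids` (pid -> label) is reaped."""
    results = {}
    while pids:  # never call wait() once no known child is left
        pid, status = os.wait()  # returns a (pid, status) tuple
        label = pids.pop(pid, None)
        if label is not None:  # ignore children we did not start
            results[label] = os.WEXITSTATUS(status)
    return results


pids = {spawn(code): "child-%d" % code for code in (0, 1, 2)}
results = reap_all(pids)
print(results)
```

Because the loop condition is the bookkeeping dictionary itself, the number of wait() calls can never exceed the number of children still being tracked.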

    • Jason Zheng

      #3
      Re: os.wait() losing child?

      Hate to reply to my own thread, but this is the working program that can
      demonstrate what I posted earlier:

      import os
      from subprocess import Popen

      pids = {}
      counts = [0, 0, 0]

      for i in xrange(3):
          p = Popen('sleep 1', shell=True, cwd='/home',
                    stdout=open(os.devnull, 'w'))
          pids[p.pid] = i
          print "Starting child process %d (%d)" % (i, p.pid)

      while True:
          (pid, exitstat) = os.wait()
          i = pids[pid]
          del pids[pid]
          counts[i] = counts[i] + 1

          # stop restarting child i once it has run 10 times
          if counts[i] == 10:
              print "Child Process %d terminated." % i
              # the initial True keeps counts[0] from being used as a truth value
              if reduce(lambda x, y: x and (y >= 10), counts, True):
                  break
              continue

          print "Child Process %d terminated, restarting" % i
          p = Popen('sleep 1', shell=True, cwd='/home',
                    stdout=open(os.devnull, 'w'))
          pids[p.pid] = i

      • Jason Zheng

        #4
        Re: os.wait() losing child?

        greg wrote:
        > Jason Zheng wrote:
        >> while (True):
        >>     pid, status = os.wait()
        >>     ...
        >>     if (someCondition):
        >>         break
        >>     ...
        >
        > Are you sure that someCondition() always becomes true
        > when the list of pids is empty? If not, you may end
        > up making more wait() calls than there are children.

        Regardless of the nature of someCondition, what I see from the print
        output of my Python program is that some child processes never trigger
        the unblocking of the os.wait() call.

        ~Jason

        • greg

          #5
          Re: os.wait() losing child?

          Jason Zheng wrote:
          > Hate to reply to my own thread, but this is the working program that can
          > demonstrate what I posted earlier:

          I've figured out what's going on. The Popen class has a
          __del__ method which does a non-blocking wait of its own.
          So you need to keep the Popen instance for each subprocess
          alive until your wait call has cleaned it up.

          The following version seems to work okay.

          import os
          from subprocess import Popen

          pids = {}
          counts = [0, 0, 0]
          p = [None, None, None]  # keeps each live Popen instance referenced

          for i in xrange(3):
              p[i] = Popen('sleep 1', shell=True)
              pids[p[i].pid] = i
              print "Starting child process %d (%d)" % (i, p[i].pid)

          while True:
              (pid, exitstat) = os.wait()
              i = pids[pid]
              del pids[pid]
              counts[i] = counts[i] + 1

              # stop restarting child i once it has run 10 times
              if counts[i] == 10:
                  print "Child Process %d terminated." % i
                  # the initial True keeps counts[0] from being used as a truth value
                  if reduce(lambda x, y: x and (y >= 10), counts, True):
                      break
                  continue

              print "Child Process %d (%d) terminated, restarting" % (i, pid),
              p[i] = Popen('sleep 1', shell=True)
              pids[p[i].pid] = i
              print "(%d)" % p[i].pid

          --
          Greg

          • Jason Zheng

            #6
            Re: os.wait() losing child?

            Greg,

            That explains it! Thanks a lot for your help. I guess this is something
            they do to prevent zombie processes?

            ~Jason

            • Matthew Woodcraft

              #7
              Re: os.wait() losing child?

              greg <greg@cosc.canterbury.ac.nz> wrote:
              > I've figured out what's going on. The Popen class has a
              > __del__ method which does a non-blocking wait of its own.
              > So you need to keep the Popen instance for each subprocess
              > alive until your wait call has cleaned it up.

              I don't think this will be enough for the poster, who has Python 2.4:
              in that version, creating a new Popen object would trigger a wait on
              all 'outstanding' Popen-managed subprocesses.

              It seems to me that subprocess.py assumes that it will do all wait()ing
              on its children itself; I'm not sure if it's safe to rely on the
              details of how this is currently arranged.

              Perhaps a better way would be for subprocess.py to provide its own
              variant of os.wait() for people who want 'wait-for-any-child' (though
              it would be hard to support programs which also had children not
              managed by subprocess.py).

              -M-

              • Jason Zheng

                #8
                Re: os.wait() losing child?

                greg wrote:
                > Jason Zheng wrote:
                >> Hate to reply to my own thread, but this is the working program that
                >> can demonstrate what I posted earlier:
                >
                > I've figured out what's going on. The Popen class has a
                > __del__ method which does a non-blocking wait of its own.
                > So you need to keep the Popen instance for each subprocess
                > alive until your wait call has cleaned it up.
                >
                > The following version seems to work okay.

                It still doesn't work on my machine. I took a closer look at the Popen
                class, and I think the problem is that the __init__ method always calls
                a method _cleanup, which polls every existing Popen instance. The poll
                method does a non-blocking wait.

                If one of my child processes finishes as I create a new Popen instance,
                then the _cleanup method effectively de-zombifies the child process, so
                I can no longer expect os.wait() to return that pid.

                ~Jason
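
The effect described above can be reproduced without subprocess at all. The following is a sketch for Python 3 on a POSIX system: a non-blocking os.waitpid() stands in for the poll that Popen performs, and `poll_child` is a hypothetical helper, not part of any library. Once the first waiter has collected the exit status, a wait-for-any call no longer reports that pid.

```python
import os
import time


def poll_child(pid):
    """Non-blocking check in the style of Popen.poll(): exit code or None."""
    reaped_pid, status = os.waitpid(pid, os.WNOHANG)
    if reaped_pid == 0:
        return None  # child is still running
    return os.WEXITSTATUS(status)


child_pid = os.fork()
if child_pid == 0:
    os._exit(7)  # child: exit immediately with a recognisable code

# First waiter: poll until the child has exited, which also reaps it.
exit_code = poll_child(child_pid)
while exit_code is None:
    time.sleep(0.01)
    exit_code = poll_child(child_pid)

# Second waiter: a wait-for-any call can no longer report that pid.
try:
    reaped_pid, _ = os.waitpid(-1, os.WNOHANG)  # non-blocking form of os.wait()
except ChildProcessError:
    reaped_pid = None  # no children left at all
seen_again = reaped_pid == child_pid
print(exit_code, seen_again)
```

An exit status can be collected exactly once, so whichever caller waits first wins; the second caller sees either a different child or an error.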

                • Jason Zheng

                  #9
                  Re: os.wait() losing child?

                  Matthew Woodcraft wrote:
                  > greg <greg@cosc.canterbury.ac.nz> wrote:
                  >> I've figured out what's going on. The Popen class has a
                  >> __del__ method which does a non-blocking wait of its own.
                  >> So you need to keep the Popen instance for each subprocess
                  >> alive until your wait call has cleaned it up.
                  >
                  > I don't think this will be enough for the poster, who has Python 2.4:
                  > in that version, opening a new Popen object would trigger the wait on
                  > all 'outstanding' Popen-managed subprocesses.
                  >
                  > It seems to me that subprocess.py assumes that it will do all wait()ing
                  > on its children itself; I'm not sure if it's safe to rely on the
                  > details of how this is currently arranged.
                  >
                  > Perhaps a better way would be for subprocess.py to provide its own
                  > variant of os.wait() for people who want 'wait-for-any-child' (though
                  > it would be hard to support programs which also had children not
                  > managed by subprocess.py).
                  >
                  > -M-

                  Thanks, that's exactly what I need; my program really needs
                  os.wait() to be reliable. Perhaps I could pass a flag to Popen to tell
                  it never to os.wait() on the new pid (though it's OK to os.wait() on other
                  Popen instances upon _cleanup()).

                  • Nick Craig-Wood

                    #10
                    Re: os.wait() losing child?

                    Jason Zheng <Xin.Zheng@jpl.nasa.gov> wrote:
                    > greg wrote:
                    >> Jason Zheng wrote:
                    >>> Hate to reply to my own thread, but this is the working program that
                    >>> can demonstrate what I posted earlier:
                    >>
                    >> I've figured out what's going on. The Popen class has a
                    >> __del__ method which does a non-blocking wait of its own.
                    >> So you need to keep the Popen instance for each subprocess
                    >> alive until your wait call has cleaned it up.
                    >>
                    >> The following version seems to work okay.
                    >
                    > It still doesn't work on my machine. I took a closer look at the Popen
                    > class, and I think the problem is that the __init__ method always calls
                    > a method _cleanup, which polls every existing Popen instance. The poll
                    > method does a nonblocking wait.
                    >
                    > If one of my child process finishes as I create a new Popen instance,
                    > then the _cleanup method effectively de-zombifies the child process, so
                    > I can no longer expect to see the return of that pid on os.wait()
                    > any more.

                    The problem you are having is that you are letting Popen do half the job
                    and doing the other half yourself.

                    Here is a way which works, done completely with Popen. Polling the
                    subprocesses is slightly less efficient than using os.wait() but does
                    work. In practice you want to do this anyway to see if your children
                    exceed their time limits etc.

                    import os
                    import time
                    from subprocess import Popen

                    processes = []
                    counts = [0, 0, 0]

                    for i in xrange(3):
                        p = Popen('sleep 1', shell=True, cwd='/home', stdout=open(os.devnull, 'w'))
                        processes.append(p)
                        print "Starting child process %d (%d)" % (i, p.pid)

                    while True:
                        for i, p in enumerate(processes):
                            if p is None:
                                continue  # slot i has completed its 10 runs
                            exitstat = p.poll()
                            pid = p.pid
                            if exitstat is not None:
                                break
                        else:
                            # no child has exited yet; sleep briefly and poll again
                            time.sleep(0.1)
                            continue
                        counts[i] = counts[i] + 1

                        # stop restarting child i once it has run 10 times
                        if counts[i] == 10:
                            print "Child Process %d terminated." % i
                            processes[i] = None  # otherwise poll() would report it again
                            if reduce(lambda x, y: x and (y >= 10), counts, True):
                                break
                            continue

                        print "Child Process %d terminated, restarting" % i
                        processes[i] = Popen('sleep 1', shell=True, cwd='/home', stdout=open(os.devnull, 'w'))



                    --
                    Nick Craig-Wood <nick@craig-wood.com> -- http://www.craig-wood.com/nick
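
The polling approach, including the time-limit check mentioned above, might look like this in Python 3. This is only a sketch: `wait_any` and `child` are hypothetical helpers, and the children are short Python one-liners instead of `sleep 1` so that each exit code is distinguishable.

```python
import subprocess
import sys
import time


def wait_any(procs, poll_interval=0.05, timeout=None):
    """Return (index, proc) for the first exited process, or None on timeout."""
    deadline = None if timeout is None else time.monotonic() + timeout
    while True:
        for index, proc in enumerate(procs):
            if proc.poll() is not None:  # poll() never blocks
                return index, proc
        if deadline is not None and time.monotonic() >= deadline:
            return None
        time.sleep(poll_interval)


def child(seconds, code):
    """Start a child that sleeps for `seconds` and then exits with `code`."""
    script = "import sys, time; time.sleep(%r); sys.exit(%d)" % (seconds, code)
    return subprocess.Popen([sys.executable, "-c", script])


procs = [child(5.0, 0), child(0.1, 3)]
index, finished = wait_any(procs, timeout=10)
print(index, finished.returncode)

# Clean up the child that is still sleeping.
procs[0].kill()
procs[0].wait()
```

All waiting goes through the Popen objects here, so subprocess's own bookkeeping and the caller never compete for the same exit status.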

                    • Jason Zheng

                      #11
                      Re: os.wait() losing child?

                      Nick Craig-Wood wrote:
                      > The problem you are having is you are letting Popen do half the job
                      > and doing the other half yourself.

                      Except that I never wanted Popen to do any process management for me to
                      begin with. The Popen class advertises itself as a replacement for
                      os.popen, popen2, popen4, etc., and IMHO it should leave the
                      clean-up to the users, or at least make it optional.

                      > Here is a way which works, done completely with Popen. Polling the
                      > subprocesses is slightly less efficient than using os.wait() but does
                      > work. In practice you want to do this anyway to see if your children
                      > exceed their time limits etc.

                      I think your polling approach works; it seems there is no way around this
                      problem other than polling or extending the Popen class.

                      thanks,

                      Jason

                      • Nick Craig-Wood

                        #12
                        Re: os.wait() losing child?

                        Jason Zheng <Xin.Zheng@jpl.nasa.gov> wrote:
                        > Nick Craig-Wood wrote:
                        >> The problem you are having is you are letting Popen do half the job
                        >> and doing the other half yourself.
                        >
                        > Except that I never wanted Popen to do any thread management for me to
                        > begin with. Popen class has advertised itself as a replacement for
                        > os.popen, popen2, popen4, and etc., and IMHO it should leave the
                        > clean-up to the users, or at least leave it as an option.
                        >
                        >> Here is a way which works, done completely with Popen. Polling the
                        >> subprocesses is slightly less efficient than using os.wait() but does
                        >> work. In practice you want to do this anyway to see if your children
                        >> exceed their time limits etc.
                        >
                        > I think your polling way works; it seems there no other way around this
                        > problem other than polling or extending Popen class.

                        I think polling is probably the right way of doing it...

                        Internally subprocess uses os.waitpid(pid, options), waiting only for
                        its own specific pids. IMHO this is the right way of doing it, rather
                        than os.wait(), which waits for any child. os.wait() can reap children
                        that you weren't expecting (say some library uses os.system())...

                        --
                        Nick Craig-Wood <nick@craig-wood.com> -- http://www.craig-wood.com/nick
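
To illustrate the difference between waiting for a specific pid and waiting for any child, here is a small Python 3 sketch for POSIX systems; `spawn` is a hypothetical helper. os.waitpid() with an explicit pid collects only that child and leaves the other one untouched for whoever owns it.

```python
import os


def spawn(exit_code):
    """Fork a child that exits at once with exit_code; return its pid."""
    pid = os.fork()
    if pid == 0:
        os._exit(exit_code)  # child: exit without running parent cleanup
    return pid


first = spawn(1)
second = spawn(2)

# Wait for `second` only; `first` is not reaped by this call.
pid_b, status_b = os.waitpid(second, 0)

# `first` is still available to be collected by its own waiter.
pid_a, status_a = os.waitpid(first, 0)

codes = (os.WEXITSTATUS(status_a), os.WEXITSTATUS(status_b))
print(codes)
```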

                        • Hrvoje Niksic

                          #13
                          Re: os.wait() losing child?

                          Jason Zheng <Xin.Zheng@jpl.nasa.gov> writes:
                          > greg wrote:
                          >> Jason Zheng wrote:
                          >>> Hate to reply to my own thread, but this is the working program
                          >>> that can demonstrate what I posted earlier:
                          >>
                          >> I've figured out what's going on. The Popen class has a
                          >> __del__ method which does a non-blocking wait of its own.
                          >> So you need to keep the Popen instance for each subprocess
                          >> alive until your wait call has cleaned it up.
                          >> The following version seems to work okay.
                          >
                          > It still doesn't work on my machine. I took a closer look at the Popen
                          > class, and I think the problem is that the __init__ method always
                          > calls a method _cleanup, which polls every existing Popen
                          > instance.

                          Actually, it's not that bad. _cleanup only polls the instances that
                          are no longer referenced by user code, but still running. If you hang
                          on to Popen instances, they won't be added to _active, and __init__
                          won't reap them (_active is only populated from Popen.__del__).

                          This version is a trivial modification of your code to that effect.
                          Does it work for you?

                          #!/usr/bin/python

                          import os
                          from subprocess import Popen

                          pids = {}
                          counts = [0, 0, 0]

                          for i in xrange(3):
                              p = Popen('sleep 1', shell=True, cwd='/home', stdout=open(os.devnull, 'w'))
                              pids[p.pid] = p, i  # storing p keeps the Popen instance referenced
                              print "Starting child process %d (%d)" % (i, p.pid)

                          while True:
                              pid, ignored = os.wait()
                              try:
                                  p, i = pids[pid]
                              except KeyError:
                                  # not one of ours
                                  continue
                              del pids[pid]
                              counts[i] += 1

                              # stop restarting child i once it has run 10 times
                              if counts[i] == 10:
                                  print "Child Process %d terminated." % i
                                  if reduce(lambda x, y: x and (y >= 10), counts, True):
                                      break
                                  continue

                              print "Child Process %d terminated, restarting" % i
                              p = Popen('sleep 1', shell=True, cwd='/home', stdout=open(os.devnull, 'w'))
                              pids[p.pid] = p, i

                          • Hrvoje Niksic

                            #14
                            Re: os.wait() losing child?

                            Nick Craig-Wood <nick@craig-wood.com> writes:
                            >> I think your polling way works; it seems there no other way around this
                            >> problem other than polling or extending Popen class.
                            >
                            > I think polling is probably the right way of doing it...

                            It requires the program to wake up every 0.1s to poll for freshly
                            exited subprocesses. That doesn't consume excess CPU cycles, but it
                            does prevent the kernel from swapping it out when there is nothing to
                            do. Sleeping in os.wait allows the operating system to know exactly
                            what the process is waiting for, and to move it out of the way until
                            those conditions are met. (Pedants would also notice that polling
                            introduces on average 0.1/2 seconds delay between the subprocess dying
                            and the parent reaping it.)

                            In general, a program that waits for something should do so in a
                            single call to the OS. OP's usage of os.wait was exactly correct.

                            Fortunately the problem can be worked around by hanging on to Popen
                            instances until they are reaped. If all of them are kept referenced
                            when os.wait is called, they will never end up in the _active list
                            because the list is only populated in Popen.__del__.

                            > Internally subprocess uses os.waitpid(pid) just waiting for its own
                            > specific pids. IMHO this is the right way of doing it other than
                            > os.wait() which waits for any pids. os.wait() can reap children
                            > that you weren't expecting (say some library uses os.system())...

                            os.system() calls waitpid immediately after the fork. This can still be a
                            problem for applications that call wait in a dedicated thread, but the
                            program can always ignore the processes it doesn't know anything
                            about.

                            • Jason Zheng

                              #15
                              Re: os.wait() losing child?

                              Hrvoje Niksic wrote:
                              > Actually, it's not that bad. _cleanup only polls the instances that
                              > are no longer referenced by user code, but still running. If you hang
                              > on to Popen instances, they won't be added to _active, and __init__
                              > won't reap them (_active is only populated from Popen.__del__).

                              Perhaps that's the difference between Python 2.4 and 2.5. In 2.4,
                              Popen's __init__ always appends self to _active:

                              def __init__(...):
                                  _cleanup()
                                  ...
                                  self._execute_child(...)
                                  ...
                                  _active.append(self)

                              > This version is a trivial modification of your code to that effect.
                              > Does it work for you?

                              Nope, it still doesn't work. I'm running Python 2.4.4, though.

                              $ python test.py
                              Starting child process 0 (26497)
                              Starting child process 1 (26498)
                              Starting child process 2 (26499)
                              Child Process 2 terminated, restarting
                              Child Process 2 terminated, restarting
                              Child Process 2 terminated, restarting
                              Child Process 2 terminated, restarting
                              Child Process 2 terminated, restarting
                              Child Process 2 terminated, restarting
                              Child Process 2 terminated, restarting
                              Child Process 2 terminated, restarting
                              Child Process 2 terminated, restarting
                              Child Process 2 terminated.
                              Traceback (most recent call last):
                                File "test.py", line 15, in ?
                                  pid, ignored = os.wait()
                              OSError: [Errno 10] No child processes
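
The error at the end of this output is ECHILD (errno 10 on Linux), which os.wait() raises when the process has no children left to wait for. In Python 3 it surfaces as ChildProcessError, a subclass of OSError. The sketch below, with the hypothetical helper `wait_for_child`, turns that condition into a None return value instead of a traceback.

```python
import errno
import os


def wait_for_child():
    """Return (pid, status) for the next child to exit, or None if none remain."""
    try:
        return os.wait()
    except ChildProcessError as exc:  # OSError subclass carrying errno ECHILD
        assert exc.errno == errno.ECHILD
        return None


child_pid = os.fork()
if child_pid == 0:
    os._exit(0)  # child: exit immediately

first = wait_for_child()   # reaps the only child
second = wait_for_child()  # nothing left to wait for
print(first, second)
```

Catching the error only hides the symptom; if a child was expected and the call returns None, something else has already collected its exit status.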
