On Thu, Jun 05, 2003 at 08:15:32AM -0500, Brian W. Barrett wrote:
> OpenPBS/PBS Pro are *supposed* to send a SIGTERM, wait a couple seconds,
> then send a SIGKILL. PBS Pro does this pretty much all the time every
> time. OpenPBS seems to have certain times when it just doesn't bother to
> do anything. We use the SIGTERM as our signal to clean up all the shared
> memory segments and kill all our processes and all that. If we don't get
> it, then we are basically out of luck as far as cleaning up goes.
We've seen similar behaviour on our OpenPBS systems. (This is with
patched LAM 6.6b1 and mpiexec, but I'd guess it's the same problem.)
While it's true that qdel sends SIGTERM followed shortly afterwards by
SIGKILL, the PBS MOM sends SIGKILL to all spawned processes as soon as the
controlling shell exits, so LAM can get killed with SIGKILL rather than
SIGTERM if the shell gets SIGTERM first. We use a quick kludge to the MOM
source code to fix this on our systems, and it seems to work pretty well. See
mom_softkill.patch at http://bellatrix.pcl.ox.ac.uk/~ben/pbs/.
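
For illustration, the "SIGTERM first, SIGKILL only after a grace period"
behaviour described above can be sketched roughly as follows. This is not
the contents of mom_softkill.patch and not PBS code; the name soft_kill
and the grace parameter are made up for the example, and it assumes the
job was started in its own process group.

```python
import os
import signal
import subprocess


def soft_kill(proc, grace=2.0):
    """Signal a job's whole process group politely, then forcibly.

    Hypothetical helper, not part of PBS or LAM. Assumes `proc` is a
    subprocess.Popen that leads its own process group.
    Returns the exit status of the group leader.
    """
    pgid = os.getpgid(proc.pid)
    # Give every process in the group a chance to clean up
    # (shared memory segments, child processes, and so on).
    os.killpg(pgid, signal.SIGTERM)
    try:
        return proc.wait(timeout=grace)
    except subprocess.TimeoutExpired:
        # The leader ignored SIGTERM or is too slow: force the issue.
        os.killpg(pgid, signal.SIGKILL)
        return proc.wait()
```

The job has to be started with something like
subprocess.Popen(cmd, start_new_session=True), otherwise killpg would
also signal the caller's own process group. Note that the sketch only
waits for the group leader, so other processes in the group may still
be cleaning up when it returns.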
Ben
--
ben_at_[hidden] http://bellatrix.pcl.ox.ac.uk/~ben/
"So you found a girl who thinks really deep thoughts,
What's so amazing about really deep thoughts?"