On Mon, 24 May 2004 syoon_at_[hidden] wrote:
> My job mysteriously exited from an MPI run
> with the following error messages returned.
> I'm wondering what "signal 9" means, and why this
> happened.
Signal 9 is SIGKILL, the uncatchable signal of death. There really aren't
a whole lot of things that can cause a SIGKILL to be sent, but most of
them are a bit tough to track down. Here's a short list:
* Out of memory errors. If the kernel (especially the Linux kernel)
decides it is under too much memory pressure, it will try to kill
off some of the processes with a SIGKILL. The most likely suspect
is the highest memory user, so if your app has a memory leak, that
could be the cause of the problems.
* Batch schedulers. If you are running under a batch scheduler and
have run out of your allocation time, the scheduler often will use
a SIGKILL to get your processes off the nodes.
* Random acts of kill. I've seen scripts that don't behave as expected
and send SIGKILLs to the wrong process. If you have some extra stuff
flying around, you might want to make sure the script isn't sending
the SIGKILL accidentally. Or maybe your sysadmin just doesn't like
you? :)
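If you want to see what a signal 9 death looks like from the parent's side, here is a minimal sketch in Python (my choice for illustration, nothing to do with your MPI job). It assumes a POSIX system such as Linux. The helper names can_catch, describe_exit, and run_and_kill are made up for this example. It shows two things: a process cannot install a handler for SIGKILL, and a parent sees a child killed by signal N as a return code of -N.

```python
import signal
import subprocess
import sys


def can_catch(signum):
    """Return True if a handler can be installed for signum (hypothetical helper)."""
    try:
        previous = signal.signal(signum, lambda s, f: None)
    except (OSError, ValueError):
        # The OS refuses handlers for SIGKILL (and SIGSTOP).
        return False
    if previous is not None:
        signal.signal(signum, previous)  # put the old handler back
    return True


def describe_exit(returncode):
    """Turn a subprocess return code into a readable description."""
    if returncode < 0:
        # subprocess reports "killed by signal N" as -N.
        return "killed by " + signal.Signals(-returncode).name
    return "exited with status %d" % returncode


def run_and_kill():
    """Start a sleeping child, send it SIGKILL, and return its return code."""
    child = subprocess.Popen(
        [sys.executable, "-c", "import time; time.sleep(60)"]
    )
    child.send_signal(signal.SIGKILL)
    child.wait()
    return child.returncode


if __name__ == "__main__":
    print("SIGKILL catchable:", can_catch(signal.SIGKILL))
    print("child", describe_exit(run_and_kill()))
```

Since the killed process never gets a chance to react, there is nothing useful to add inside the application itself. The evidence has to come from outside: the kernel log for out-of-memory kills, or the batch scheduler's own logs for time limit kills.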
Hope this helps,
Brian