LAM/MPI General User's Mailing List Archives

From: Brian W. Barrett (brbarret_at_[hidden])
Date: 2004-05-24 13:01:13


On Mon, 24 May 2004 syoon_at_[hidden] wrote:

> My job mysteriously exited from an MPI run
> with the following error messages returned.
> I'm wondering what the "signal 9" means, and why this
> happened.

Signal 9 is SIGKILL, the uncatchable signal of death. There really aren't
a whole lot of things that can cause a SIGKILL to be sent, but most of
them are a bit tough to track down. Here's a short list:

  * Out of memory errors. If the kernel (especially the Linux kernel)
    decides it is under too much memory pressure, it will try to kill
    off some of the processes with a SIGKILL. The most likely suspect
    is the highest memory user, so if your app has a memory leak, that
    could be the cause of the problems.

  * Batch schedulers. If you are running under a batch scheduler and
    have run out of your allocation time, the scheduler often will use
    a SIGKILL to get your processes off the nodes.

  * Random acts of kill. I've seen scripts that don't behave as expected
    and send SIGKILLs to the wrong process. If you have some extra stuff
    flying around, you might want to make sure the script isn't sending
    the SIGKILL accidentally. Or maybe your sysadmin just doesn't like
    you? :)
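
To see what "signal 9" looks like from the parent's side, here is a minimal
Python sketch (not LAM/MPI code; the helper names describe_exit,
run_and_kill and sigkill_is_catchable are made up for illustration, and a
POSIX system is assumed). It starts a child process, sends it SIGKILL,
decodes the resulting exit status, and shows that a handler for SIGKILL
cannot be installed, which is why the victim never gets a chance to clean
up or log anything.

```python
import signal
import subprocess
import sys


def describe_exit(returncode):
    """Turn a subprocess return code into a short human-readable string.

    On POSIX, subprocess reports a child killed by signal N as -N.
    """
    if returncode < 0:
        sig = signal.Signals(-returncode)
        return f"killed by {sig.name} (signal {sig.value})"
    return f"exited normally with status {returncode}"


def run_and_kill():
    """Start a sleeping child, send it SIGKILL, and return its return code."""
    child = subprocess.Popen(
        [sys.executable, "-c", "import time; time.sleep(60)"]
    )
    child.send_signal(signal.SIGKILL)
    return child.wait()


def sigkill_is_catchable():
    """Try to install a handler for SIGKILL; the OS is expected to refuse."""
    try:
        signal.signal(signal.SIGKILL, lambda signum, frame: None)
    except (OSError, ValueError):
        return False
    return True


if __name__ == "__main__":
    print(describe_exit(run_and_kill()))
    print("SIGKILL catchable:", sigkill_is_catchable())
```

This only confirms that a process died from SIGKILL; it does not say who
sent it. For that you still have to check the causes listed above (kernel
logs for the out-of-memory killer, the batch scheduler's accounting, or
stray scripts).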

Hope this helps,

Brian