On Dec 13, 2005, at 8:56 PM, Bogdan Costescu wrote:
> On Tue, 13 Dec 2005, Jeff Squyres wrote:
>
>>> PID 5074 failed on node n0 (134.153.50.235) due to signal 15.
>> I'm assuming that this is a linux system -- signal 15 is ENOTBLK.
>
> Err, no. Signal 15 is SIGTERM as shown by /usr/include/bits/signum.h
> ... you mistakenly looked at errno.h as all signal names start with
> SIG :-)
Haha - good catch.
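
For anyone who wants to check this without digging through headers, here is a minimal sketch (assuming a Linux box with Python 3 available; the variable names are just for illustration). Signal numbers and errno values are separate namespaces that both happen to contain a 15, which is how the mix-up occurred:

```python
import errno
import signal

number = 15

# Look the same number up in both namespaces.
signal_name = signal.Signals(number).name   # signal namespace
errno_name = errno.errorcode[number]        # errno namespace

print(f"signal {number}: {signal_name}")
print(f"errno  {number}: {errno_name}")
```

On Linux the first lookup should report SIGTERM and the second ENOTBLK; other platforms may number things differently.
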
> But this doesn't say much about the reason for terminating the
> parallel job... Maybe the remote shell is not clean - does it write
> something on stdout or stderr ?
It's awfully hard to give the user reasonable feedback on a SIGTERM.
It's also very strange that SIGTERM is what is causing the processes
to die. The only times LAM will send a SIGTERM to a process are when
the user runs lamhalt, wipe, or lamclean. If you are running a
batch scheduler, it's also possible that it is causing the signal to
be generated. You might want to run your application under a
debugger to see if that helps pinpoint where the signal is coming
from. Information on running a LAM job under a debugger can be found
on our FAQ:
http://www.lam-mpi.org/faq/
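
If a debugger is awkward to set up, another rough way to see where a SIGTERM comes from is to block the signal and read the sender's PID from the siginfo the kernel delivers. The sketch below is not part of LAM; it assumes Linux and Python 3, and the function names `wait_for_sigterm` and `describe_sender` are hypothetical helpers made up for this example:

```python
import signal


def wait_for_sigterm(timeout):
    """Block SIGTERM, then wait up to `timeout` seconds for one to arrive.

    Returns a struct_siginfo, or None if nothing arrived in time.
    """
    # Blocking the signal keeps it pending instead of killing the process.
    signal.pthread_sigmask(signal.SIG_BLOCK, {signal.SIGTERM})
    return signal.sigtimedwait({signal.SIGTERM}, timeout)


def describe_sender(info):
    """Format the siginfo fields that identify who sent the signal."""
    return f"signal {info.si_signo} sent by pid {info.si_pid} (uid {info.si_uid})"
```

Calling `wait_for_sigterm(60)` in a test process and then running lamhalt (or letting the batch scheduler do its thing) should give you a PID that you can look up with ps. The same idea can be applied in a C program with sigaction() and the SA_SIGINFO flag.
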
Brian
--
Brian Barrett
LAM/MPI developer and all around nice guy
Have a LAM/MPI day: http://www.lam-mpi.org/