SIGSEGV means that your process has hit a segmentation fault. This is
possibly the fault of your application; I would strongly urge you to
run your application through a memory-checking debugger such as
Valgrind (see the debugging section in the LAM FAQ).

SIGKILL means that an external agent has killed your process (e.g.,
with "kill -9 <your_pid>"). SIGTERM is usually sent by a batch scheduler
that is killing your job because it has run out of time (but it can
occur at other times, too). Both of these signals originate from
outside your process -- you might want to find out where they're coming
from.

If some external agent (such as a batch system) is killing your
process, it could be killing your lamds as well.

That being said, there have been several bug-fix releases since 7.0.
The latest stable version of the 7.0 series is 7.0.6. If nothing else,
you might want to upgrade.
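
As a rough illustration of tracking down where a SIGTERM comes from,
the sketch below blocks the signal and then waits for it, which exposes
the PID and UID of the sender. It is written in Python only to keep it
short, and the function name wait_for_signal is made up for this
example; in a C MPI application the analogous approach would be
sigaction() with SA_SIGINFO. It assumes Linux (signal.sigwaitinfo is
not available on every platform), and it cannot help with SIGKILL,
which can be neither caught nor blocked.

```python
import os
import signal


def wait_for_signal(signums=(signal.SIGTERM,)):
    """Block until one of `signums` arrives.

    Returns (signal_name, sender_pid, sender_uid).
    Hypothetical helper for illustration; not part of LAM.
    """
    watched = set(signums)
    # Block the signals first so they are queued for sigwaitinfo() instead
    # of triggering the default action (which would terminate the process).
    signal.pthread_sigmask(signal.SIG_BLOCK, watched)
    info = signal.sigwaitinfo(watched)
    return signal.Signals(info.si_signo).name, info.si_pid, info.si_uid


if __name__ == "__main__":
    print(f"pid {os.getpid()} waiting for SIGTERM")
    name, sender_pid, sender_uid = wait_for_signal()
    print(f"received {name} from pid {sender_pid} (uid {sender_uid})")
```

To try it, run the script in one terminal and send "kill -TERM <pid>"
from another; the reported sender PID can then be looked up with ps to
see which program issued the signal.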
On Mar 31, 2005, at 12:37 PM, Lily Li wrote:
>
> Hello,
>
> Occasionally, our MPI job crashes due to a signal such as SIGTERM,
> SIGSEGV, or SIGKILL; the lamd on those nodes then also crashes without
> writing any information to the logfile.
>
> Is this a known issue in LAM? Could you find any indication in the
> lamd log file included below?
> This is the last message before the lamd dies.
>
> We are using LAM 7.0 on Red Hat Linux 9 with gcc 3.3.1 and TCP/IP.
> The command used is:
> mpirun -f -w -ssi rpi tcp schemafile
>
> Will the -npty option of mpirun help?
>
> BTW, the "-a" option of the recon command in this version of LAM
> doesn't work as expected: it stops at the first error found and
> doesn't continue to check the remaining hosts in the hostfile
> (recon -v -a hostfile).
>
> Regards,
>
> Lily
>
>
> Mar 30 18:35:34 liv2 lamd[23722]: flatd: flqload - successfully created file /tmp/lam-oroper_at_liv2/lam-flatd10
> Mar 30 18:35:34 liv2 lamd[23722]: flatd: flqload - file descriptor 15
> Mar 30 18:35:34 liv2 lamd[23722]: flatd: flqload - successfully appended 115 bytes to /tmp/lam-oroper_at_liv2/lam-flatd10
> Mar 30 18:35:34 liv2 lamd[23722]: kenyad: pqcreating with rtf 0x441210
> Mar 30 18:35:34 liv2 lamd[23722]: kenyad: checking for directory /home/oroper
> Mar 30 18:35:34 liv2 lamd[23722]: kenyad: looking for executable "/cm/production/r3.00/LINUXM/bin/es" in directory "/home/oroper"
> Mar 30 18:35:34 liv2 lamd[23722]: kenyad: found "/cm/production/r3.00/LINUXM/bin/es"
> Mar 30 18:35:34 liv2 lamd[23722]: kenyad: creating new user process...
> Mar 30 18:35:34 liv2 lamd[23722]: kenyad: setting environment variables to pass to new process
> Mar 30 18:35:34 liv2 lamd[23722]: kenyad: setting TROLLIUSFD
> Mar 30 18:35:34 liv2 lamd[23722]: kenyad: setting TROLLIUSRTF
> Mar 30 18:35:34 liv2 lamd[23722]: kenyad: setting LAMJOBID
> Mar 30 18:35:34 liv2 lamd[23722]: kenyad: setting LAMKENYAPID
> Mar 30 18:35:34 liv2 lamd[23722]: kenyad: setting LAMWORLD
> Mar 30 18:35:34 liv2 lamd[23722]: kenyad: setting LAMPARENT
> Mar 30 18:35:34 liv2 lamd[23722]: kenyad: setting LAMRANK
> Mar 30 18:35:34 liv2 lamd[23722]: kenyad: checking for working directory flag
> Mar 30 18:35:34 liv2 lamd[23722]: kenyad: working directory set explicitly
> Mar 30 18:35:34 liv2 lamd[23722]: kenyad: running in directory /home/oroper
> Mar 30 18:35:34 liv2 lamd[23722]: kenyad: fork/exec succeeded, pid 24335, index 11, rtf 0x441212
> Mar 30 18:35:34 liv2 lamd[23722]: kenyad: create succeeded, process running
> Mar 30 18:35:34 liv2 lamd[23722]: kio_req: new client on fd=15
> Mar 30 18:35:34 liv2 lamd[23722]: kouter: attached process pid=24335, pri=0, fd=15
> Mar 30 18:35:38 liv2 lamd[23722]: kouter: surrendered process pid=0
> Mar 30 18:35:38 liv2 lamd[23722]: died: caught child death; trying to detach
> Mar 30 18:35:38 liv2 lamd[23722]: died: detaching table entry 10
>
>
> ------------------------- end of log -------------------------
>
>
> _______________________________________________
> This list is archived at http://www.lam-mpi.org/MailArchives/lam/
--
{+} Jeff Squyres
{+} jsquyres_at_[hidden]
{+} http://www.lam-mpi.org/