LAM/MPI General User's Mailing List Archives

From: Lily Li (lily.li_at_[hidden])
Date: 2005-03-31 12:37:19


Hello,

Occasionally our MPI job crashes because of a signal such as SIGTERM,
SIGSEGV, or SIGKILL. When that happens, the lamd on those nodes also
dies without writing any information to the log file.

Is this a known issue in LAM? Can you see any indication of the cause in
the lamd log file included below?
It shows the last messages logged before the lamd died.

We are using LAM 7.0 on Red Hat Linux 9 with gcc 3.3.1 and TCP/IP.
The command used is:
   mpirun -f -w -ssi rpi tcp schemafile

Would the -npty option of mpirun help?

BTW, the "-a" option of the recon command in this version of LAM doesn't
work as expected: it stops at the first error found and doesn't continue
checking the remaining hosts in the hostfile
(recon -v -a hostfile).

Regards,

Lily

Mar 30 18:35:34 liv2 lamd[23722]: flatd: flqload - successfully created
file /tmp/lam-oroper_at_liv2/lam-flatd10
Mar 30 18:35:34 liv2 lamd[23722]: flatd: flqload - file descriptor 15
Mar 30 18:35:34 liv2 lamd[23722]: flatd: flqload - successfully appended
115 bytes to /tmp/lam-oroper_at_liv2/lam-flatd10
Mar 30 18:35:34 liv2 lamd[23722]: kenyad: pqcreating with rtf 0x441210
Mar 30 18:35:34 liv2 lamd[23722]: kenyad: checking for directory
/home/oroper
Mar 30 18:35:34 liv2 lamd[23722]: kenyad: looking for executable
"/cm/production/r3.00/LINUXM/bin/es" in directory "/home/oroper"
Mar 30 18:35:34 liv2 lamd[23722]: kenyad: found
"/cm/production/r3.00/LINUXM/bin/es"
Mar 30 18:35:34 liv2 lamd[23722]: kenyad: creating new user process...
Mar 30 18:35:34 liv2 lamd[23722]: kenyad: setting environment variables
to pass to new process
Mar 30 18:35:34 liv2 lamd[23722]: kenyad: setting TROLLIUSFD
Mar 30 18:35:34 liv2 lamd[23722]: kenyad: setting TROLLIUSRTF
Mar 30 18:35:34 liv2 lamd[23722]: kenyad: setting LAMJOBID
Mar 30 18:35:34 liv2 lamd[23722]: kenyad: setting LAMKENYAPID
Mar 30 18:35:34 liv2 lamd[23722]: kenyad: setting LAMWORLD
Mar 30 18:35:34 liv2 lamd[23722]: kenyad: setting LAMPARENT
Mar 30 18:35:34 liv2 lamd[23722]: kenyad: setting LAMRANK
Mar 30 18:35:34 liv2 lamd[23722]: kenyad: checking for working directory
flag
Mar 30 18:35:34 liv2 lamd[23722]: kenyad: working directory set
explicitly
Mar 30 18:35:34 liv2 lamd[23722]: kenyad: running in directory
/home/oroper
Mar 30 18:35:34 liv2 lamd[23722]: kenyad: fork/exec succeeded, pid
24335, index 11, rtf 0x441212
Mar 30 18:35:34 liv2 lamd[23722]: kenyad: create succeeded, process
running
Mar 30 18:35:34 liv2 lamd[23722]: kio_req: new client on fd=15
Mar 30 18:35:34 liv2 lamd[23722]: kouter: attached process pid=24335,
pri=0, fd=15
Mar 30 18:35:38 liv2 lamd[23722]: kouter: surrendered process pid=0
Mar 30 18:35:38 liv2 lamd[23722]: died: caught child death; trying to
detach
Mar 30 18:35:38 liv2 lamd[23722]: died: detaching table entry 10

------------------------------------------------------------------------------------ end of log ----------------------
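
The log excerpt above is hard-wrapped by the mail client, so a single lamd
message can span two lines. As a reading aid only, here is a minimal Python
sketch that rejoins the wrapped lines and splits each record into timestamp,
host, pid, component (flatd, kenyad, kouter, died, ...) and message. The
function name unwrap_lamd_log and the regular expression are assumptions made
for this illustration; they are not part of LAM/MPI and are only fitted to the
line shapes visible in this post.

```python
import re

# A record starts with a syslog-style timestamp, a host name and "lamd[pid]:".
# This pattern is an assumption based on the excerpt in this post.
RECORD_START = re.compile(
    r"^(?P<ts>\w{3} +\d+ \d\d:\d\d:\d\d) (?P<host>\S+) lamd\[(?P<pid>\d+)\]: "
    r"(?P<component>\w+): (?P<msg>.*)$"
)


def unwrap_lamd_log(lines):
    """Join mail-wrapped continuation lines and parse each lamd record."""
    records = []
    for raw in lines:
        line = raw.rstrip("\n")
        if line.startswith("---"):
            # Separator such as the "end of log" marker: stop reading.
            break
        match = RECORD_START.match(line)
        if match:
            records.append(match.groupdict())
        elif records and line.strip():
            # Not a new record: treat it as the wrapped tail of the previous one.
            records[-1]["msg"] += " " + line.strip()
    return records


sample = [
    "Mar 30 18:35:38 liv2 lamd[23722]: kouter: surrendered process pid=0",
    "Mar 30 18:35:38 liv2 lamd[23722]: died: caught child death; trying to",
    "detach",
    "Mar 30 18:35:38 liv2 lamd[23722]: died: detaching table entry 10",
]
records = unwrap_lamd_log(sample)
last = records[-1]
print(last["ts"], last["component"], last["msg"])
```

Taking the final element of the returned list gives the last message a given
lamd logged before it went silent, which is the line of interest when
comparing several nodes.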