Greetings,
Under LAM 7.1.1, lamhalt works correctly, but lamwipe does not seem to export the
suffix to the remote nodes, so tkill fails on them.
The following debug fragment should illustrate the problem. tkill works on the
first node (n0), then fails on all the others, where it looks in a session
directory without the suffix.
(LAM_MPI_SOCKET_SUFFIX and LAM_MPI_SESSION_SUFFIX were both set to 'fredtest'
before the lamboot and lamwipe.)
tkill: setting prefix to (null)
tkill: setting suffix to (null)
tkill: got killname back: /tmp/lam-testacc_at_testnode49-fredtest/lam-killfile
tkill: removing socket file ...
tkill: socket file: /tmp/lam-testacc_at_testnode49-fredtest/lam-kernel-socketd
tkill: removing IO daemon socket file ...
tkill: IO daemon socket file: /tmp/lam-testacc_at_testnode49-fredtest/lam-io-socket
tkill: f_kill = "/tmp/lam-testacc_at_testnode49-fredtest/lam-killfile"
tkill: nothing to kill: "/tmp/lam-testacc_at_testnode49-fredtest/lam-killfile"
n-1<30538> ssi:boot:rsh: successfully launched on n0 (testnode49)
n-1<30538> ssi:boot:base:linear: booting n1 (testnode18)
n-1<30538> ssi:boot:rsh: starting wipe on (testnode18)
n-1<30538> ssi:boot:rsh: starting on n1 (testnode18): tkill -d -v
n-1<30538> ssi:boot:rsh: launching remotely
n-1<30538> ssi:boot:rsh: attempting to execute: rsh testnode18 -n 'echo $SHELL'
n-1<30538> ssi:boot:rsh: remote shell /bin/tcsh
n-1<30538> ssi:boot:rsh: attempting to execute: rsh testnode18 -n tkill -d -v
tkill: setting prefix to (null)
LAM 7.1.1/MPI 2 C++/ROMIO - Indiana University
tkill: setting suffix to (null)
tkill: got killname back: /tmp/lam-testacc_at_testnode18/lam-killfile
tkill: f_kill = "/tmp/lam-testacc_at_testnode18/lam-killfile"
tkill: nothing to kill: "/tmp/lam-testacc_at_testnode18/lam-killfile"
n-1<30538> ssi:boot:rsh: successfully launched on n1 (testnode18)
n-1<30538> ssi:boot:base:linear: booting n2 (testnode32)
n-1<30538> ssi:boot:rsh: starting wipe on (testnode32)
n-1<30538> ssi:boot:rsh: starting on n2 (testnode32): tkill -d -v
n-1<30538> ssi:boot:rsh: launching remotely
n-1<30538> ssi:boot:rsh: attempting to execute: rsh testnode32 -n 'echo $SHELL'
n-1<30538> ssi:boot:rsh: remote shell /bin/tcsh
n-1<30538> ssi:boot:rsh: attempting to execute: rsh testnode32 -n tkill -d -v
tkill: setting prefix to (null)
tkill: setting suffix to (null)
tkill: got killname back: /tmp/lam-testacc_at_testnode32/lam-killfile
tkill: f_kill = "/tmp/lam-testacc_at_testnode32/lam-killfile"
tkill: nothing to kill: "/tmp/lam-testacc_at_testnode32/lam-killfile"
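To make the mismatch explicit, here is a minimal sketch of the directory naming pattern as it appears in the tkill output above. It is not LAM's actual code: the helper name session_dir and its parameters are hypothetical, and the pattern is inferred only from the paths in the fragment. The "expected" path for testnode18 is an assumption about what lamboot would have created with the suffix set.

```python
def session_dir(user, host, suffix=None, tmpdir="/tmp"):
    """Hypothetical helper: rebuild the directory name pattern seen in the tkill output."""
    name = f"lam-{user}_at_{host}"
    if suffix:
        # The suffix only appears when it is present in tkill's environment.
        name += f"-{suffix}"
    return f"{tmpdir}/{name}"


# n0: the suffix is in the local environment, so the suffixed directory is used.
local = session_dir("testacc", "testnode49", "fredtest")

# n1: presumably created with the suffix at lamboot time (assumption) ...
expected = session_dir("testacc", "testnode18", "fredtest")
# ... but the remote tkill has no suffix set, so it looks here instead.
actual = session_dir("testacc", "testnode18")

print(local)
print(expected)
print(actual)
```

The last two paths differ only by the missing "-fredtest", which matches the killfile path that the remote tkill reports.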
lamwipe is still preferred over lamhalt because lamhalt does not complete
if a node failure has occurred.
The most recent beta I have tried, lam-7.2b1r10139, has this problem too.
Best regards,
Richard Winnett
Fugro Seismic Imaging Ltd