Thanks for the reply, Brian. As to your first comment about the efficacy of
'lamboot -x':
I thought I might have been missing something when I looked over the code
(literally I just followed the argument processing in lamboot). Obviously I
missed something - I had read about fault-tolerance many times and was really
surprised when I couldn't actually detect a difference in operation.
> If you are running LAM in fault-tolerant mode, the LAM "signal"
> LAM_SIGSHRINK should get sent to all other nodes when a node fails. I
I have experimented with the ksignal handlers (both with 6.5.9 and 7.0 -
which is working fine so far :-), and LAM_SIGSHRINK didn't always seem to
get raised.
> think this is what you want. See the lam_ksignal(2) man page for more
> information. You may have to compile LAM with the --with-trillium flag in
> order to install all the proper header files and man pages.
Ah - I just tried using lam_ksignal rather than ksignal (which is also buried
in the LAM documentation via mpi.h). I would have thought these were the same
calls, but I've just checked with some simpler examples (put exit() in the
signal handler!) and I'm getting the correct semantics (LAM_SIGSHRINK gets
sent in fault-tolerant mode).
The problem seems to be that mpirun keeps waiting for a nonexistent
process to finish when a node is tkilled. For all other purposes the
behaviour is perfect. I presume this has something to do with the protocol
for starting LAM remote processes.
I know 'mpirun -nw' is one solution, but it's not too useful for my purposes,
and I still don't see why mpirun hangs around if it has already put up its 'It
seems that one of the processes has died with a nonzero...' message.
Any ideas?
Jim.