LAM/MPI General User's Mailing List Archives

From: Jim Procter (procter_at_[hidden])
Date: 2003-05-31 14:58:57


Thanks for the reply, Brian. As to your first comment about the efficacy of
'lamboot -x':
I thought I might have been missing something when I looked over the code
(I literally just followed the argument processing in lamboot). Obviously I
did miss something: I had read about fault tolerance many times and was really
surprised when I couldn't actually detect a difference in operation.

> If you are running LAM in fault-tolerant mode, the LAM "signal"
> LAM_SIGSHRINK should get sent to all other nodes when a node fails. I

I have experimented with the ksignal handlers (both with 6.5.9 and 7.0 -
 which is working fine so far :-), and LAM_SIGSHRINK didn't always seem to
 get raised.

> think this is what you want. See the lam_ksignal(2) man page for more
> information. You may have to compile LAM with the --with-trillium flag in
> order to install all the proper header files and man pages.
Ah - I just tried using lam_ksignal rather than ksignal (which is also buried
in the LAM documentation via mpi.h). I would have thought these were the same
call, but I've just checked with some simpler examples (put exit() in the
signal handler!) and I'm getting the correct behaviour (LAM_SIGSHRINK gets sent
in fault-tolerant mode).

The problem seems to be that mpirun is hanging around for a nonexistent
process to finish when a node is tkilled. For all other purposes the
behaviour is perfect. I presume this is something to do with the protocol for
starting LAM remote procedures.

I know 'mpirun -nw' is one solution, but it's not too useful for my purposes,
and I still don't see why mpirun hangs around if it's already put up its 'It
seems that one of the processes has died with a nonzero...' message.

Any ideas?
Jim.