LAM/MPI General User's Mailing List Archives

From: Ulisses (ra993482_at_[hidden])
Date: 2004-03-17 16:24:09


On Mon, 2004-03-15 at 11:22, Ross Torkington wrote:

> I've tried creating intercommunicators for each slave-master
> relationship and I've tried using the lamboot fault tolerance option
> (-x). My question is, with either of these methods, how do I have the
> master recognise when a fault has occurred? Is there a command I can
> use to retrieve "heartbeat" information?

        Hmm, this seems to be an obscure topic, but I've done some
research and found that the signal LAM_SIGSHRINK is sent to every
process when a dead node is detected. I also found that lamd_shrink()
(LAM 7.0.4, share/ssi/rpi/lamd/src/ssi_rpi_lamd.c) is the handler, and
that it invalidates the node and the communicators involving it. The
ksignal() function "redirects a signal to a user supplied handling
routine" (LAM 7.0.4, share/kreq/ksignal.c), but I'm not sure how to use
it properly. I'm under the impression that some work has to be done to
make this feature easily usable by MPI applications. Does any LAM
developer have more information to share?
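
For illustration only, here is a minimal sketch of the general pattern
("redirect a signal to a user supplied handling routine, then let the
master poll for it"), written with plain POSIX sigaction() rather than
LAM's ksignal(), whose signature is not given in this message.
DEMO_SIGSHRINK (mapped to SIGUSR1), on_shrink(),
install_shrink_handler() and take_node_failures() are all hypothetical
names made up for this sketch; none of them is part of LAM, and whether
LAM_SIGSHRINK can be caught this way is exactly the open question.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <string.h>

/* ASSUMPTION: stand-in for LAM_SIGSHRINK; SIGUSR1 is used only so the
 * sketch runs without LAM. This is not how LAM delivers its signal. */
#define DEMO_SIGSHRINK SIGUSR1

/* Incremented by the handler, read and reset by the master's work loop. */
static volatile sig_atomic_t node_failures = 0;

/* Hypothetical user-supplied handler: do only async-signal-safe work. */
static void on_shrink(int signo)
{
    (void)signo;
    node_failures++;
}

/* Install the handler. Returns 0 on success, -1 on failure. */
int install_shrink_handler(void)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_shrink;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0;
    return sigaction(DEMO_SIGSHRINK, &sa, NULL);
}

/* Return the number of notifications seen since the last call and
 * reset the counter. The signal is blocked around the read-and-reset
 * so a notification arriving in between is not lost. */
int take_node_failures(void)
{
    sigset_t block, old;
    int n;

    sigemptyset(&block);
    sigaddset(&block, DEMO_SIGSHRINK);
    sigprocmask(SIG_BLOCK, &block, &old);
    n = node_failures;
    node_failures = 0;
    sigprocmask(SIG_SETMASK, &old, NULL);
    return n;
}
```

The master would call install_shrink_handler() once at startup and then
check take_node_failures() in its main loop. Note that ordinary POSIX
signals are not queued, so several notifications arriving close together
may be counted as one.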

Regards,

-- Ulisses