> On Mon, 2004-03-15 at 11:22, Ross Torkington wrote:
>
> > I've tried creating intercommunicators for each slave-master
> > relationship and I've tried using the lamboot fault tolerance option
> > (-x). My question is, with either of these methods, how do I have the
> > master recognise when a fault has occurred? Is there a command I can
> > use to retrieve "heartbeat" information?
>
> Hmm... this seems to be an obscure topic, but I've done some research
> and found that signal LAM_SIGSHRINK is sent to every process when a dead
> node is detected. I also found that lamd_shrink() (LAM 7.0.4,
> share/ssi/rpi/lamd/src/ssi_rpi_lamd.c) is the handler, and it
> invalidates the node and the communicators involving it. The ksignal()
> function "redirects a signal to a user supplied handling routine" (LAM
> 7.0.4, share/kreq/ksignal.c), but I'm not sure how to use it properly.
> I'm under the impression that some work has to be done to make this
> feature easily usable by MPI applications. Does any LAM developer have
> more information to share?
Sorry for the delay. Actually, there is going to be a bit more delay: I
have to dig through the code and learn it before I can answer you :-). I
will answer as soon as I find out.
Regards,
Anju
>
> Regards,
>
> -- Ulisses
>
>
> _______________________________________________
> This list is archived at http://www.lam-mpi.org/MailArchives/lam/
>