Sorry for the delay.
Fault tolerance is very hard to implement portably in an MPI environment.
However, if source code portability is not a major concern for you, there
are a few options:
1. Write MPI code that implements heartbeats between the manager and the
worker processes. This can be done with non-blocking send/recv style
calls that keep tabs on each process. You will also want to set the error
handlers on the pertinent communicators to MPI_ERRORS_RETURN. This might
not always work properly, though, because no state information is
returned, so it is very difficult to work out what exactly caused the
error. It is adequate when you don't much care which error has occurred.
2. Register a LAM signal handler for LAM_SIGSHRINK and use it to find out
when a process is dead. This provides heartbeat-like functionality: it
lets you detect a node failure and take appropriate measures. It works
with the "-x" option of lamboot. Note that this is the LAM signal handling
facility (not POSIX), although the semantics of what you can and cannot do
are almost the same as for POSIX signal handlers. This used to be primary
functionality in earlier versions of LAM, but of late it has not been used
extensively and therefore has not been reliably tested in recent times.
3. This is the least promising option and, again, is not portable. Users
can register their own error handler and catch MPI_ERR_REMOTEDEAD on the
(inter)communicator. This, again, has not been reliably tested in recent
times.
You might also want to look at the fault tolerance example that ships with
the LAM tarballs. Its relative path is "examples/fault" from the top-level
directory. Another resource is the man page for lam_ksignal.
Hope this helped,
Anju
This too shall pass ......
On Wed, 17 Mar 2004, Prabhanjan Kambadur wrote:
>
>
>
> > On Mon, 2004-03-15 at 11:22, Ross Torkington wrote:
> >
> > > I've tried creating intercommunicators for each slave-master
> > > relationship and I've tried using the lamboot fault tolerance option
> > > (-x). My question is, with either of these methods, how do I have the
> > > master recognise when a fault has occured? Is there a command I can
> > > use to retrieve "heartbeat" information?
> >
> > Humm.. this seems to be an obscure topic, but I've done some research
> > and found that signal LAM_SIGSHRINK is sent to every process when a dead
> > node is detected. I also found that lamd_shrink() (LAM 7.0.4,
> > share/ssi/rpi/lamd/src/ssi_rpi_lamd.c) is the handler, and it
> > invalidates the node and the communicators involving it. The ksignal()
> > function "redirects a signal to a user supplied handling routine" (LAM
> > 7.0.4, share/kreq/ksignal.c), but I'am not sure how to properly use it.
> > I'm under the impression some work have to be done to make this feature
> > easily usable by the MPI applications. Any LAM developer has more
> > information to share?
>
> Sorry, for the delay. Actually, there is going to be a bit more delay. I
> have had to dig through the code and learn before I answer you :-). So,
> will answer as soon as I find out.
>
> Regards,
> Anju
>
> >
> > Regards,
> >
> > -- Ulisses
> >
> >
> > _______________________________________________
> > This list is archived at http://www.lam-mpi.org/MailArchives/lam/
> >
>