
LAM/MPI General User's Mailing List Archives


From: Brian W. Barrett (brbarret_at_[hidden])
Date: 2002-11-09 19:04:49


On Mon, 21 Oct 2002, David Mellor wrote:

> This next question then is - can I still receive messages from multiple
> slaves that are now spread over many different communicators? Is it possible
> to give probe and recv functions an 'any_communicator' flag for the
> communicator, or would I have to multi-thread a whole series of reads
> myself?

(embarrassed look on face)

Sorry for the delay in responding. You cannot use probe or recv across
multiple communicators; each call takes exactly one communicator. There
isn't a good way around this problem - you could probably build up a single
communicator from all the "pair-wise" communicators with the worker
processes and simply build a new one each time a failure occurs. However,
LAM isn't great about recycling communicators at this point, so after
enough failures you will not have enough resources left to create a new
communicator.
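
If busy-polling is acceptable, a not-so-good way around it is to keep the
pair-wise communicators and check each one in turn with a non-blocking
probe. What follows is only a minimal sketch of that bookkeeping, not
something taken from LAM itself. The names worker_slot, probe_fn,
poll_workers, flag_probe and max_idle are all made up for illustration, the
"timeout" is simply a count of empty polls rather than wall-clock time, and
the probe is passed in as a callback so the loop can be tried without MPI.
In a real manager the callback would presumably wrap MPI_Iprobe on that
worker's own communicator.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-worker record: one slot per pair-wise communicator. */
typedef struct {
    int alive;       /* 0 once the application has declared the worker dead */
    int idle_polls;  /* consecutive polls that found no pending message */
} worker_slot;

/* Hypothetical callback: returns nonzero if a message from worker `idx`
 * is waiting. In an MPI program this would wrap a non-blocking
 * MPI_Iprobe on the communicator that belongs to that worker. */
typedef int (*probe_fn)(size_t idx, void *ctx);

/* Make one pass over the workers, starting at *cursor so that no worker
 * is always served first. Returns the index of the first live worker
 * with a pending message, or -1 if none has one. A worker that stays
 * silent for max_idle consecutive polls is marked dead and is never
 * probed again (assumption: silence is how failure is detected). */
int poll_workers(worker_slot *w, size_t n, size_t *cursor,
                 int max_idle, probe_fn probe, void *ctx)
{
    assert(w != NULL && cursor != NULL && probe != NULL && n > 0);

    for (size_t step = 0; step < n; ++step) {
        size_t i = (*cursor + step) % n;

        if (!w[i].alive)
            continue;               /* skip communicators of dead workers */

        if (probe(i, ctx)) {
            w[i].idle_polls = 0;
            *cursor = (i + 1) % n;  /* resume after this worker next time */
            return (int)i;
        }

        if (++w[i].idle_polls >= max_idle)
            w[i].alive = 0;         /* timed out: stop using this slot */
    }
    return -1;
}

/* Stand-in probe for experimenting without MPI: ctx points at an array
 * of int flags, one per worker, nonzero meaning "message pending". */
int flag_probe(size_t idx, void *ctx)
{
    const int *pending = ctx;
    return pending[idx];
}
```

The manager would call poll_workers in a loop and post the actual recv only
on the communicator of the worker it returns. Once a slot is marked dead,
nothing touches that worker's communicator or status objects again.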

Hope this helps,

Brian

> ----- Original Message -----
> From: "Brian W. Barrett" <brbarret_at_[hidden]>
> To: <lam_at_[hidden]>
> Sent: Wednesday, October 16, 2002 4:54 PM
> Subject: Re: LAM: fault tolerance question
>
>
> > On Fri, 11 Oct 2002, David Mellor wrote:
> >
> > > I have been trying to get a LAM-MPI application to work in a fault
> > > tolerant fashion. My application uses timeout logic to tolerate the
> > > death of slave processes, but it requires that the MPI implementation
> > > continues communicating in the face of slave failures. I have had
> > > limited success when using the mpirun -nw (no-wait) option, but I am
> > > having problems with MPI_Get_count returning random large message
> > > sizes on the master (rank 0) after a slave has been killed, and with
> > > slaves dying without warning.
> > >
> > > Does anyone have any information on the techniques & wisdom of using
> > > LAM-MPI in this way, please? 100% correct fault tolerance behaviour is
> > > not essential.
> >
> > First, the boilerplate warning. Everything contained in this e-mail
> > applies to LAM/MPI only. The implementation details exploited here will
> > not work on other MPI implementations. In a perfect world (i.e., no
> > failures), any codes written to exploit what little fault tolerance there
> > is on LAM/MPI will run just fine on other MPI implementations. But as
> > soon as a failure occurs, all bets are off.
> >
> > Ok, now that I'm done scaring people away, on to the fun stuff :). The
> > MPI layer in LAM/MPI is not particularly fault tolerant. There are a
> > number of reasons for this, notably the fact that the MPI standard makes
> > it very difficult to be both conformant and fault tolerant. In
> > particular, when a process that is a rank in a communicator dies, LAM has
> > a hard time doing much of anything else useful with anything at all
> > associated with that communicator (including calling MPI_Get_count on a
> > Status object that was used in communication on that communicator).
> >
> > For now, the only feasible solution in LAM/MPI is to use MPI_COMM_SPAWN to
> > spawn off all processes individually, so that the manager has a bunch of
> > individual communicators to the workers. When a worker dies, you (as in
> > the application) have to recognize that all the status structures /
> > communicators / etc with that process just became useless. But you still
> > have perfectly valid structures with all the other processes.
> >
> > There is an example in examples/fault/, complete with README, in the
> > LAM/MPI source tarball. It should point you in the right direction. One
> > other important note. The manager can't call MPI_DISCONNECT on a
> > communicator whose worker has died. It may (I'm not positive) be able to
> > get away with MPI_COMM_FREE. But no matter what, you can't properly clean
> > up after yourself. So a very small amount of resources will be leaked on
> > each failure. This is regrettable, but unavoidable currently.
> >
> > We are actively researching fault tolerance problems in the MPI
> > environment, and hope to solve many of these problems. However, it will
> > be some time before their results are released in a stable release of
> > LAM/MPI.
> >
> > Hope this helps,
> >
> > Brian
> >
> > --
> > Brian Barrett
> > LAM/MPI developer and all around nice guy
> > Have a LAM/MPI day: http://www.lam-mpi.org/
> >
> > _______________________________________________
> > This list is archived at http://www.lam-mpi.org/MailArchives/lam/
>

-- 
  Brian Barrett
  LAM/MPI developer and all around nice guy
  Have a LAM/MPI day: http://www.lam-mpi.org/