LAM/MPI General User's Mailing List Archives

From: Alexander L. Belikoff (ABEL_at_[hidden])
Date: 2006-10-17 13:36:13


Hmm... checkpoints and restarts are good stuff in general, but they
require a lot of redesign and add a significant amount of complexity.

The reason I raised the question in the first place was because I was
toying with an idea of transitioning a distributed application that
currently uses PVM to MPI. In PVM, handling peer process termination as
well as control over the application topology in the VM is fairly easy.
Since we are obsessive (for a good reason) about fault tolerance, our
application (that is, its "rank 0" master process) knows when a "slave"
dies and resubmits the job to another one. Moreover, we can also do cool
things like "blacklisting" some slaves on a certain machine, when we are
confident the machine is not doing well, and get those slaves restarted
after some period of time - all from within the master process!
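
To make the pattern above concrete, here is a minimal sketch of the
master-side bookkeeping: resubmit a job when its worker dies, blacklist a
host after repeated deaths, and let it back in after a while. It is a
plain in-process simulation, not PVM or MPI code. The Master class, the
max_failures and blacklist_period parameters, the tick counter, and the
use of RuntimeError to stand in for a "worker died" notification are all
assumptions made for illustration.

```python
from collections import deque


class Master:
    """Hypothetical master: resubmits jobs and blacklists failing hosts."""

    def __init__(self, workers, max_failures=2, blacklist_period=3):
        self.workers = dict(workers)      # worker id -> host name
        self.max_failures = max_failures  # deaths tolerated per host
        self.blacklist_period = blacklist_period  # ticks a host stays out
        self.failures = {}                # host -> observed worker deaths
        self.blacklist = {}               # host -> tick when it may return
        self.tick = 0                     # simulated clock

    def available(self):
        # Re-admit hosts whose blacklist period has expired.
        for host, until in list(self.blacklist.items()):
            if self.tick >= until:
                del self.blacklist[host]
                self.failures[host] = 0
        return [w for w, h in self.workers.items() if h not in self.blacklist]

    def report_death(self, worker):
        # Count the death against the host and blacklist it if needed.
        host = self.workers[worker]
        self.failures[host] = self.failures.get(host, 0) + 1
        if self.failures[host] >= self.max_failures:
            self.blacklist[host] = self.tick + self.blacklist_period

    def run(self, jobs, execute):
        pending = deque(jobs)
        results = {}
        while pending:
            self.tick += 1
            workers = self.available()
            if not workers:
                continue  # every host is blacklisted; wait for one to expire
            job = pending.popleft()
            worker = workers[self.tick % len(workers)]
            try:
                results[job] = execute(worker, job)
            except RuntimeError:
                # Stand-in for a "worker died" notification: resubmit the job.
                self.report_death(worker)
                pending.append(job)
        return results


def flaky_execute(worker, job):
    # Assumed failure model: worker "w1" always dies, others succeed.
    if worker == "w1":
        raise RuntimeError("worker died")
    return job * job


master = Master({"w0": "good-host", "w1": "bad-host"})
results = master.run([1, 2, 3, 4], flaky_execute)
print(sorted(results.items()))
```

Every job completes even though one worker keeps dying, because the
master resubmits failed jobs and stops scheduling on the bad host for a
few ticks. Note that this sketch loops forever if no worker can ever
succeed; a real master would need a retry limit.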

Unfortunately, I don't see how this level of service can be achieved in
MPI (at least in a fairly standard-compliant implementation) -
especially given your response. Which is somewhat sad, since in a
distributed application (which is MPI's "raison d'etre") there are
plenty of points of failure and many failures are not critical enough to
justify the full application restart. It would be great to see a fairly
simple API (no need for transaction/restarts/checkpoints) achieving just
that - in my opinion, it would make MPI much more suitable for
reasonably fault-tolerant applications (a requirement for many large
systems, including the one I'm dealing with).

Regards,
-- Sasha

Jeff Squyres wrote:
>
> LAM is -- at best -- only pseudo-able to handle the death of an MPI
> process. Specifically, I wouldn't recommend trying to write a fault
> tolerant MPI application using LAM/MPI that could withstand the death
> of a process in MPI_COMM_WORLD.
>
> Keep in mind that MPI [quite intentionally] does not specify what
> happens when a process dies, so it's totally up to the implementation
> as to what to do. Most MPIs, LAM/MPI included, simply kill the rest
> of the job. FT-MPI out of the University of Tennessee allows you to
> do some interesting things, but you need to specifically write code
> to their API, etc.
>
> Work is ongoing in Open MPI to be able to handle these kinds of
> errors. The first step is adding checkpoint/restart capabilities in
> Open MPI (the hardest part of which is all the infrastructure needed
> to make that possible), and then we'll do more interesting things
> after that (to include FT-MPI-like things).
>