LAM/MPI General User's Mailing List Archives

From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2005-09-02 06:04:56


On Sep 2, 2005, at 1:35 AM, Jim Lasc wrote:

> When a process crashes, all the other processes in its
> COMM_WORLD also seem to crash.

Correct; LAM does this by default because the default MPI error handler
is MPI_ERRORS_ARE_FATAL (meaning that an error in one process aborts
all of the other processes as well).

> What should you do to make an MPI program "crash-resistant", by
> which I mean: ensure that not all the nodes crash when one goes down?
> (I'm not speaking of synchronising the data between the nodes to avoid
> losses etc...)

This is a relatively difficult area for most MPI implementations
(including LAM). The issue is that the MPI standard does *not*
guarantee the state of anything once an error occurs. You can set the
error handler on MPI_COMM_WORLD to be something other than
MPI_ERRORS_ARE_FATAL (e.g., MPI_ERRORS_RETURN), but the internal state
of the MPI implementation may not be stable after an error occurs
(e.g., you try to MPI_SEND to a process that has died).

So I don't really have a good answer for you, unfortunately. There's a
lot of research going on in this area right now, but no MPI
implementation is categorically "bulletproof" in this kind of
situation.

-- 
{+} Jeff Squyres
{+} jsquyres_at_[hidden]
{+} http://www.lam-mpi.org/