On Sep 2, 2005, at 1:35 AM, Jim Lasc wrote:
> When you have a process which crashes, all the processes of its
> COMM_WORLD also seem to crash.
Correct; LAM does this by default because the default error handler in
MPI is MPI_ERRORS_ARE_FATAL (meaning that a single error aborts all of
the other processes).
> What should you do to make an MPI program "crash-resistant", by
> which I mean: make sure that not all the nodes crash when one goes down?
> (I'm not speaking of synchronising the data between the nodes to avoid
> losses etc...)
This is a relatively difficult area for most MPI implementations
(including LAM). The issue is that the MPI standard does *not*
guarantee the state of anything once an error occurs. You can set the
error handler on MPI_COMM_WORLD to something other than the default
MPI_ERRORS_ARE_FATAL (e.g., MPI_ERRORS_RETURN), but the internal state
of the MPI implementation may not be stable after an error occurs
(e.g., after you try to MPI_SEND to a process that has died).
So I don't really have a good answer for you, unfortunately. There's a
lot of research going on in this area right now, but no MPI
implementation is categorically "bulletproof" in this kind of
situation.
--
{+} Jeff Squyres
{+} jsquyres_at_[hidden]
{+} http://www.lam-mpi.org/