OK. Thanks a lot.
Just out of interest (to give the students some background info): does
PVM have such a "bullet-proof" algorithm?
Jim
On 9/2/05, Jeff Squyres <jsquyres_at_[hidden]> wrote:
>
> On Sep 2, 2005, at 1:35 AM, Jim Lasc wrote:
>
> > When you have a process which crashes, all the processes of its
> > MPI_COMM_WORLD also seem to crash.
>
> Correct; LAM does this by default because the default error handler in
> MPI is MPI_ERRORS_ARE_FATAL (meaning that one error triggers aborting
> the rest of the processes).
>
> > What should you do to make an MPI program "crash-resistant", by
> > which I mean: make sure that not all the nodes crash when one goes down?
> > (I'm not speaking of synchronising the data between the nodes to avoid
> > losses etc...)
>
> This is a relatively difficult area for most MPI implementations
> (including LAM). The issue is that the MPI standard does *not*
> guarantee the status of anything once an error occurs. You can set the
> default error handler on MPI_COMM_WORLD to be something other than
> MPI_ERRORS_ARE_FATAL (e.g., MPI_ERRORS_RETURN), but the internal state of
> the MPI implementation may not be stable when an error occurs (e.g.,
> you try to MPI_SEND to a process that has died).
>
> So I don't really have a good answer for you, unfortunately. There's a
> lot of research going on in this area right now, but no MPI
> implementation is categorically "bullet proof" in this kind of
> situation.
>
> --
> {+} Jeff Squyres
> {+} jsquyres_at_[hidden]
> {+} http://www.lam-mpi.org/
>
> _______________________________________________
> This list is archived at http://www.lam-mpi.org/MailArchives/lam/
>