
LAM/MPI General User's Mailing List Archives


From: William Bierman (wbierman_at_[hidden])
Date: 2004-10-15 15:29:42


> > - It's still quite an open question how to do fault tolerance in an
> > MPI application properly. Solutions range from checkpoint / restart
> > (in LAM) to fully user-controlled (e.g., FT-MPI). What's the Right
> > solution? It's hard to say, and it's also likely to be an
> > application-specific answer. You might want to have a look at FT-MPI.
>
> There is another interesting project, called MPICH-V, that implements
> checkpoint / restart based on various checkpointing and message
> logging protocols. More details at http://www.lri.fr/~gk/MPICH-V/

I think at the end of the day, the only way to be fault tolerant at
the process-handling level (meaning the scheduler itself does the
adapting for a process when a node drops off) is to take snapshots of
every node's memory for that process and restore them someplace else.
This is obviously insane. Perhaps I will limit myself to making a
new master pop up in the place of one that dies off, and simply killing
all processes and restarting them.
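
A rough sketch of that "kill everything and restart" fallback might look
like the following. This is only an illustration in plain Python using
the standard subprocess module, not anything from LAM/MPI itself; the
function name run_with_restart and its parameters (commands,
max_restarts, poll_interval) are made up for the example, and it assumes
a nonzero exit code is how a dead process shows up.

```python
import subprocess
import sys
import time


def run_with_restart(commands, max_restarts=3, poll_interval=0.05):
    """Run all commands; if any one fails, kill the rest and restart them all.

    Returns the number of restarts that were needed.
    Raises RuntimeError if the job still fails after max_restarts restarts.
    (Hypothetical helper: the name and parameters are assumptions.)
    """
    for attempt in range(max_restarts + 1):
        procs = [subprocess.Popen(cmd) for cmd in commands]
        failed = False
        while procs and not failed:
            time.sleep(poll_interval)
            still_running = []
            for p in procs:
                code = p.poll()          # None means the process is still alive
                if code is None:
                    still_running.append(p)
                elif code != 0:          # assumption: nonzero exit == died
                    failed = True
            procs = still_running
        if not failed:
            return attempt               # every process exited cleanly
        # One process died: tear down the survivors before starting over.
        for p in procs:
            p.kill()
        for p in procs:
            p.wait()
    raise RuntimeError("job failed after %d restarts" % max_restarts)


if __name__ == "__main__":
    worker = [sys.executable, "-c", "pass"]   # stand-in for a real worker
    print("restarts needed:", run_with_restart([worker, worker]))
```

Nothing is saved between attempts here, so every restart begins from
scratch; that is the tradeoff against taking memory snapshots.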

Thanks for everyone's input! That paper was particularly interesting.

Bill