> > - It's still quite an open question how to do fault tolerance in an
> > MPI application properly. Solutions range from checkpoint / restart
> > (in LAM) to fully user-controlled (e.g., FT-MPI). What's the Right
> > solution? It's hard to say, and it's also likely to be an
> > application-specific answer. You might want to have a look at FT-MPI.
>
> There is another interesting project, MPICH-V, that implements
> checkpoint / restart based on various checkpointing and message-logging
> protocols. More details at http://www.lri.fr/~gk/MPICH-V/

I think at the end of the day, the only way to be fault tolerant at
the process-handling level (meaning the scheduler itself adapts a
process when a node drops off) is to take snapshots of every node's
memory for that process and restore them someplace else. That is
obviously insane. Perhaps I will limit myself to making a new master
pop up in place of one that dies, and simply killing all the processes
and restarting them.
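
To make that last idea concrete, here is a rough sketch of the "kill everything and restart" approach. It is only an illustration under some assumptions: it supervises plain OS processes rather than real MPI ranks, and the names run_job, max_restarts and poll_interval are made up for the example, not taken from any scheduler or MPI implementation.

```python
import subprocess
import sys
import time

def run_job(commands, max_restarts=3, poll_interval=0.05):
    """Run every command as a child process.

    If any child exits with a non-zero status, kill the remaining
    children and start the whole set again. Returns the number of
    restarts that were needed; raises RuntimeError once max_restarts
    is exhausted. (Hypothetical helper, for illustration only.)
    """
    for attempt in range(max_restarts + 1):
        procs = [subprocess.Popen(cmd) for cmd in commands]
        failed = False
        while True:
            codes = [p.poll() for p in procs]
            if any(c not in (None, 0) for c in codes):
                failed = True   # at least one process died badly
                break
            if all(c == 0 for c in codes):
                break           # everything finished cleanly
            time.sleep(poll_interval)
        if not failed:
            return attempt
        # No partial recovery: tear down whatever is still running.
        for p in procs:
            if p.poll() is None:
                p.kill()
        for p in procs:
            p.wait()
    raise RuntimeError("job failed after %d restarts" % max_restarts)
```

Something like run_job([["./worker", "0"], ["./worker", "1"]]) would then supervise two copies of a (hypothetical) ./worker binary. No state is saved here, so every restart begins from scratch; that is the price of avoiding checkpoints.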

Thanks for everyone's input! That paper was particularly interesting.

Bill