Gropp and Lusk wrote a paper called "Fault Tolerance in MPI
Programs"[1] that suggested an intercommunicator-based approach to
building a fault-tolerant master-slave style application. I've tried
to do something similar in LAM and have failed utterly. Has anyone
succeeded? And if so, do they have any examples that one could look
at?
The authors appear to have built their example using MPICH. Is MPICH
a more congenial environment for this class of applications?
Thanks in advance,
Yaron Minsky
[1] http://www-unix.mcs.anl.gov/~gropp/bib/papers/2002/mpi-fault.pdf