Hi Yaron,
I tried to build a fault-tolerant system using intercommunicators
with LAM a while ago. I based my work on LAM's fault tolerance example.
I failed as well: the system could handle one crashed slave, but a
second dead intercommunicator was not properly detected by MPI_Waitany.
You may want to check the list archive for the discussion that
followed [1].
For the application, we decided that faults are best handled at the
application level: parts of the simulation need to be rerun in case
of an error. Considering the added complexity of fault tolerance code
compared to how rarely a fault occurs, this seems to be
the right way to go for us.
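To illustrate the idea of rerunning parts of a simulation at the application level, here is a minimal sketch in plain C with no MPI calls. It is not the code from the application described above; the names run_with_rerun, chunk_fn, chunk_status, demo_state and flaky_chunk are all hypothetical, and the "failure" is simulated by a chunk that returns an error once.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical result codes for one unit of simulation work. */
enum chunk_status { CHUNK_OK = 0, CHUNK_FAILED = 1 };

/* Hypothetical callback type: runs one chunk and reports success or failure. */
typedef enum chunk_status (*chunk_fn)(int chunk_id, void *ctx);

/*
 * Run chunks 0..n_chunks-1 in order. A chunk that fails is rerun until it
 * succeeds or has been attempted max_attempts times in total.
 * Returns the number of chunks that never succeeded.
 */
int run_with_rerun(chunk_fn run, void *ctx, int n_chunks, int max_attempts)
{
    int unrecovered = 0;

    assert(run != NULL && n_chunks >= 0 && max_attempts >= 1);

    for (int id = 0; id < n_chunks; ++id) {
        enum chunk_status st = CHUNK_FAILED;
        for (int attempt = 0; attempt < max_attempts && st != CHUNK_OK; ++attempt)
            st = run(id, ctx);
        if (st != CHUNK_OK)
            ++unrecovered;
    }
    return unrecovered;
}

/* Demo state (assumption for illustration): counts calls and remembers
 * whether the simulated fault has already happened. */
struct demo_state {
    int calls;
    int failed_once;
};

/* Demo chunk: chunk 2 fails on its first attempt only; all others succeed. */
enum chunk_status flaky_chunk(int chunk_id, void *ctx)
{
    struct demo_state *s = ctx;

    ++s->calls;
    if (chunk_id == 2 && !s->failed_once) {
        s->failed_once = 1;
        return CHUNK_FAILED;
    }
    return CHUNK_OK;
}
```

Calling run_with_rerun(flaky_chunk, &state, 4, 3) with a zero-initialized demo_state reruns only the failed chunk. In a real MPI program the callback would wrap whatever work is handed out to the slaves, and the policy for detecting a failed chunk is left entirely to the application.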
Good luck!
Michael
[1] http://www.lam-mpi.org/MailArchives/lam/msg08380.php
On Monday, 3 January 2005, at 16:15 -0500, Yaron Minsky wrote:
> Gropp and Lusk wrote a paper called "Fault Tolerance in MPI
> Programs"[1] that suggested an intercommunicator-based approach to
> building a fault tolerant worker-slave style application. I've tried
> to do something similar in LAM and have failed utterly. Has anyone
> succeeded? And if so, do they have any examples that one could look
> at?
>
> The authors appear to have built their example using MPICH. Is MPICH
> a more congenial environment for this class of applications?
>
> Thanks in advance,
> Yaron Minsky
>
> [1] http://www-unix.mcs.anl.gov/~gropp/bib/papers/2002/mpi-fault.pdf
> _______________________________________________
> This list is archived at http://www.lam-mpi.org/MailArchives/lam/
>