Does it look like they've exploited MPICH-specific features? I ask
because fault tolerance is a capability delivered through
architectural design, not by the platform. As an example, check out the
release notes for MySQL 4.11 on clustering databases. It delivers
fault tolerance and redundancy by duplicating sections of the database
on separate processes.

You also need to describe what fault tolerance means from your
perspective. Which faults do you need to tolerate?

Fault tolerance typically requires duplication or reconstruction of data
(that's what humans do). Duplication is straightforward, but rebuilding
from the remote storage carries a communication load. Reconstruction
is based on logging changes and replaying them to rebuild the data when
required (e.g., journalling file systems). What do you need to do?
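
To make the two options concrete, here is a minimal sketch in plain Python
rather than MPI. The names ReplicatedStore, JournaledStore and
simulate_crash are hypothetical, invented for illustration, and the
"remote storage" is only a second dictionary in the same process. It is
meant to show the shape of each approach, not how LAM or MPICH do it.

```python
# Illustrative sketch only: these classes are hypothetical and are not
# part of MPI, LAM or MPICH.

class ReplicatedStore:
    """Duplication: every write goes to a primary and a remote copy."""

    def __init__(self):
        self.primary = {}
        self.replica = {}      # stands in for storage on another process
        self.bytes_sent = 0    # rough measure of the communication load

    def write(self, key, value):
        self.primary[key] = value
        self.replica[key] = value
        self.bytes_sent += len(repr((key, value)))

    def recover(self):
        # After losing the primary, copy everything back from the replica.
        self.primary = dict(self.replica)
        self.bytes_sent += len(repr(self.replica))


class JournaledStore:
    """Reconstruction: log each change, replay the log onto a checkpoint."""

    def __init__(self):
        self.state = {}
        self.checkpoint = {}
        self.journal = []      # changes made since the last checkpoint

    def write(self, key, value):
        self.journal.append((key, value))   # log first, then apply
        self.state[key] = value

    def take_checkpoint(self):
        self.checkpoint = dict(self.state)
        self.journal.clear()

    def recover(self):
        # Start from the last checkpoint and replay the logged changes.
        state = dict(self.checkpoint)
        for key, value in self.journal:
            state[key] = value
        self.state = state


def simulate_crash(store):
    """Simulate losing the in-memory working copy."""
    if isinstance(store, ReplicatedStore):
        store.primary = {}
    else:
        store.state = {}
```

In this sketch the replicated store pays on every write and again on
recovery, while the journaled store pays only the cost of replaying
whatever was logged since the last checkpoint. Which trade-off suits you
depends on the faults you need to survive.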

Damien

PS Hope you all have had an excellent Christmas and New Year. For the
PC Police, I intend this to be in the spirit rather than the letter of
the salutation, and I throw myself on the mercy of the legal
profession. May the deity of your choice protect me.

Yaron Minsky wrote:
>Gropp and Lusk wrote a paper called "Fault Tolerance in MPI
>Programs"[1] that suggested an intercommunicator-based approach to
>building a fault tolerant worker-slave style application. I've tried
>to do something similar in LAM and have failed utterly. Has anyone
>succeeded? And if so, do they have any examples that one could look
>at?
>
>The authors appear to have built their example using MPICH. Is MPICH
>a more congenial environment for this class of applications?
>
>Thanks in advance,
>Yaron Minsky
>
>[1] http://www-unix.mcs.anl.gov/~gropp/bib/papers/2002/mpi-fault.pdf
>_______________________________________________
>This list is archived at http://www.lam-mpi.org/MailArchives/lam/