On Mon, 03 Jan 2005 22:25:57 -0700, Damien <damien_at_[hidden]> wrote:
> Does it look like they've exploited MPICH-specific features? The reason
> I ask is that fault-tolerance is a capability delivered through
> architectural design, and not the platform. Check out the release notes
> for MySQL 4.11 on clustering databases as an example. It delivers
> fault-tolerance and redundancy by duplicating sections of the database
> on separate processes.
It doesn't look like it, but it's hard to tell, and everything I've
read since suggests that this style of application is hard to build
with LAM.
> You also need to describe what you mean by fault-tolerance from your
> perspective. What are the faults you need to tolerate?
The fault-tolerance guarantees are quite simple, and require no
redundancy. I'm trying to provide fault tolerance for bag-of-jobs
style applications: a server parcels out jobs and receives
replies. You generally want to tolerate failures of the
worker nodes, not of the server. If the server notices that a worker
has died, it just re-parcels out that job, easy-breezy.

Or rather, it would be easy, if LAM could tolerate the crash of a
participating process in a reasonable way. But it doesn't sound like
it can.
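
To make the re-parceling concrete, here is a minimal sketch of the
server-side bookkeeping only, in Python and with no MPI calls. The
class and method names are hypothetical, and worker_died() is assumed
to be called by whatever failure-detection mechanism the MPI
implementation provides, which is exactly the part in question.

```python
from collections import deque


class JobServer:
    """Bookkeeping for a bag-of-jobs server (hypothetical sketch, no MPI)."""

    def __init__(self, jobs):
        self.pending = deque(jobs)  # jobs not yet handed out
        self.in_flight = {}         # worker id -> job it currently holds
        self.results = {}           # job -> reply received for it

    def assign(self, worker):
        """Give the next pending job to an idle worker; None if nothing to give."""
        if worker in self.in_flight or not self.pending:
            return None
        job = self.pending.popleft()
        self.in_flight[worker] = job
        return job

    def complete(self, worker, reply):
        """Record a worker's reply and mark the worker idle again."""
        job = self.in_flight.pop(worker)
        self.results[job] = reply

    def worker_died(self, worker):
        """Re-parcel: put the dead worker's job back at the front of the queue."""
        job = self.in_flight.pop(worker, None)
        if job is not None:
            self.pending.appendleft(job)

    def done(self):
        return not self.pending and not self.in_flight


def run_demo():
    """Three jobs, two workers; w1 dies while holding a job, w2 finishes all."""
    server = JobServer([1, 2, 3])
    server.assign("w1")
    server.assign("w2")
    server.worker_died("w1")        # w1's job goes back into the bag
    server.complete("w2", 2 * 2)    # w2 replies with the square of its job
    while not server.done():
        job = server.assign("w2")
        server.complete("w2", job * job)
    return server.results
```

No redundancy is needed here: the server keeps the only copy of the
job list, and a lost worker costs one repeated job.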
> Fault-tolerance typically requires duplication or reconstruction of data
> (that's what humans do). Duplication is straightforward, and has a
> communication load to rebuild from the remote storage. Reconstruction
> is based on logging changes and being able to rebuild when required (e.g.,
> journalling file systems). What do you need to do?
>
> Damien
>
> PS Hope you all have had an excellent Christmas and New Year. For the
> PC Police, I intend this to be in the spirit rather than the letter of
> the salutation, and I throw myself on the mercy of the legal
> profession. May the deity of your choice protect me.
>
> Yaron Minsky wrote:
>
> >Gropp and Lusk wrote a paper called "Fault Tolerance in MPI
> >Programs"[1] that suggested an intercommunicator-based approach to
> >building a fault-tolerant master-worker style application. I've tried
> >to do something similar in LAM and have failed utterly. Has anyone
> >succeeded? And if so, do they have any examples that one could look
> >at?
> >
> >The authors appear to have built their example using MPICH. Is MPICH
> >a more congenial environment for this class of applications?
> >
> >Thanks in advance,
> >Yaron Minsky
> >
> >[1] http://www-unix.mcs.anl.gov/~gropp/bib/papers/2002/mpi-fault.pdf
> >_______________________________________________
> >This list is archived at http://www.lam-mpi.org/MailArchives/lam/