Jeff Squyres wrote:
> A few notes about the MPI side of things:
>
> - It's still quite an open question how to do fault tolerance in an
> MPI application properly. Solutions range from checkpoint / restart
> (in LAM) to fully user-controlled (e.g., FT-MPI). What's the Right
> solution? It's hard to say, and it's also likely to be an
> application-specific answer. You might want to have a look at FT-MPI.
There is another interesting project called MPICH-V that implements
checkpoint / restart based on various checkpointing and message-logging
protocols. More details at http://www.lri.fr/~gk/MPICH-V/
In my opinion, before thinking about fault tolerance, it's important to
know what kind of faults we are addressing and at what level. The
choices are library level (LAM), network level (maybe part of the
library), runtime level, or application level. As Jeff mentioned, FT-MPI
is user-controlled, so it is left to the application developers to
incorporate fault tolerance in their code. The fault tolerance is thus
within the application itself, supported to some extent by the library.
On the other hand, some other implementations incorporate support for
tolerating failures at lower levels, such as the runtime or library,
which may affect performance even when no failures occur.
Also, fault tolerance depends on the nature of the application itself.
For some applications, if a process dies it may not affect the overall
execution at all: the master can simply allocate the same task to some
other process, or recover in some other way. For other applications, a
single process failure may be critical.
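The master / worker recovery pattern above can be sketched roughly as
follows. This is an illustrative simulation, not MPI code (the worker
names, the task list, and the failure set are all made up for the
example); in a real MPI application the master would detect a dead
worker via a timeout or an error from the communication layer.

```python
def run_master(tasks, workers, fails_after_assignment):
    """Assign tasks to workers; if a worker dies after being given a
    task, put that task back on the queue for some other process."""
    results = {}
    pending = list(tasks)
    dead = set()
    while pending:
        # round-robin assignment over workers believed to be alive
        assignments = {}
        for w in workers:
            if w not in dead and pending:
                assignments[w] = pending.pop(0)
        # collect results; a worker that "died" loses its task,
        # so the master reallocates the same task to someone else
        for w, task in assignments.items():
            if w in fails_after_assignment:
                dead.add(w)
                pending.append(task)
            else:
                results[task] = w
    return results

res = run_master(["t1", "t2", "t3"], ["w1", "w2"], {"w2"})
print(res)  # every task completes despite w2's failure
```

The point is simply that for this class of application the failure is
absorbed entirely by the master's bookkeeping; nothing below the
application level needs to know about it.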
Fault tolerance may not be the most important thing at the moment, but
it may become very critical in the future when we envisage
100,000-processor clusters. In such machines, failures would occur more
often than the application could be checkpointed / restarted. This last
paragraph is motivated by the paper at
http://www.csm.ornl.gov/~geist/Lyon2002-geist.pdf
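A back-of-envelope calculation shows why (the per-node MTBF and the
checkpoint time below are assumed illustrative numbers, not figures
from the paper):

```python
# If each node has a mean time between failures (MTBF) of 1 year,
# a system of N independent nodes fails on average every MTBF / N.
node_mtbf_hours = 365 * 24          # assumed per-node MTBF: 1 year
nodes = 100_000
system_mtbf_minutes = node_mtbf_hours * 60 / nodes
print(f"system MTBF ~= {system_mtbf_minutes:.1f} minutes")
# -> roughly 5 minutes between failures somewhere in the machine.
# If taking a global checkpoint takes longer than that, the
# application can never make progress between failures.
```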
Just a few thoughts.
--Aamir
>
> - Open MPI will eventually contain both checkpoint/restart and the
> user-manual stuff in FT-MPI. This is not likely until mid-next year
> at the earliest, however.
>
>
> On Oct 15, 2004, at 2:35 PM, William Bierman wrote:
>
>>> The kind of fail-over system you're describing is really one where
>>> every
>>> node can be a master, which basically means that they're equivalent in
>>> capability. The "master" is nominated by an election based on some
>>> sort
>>> of rules (i.e. first to boot, fastest network, whatever). You could
>>> possibly do this with LAM by using shell scripts to configure the
>>> election winner as the master, say by NFS, and from there configure the
>>> slaves. The scripts would be pretty complex though. As Bogdan says,
>>> you'll need a hot-swap master as well in case the master comes
>>> down. If
>>> a slave dies, you'd have to bring down the LAM universe and bring it
>>> back up minus the dead node. More scripting. It's doable, but it's a
>>> lot of work.
>>
>>
>> Yes, this is exactly my goal. And yes, it is quite a lot of work.
>> That is why I wanted to avoid re-inventing the wheel as much as
>> possible. Doing it all from scratch does have some benefits though.
>>
>> Does anyone know what work has been done to show whether or not LAM
>> (or any MPI suite) can adapt a running process without the process
>> being aware of it, should a node drop out? This should theoretically
>> be possible, especially if you're willing to use some sort of a
>> kludge.. say just restart the process doubling up on a processor, and
>> re-assign the dropped out node's 'id'. There are still of course
>> issues with that (the memory that was lost, etc.) .. but I am by no
>> means an expert in this field.
>>
>> Thanks for the reply!
>>
>> Bill
>> _______________________________________________
>> This list is archived at http://www.lam-mpi.org/MailArchives/lam/
>>
>