LAM/MPI General User's Mailing List Archives

From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2004-10-15 13:39:18


A few notes about the MPI side of things:

- How to do fault tolerance properly in an MPI application is still
very much an open question. Solutions range from checkpoint/restart (in
LAM) to fully user-controlled recovery (e.g., FT-MPI). What's the Right
solution? It's hard to say, and the answer is likely to be
application-specific. You might want to have a look at FT-MPI.

- Open MPI will eventually contain both checkpoint/restart and the
user-controlled fault tolerance from FT-MPI. This is not likely before
the middle of next year at the earliest, however.
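
To make the checkpoint/restart end of that range concrete, here is a
minimal sketch of the idea done at the application level. It is not how
LAM implements checkpointing; it only shows the pattern of saving
progress periodically and resuming from the last saved state. The names
save_checkpoint, load_checkpoint, run, and fail_at are made up for this
illustration, and the "work" is just a running sum.

```python
import os
import pickle
import tempfile


def save_checkpoint(path, state):
    """Write state to a temp file, then atomically rename it into place."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as f:
        pickle.dump(state, f)
    # os.replace is atomic, so a crash never leaves a half-written checkpoint.
    os.replace(tmp, path)


def load_checkpoint(path, default):
    """Return the last saved state, or default if no checkpoint exists."""
    try:
        with open(path, "rb") as f:
            return pickle.load(f)
    except FileNotFoundError:
        return default


def run(path, total_steps, fail_at=None):
    """Sum 0..total_steps-1, checkpointing after every step.

    fail_at (hypothetical) simulates a node dying before that step runs.
    """
    state = load_checkpoint(path, {"step": 0, "acc": 0})
    while state["step"] < total_steps:
        if fail_at is not None and state["step"] == fail_at:
            raise RuntimeError("simulated node failure")
        state["acc"] += state["step"]
        state["step"] += 1
        save_checkpoint(path, state)
    return state["acc"]
```

If a run dies partway through, calling run() again with the same path
picks up at the last completed step instead of starting over. In a real
MPI job the hard part is getting all processes to checkpoint a
consistent global state, which this sketch does not attempt.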

On Oct 15, 2004, at 2:35 PM, William Bierman wrote:

>> The kind of fail-over system you're describing is really one where
>> every node can be a master, which basically means that they're
>> equivalent in capability. The "master" is nominated by an election
>> based on some sort of rules (e.g., first to boot, fastest network,
>> whatever). You could possibly do this with LAM by using shell scripts
>> to configure the election winner as the master, say by NFS, and from
>> there configure the slaves. The scripts would be pretty complex,
>> though. As Bogdan says, you'll need a hot-swap master as well in case
>> the master goes down. If a slave dies, you'd have to bring down the
>> LAM universe and bring it back up minus the dead node. More
>> scripting. It's doable, but it's a lot of work.
>
> Yes, this is exactly my goal. And yes, it is quite a lot of work.
> That is why I wanted to avoid re-inventing the wheel as much as
> possible. Doing it all from scratch does have some benefits, though.
>
> Does anyone know what work has been done to show whether or not LAM
> (or any MPI suite) can adapt a running process without the process
> being aware of it, should a node drop out? This should theoretically
> be possible, especially if you're willing to use some sort of a
> kludge: say, just restart the process, doubling up on a processor, and
> re-assign the dropped-out node's 'id'. There are of course still
> issues with that (the memory that was lost, etc.), but I am by no
> means an expert in this field.
>
> Thanks for the reply!
>
> Bill
> _______________________________________________
> This list is archived at http://www.lam-mpi.org/MailArchives/lam/
>
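
The scripting discussed in the quoted text (elect a master, bring the
universe back up minus the dead node, double up the orphaned ranks)
could be sketched roughly as follows. This is only an illustration
under assumptions: the node table, the "earliest boot wins" rule, and
the function names elect_master, boot_schema, and double_up are all
hypothetical, and the output is simply a list of hostnames, one per
line, of the kind you would hand to lamboot after a lamhalt.

```python
def elect_master(nodes):
    """Pick a master among live nodes.

    nodes maps hostname -> {"alive": bool, "boot_time": float}.
    Rule (an assumption): earliest boot wins, hostname breaks ties.
    """
    alive = [h for h, info in nodes.items() if info["alive"]]
    if not alive:
        raise RuntimeError("no live nodes to elect from")
    return min(alive, key=lambda h: (nodes[h]["boot_time"], h))


def boot_schema(nodes):
    """Return hostfile text: elected master first, then the other live nodes."""
    master = elect_master(nodes)
    others = sorted(
        h for h, info in nodes.items() if info["alive"] and h != master
    )
    return "\n".join([master] + others) + "\n"


def double_up(rank_to_host, dead_host):
    """Move the dead host's ranks onto the least-loaded surviving hosts."""
    survivors = sorted(set(rank_to_host.values()) - {dead_host})
    if not survivors:
        raise RuntimeError("no surviving hosts")
    # Count how many ranks each survivor already runs.
    load = {h: 0 for h in survivors}
    for host in rank_to_host.values():
        if host != dead_host:
            load[host] += 1
    new_map = dict(rank_to_host)
    for rank in sorted(rank_to_host):
        if rank_to_host[rank] == dead_host:
            target = min(survivors, key=lambda h: (load[h], h))
            new_map[rank] = target
            load[target] += 1
    return new_map
```

Note that this only decides where things should run after a failure.
It does nothing about the state that was in the dead node's memory,
which is exactly the problem checkpointing is meant to address.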

-- 
{+} Jeff Squyres
{+} jsquyres_at_[hidden]
{+} http://www.lam-mpi.org/