LAM/MPI General User's Mailing List Archives

From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2006-10-13 16:41:33


LAM is, at best, only partially able to handle the death of an MPI
process. Specifically, I wouldn't recommend trying to write a
fault-tolerant MPI application using LAM/MPI that could withstand the
death of a process in MPI_COMM_WORLD.

Keep in mind that MPI [quite intentionally] does not specify what
happens when a process dies, so what to do is entirely up to the
implementation. Most MPIs, LAM/MPI included, simply kill the rest
of the job. FT-MPI, from the University of Tennessee, lets you do
some interesting things, but you have to write code specifically to
their API.

Work is ongoing in Open MPI to be able to handle these kinds of
errors. The first step is adding checkpoint/restart capabilities to
Open MPI (the hardest part of which is all the infrastructure needed
to make that possible), and then we'll do more interesting things
after that (including FT-MPI-like things).
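
To give a flavor of what checkpoint/restart means, here is a minimal
application-level sketch in C. It is not how Open MPI does it (the work
described above is infrastructure inside the MPI implementation); it
only shows the basic idea of saving state periodically and reloading it
after a failure. struct sim_state, save_checkpoint and load_checkpoint
are hypothetical names, and POSIX rename() semantics are assumed.

```c
#include <stdio.h>

/* Hypothetical application state; a real code would carry much more. */
struct sim_state {
    long iteration;
    double value;
};

/* Write the state to a temporary file, then rename it into place, so
 * an interrupted write does not clobber the previous checkpoint.
 * Returns 0 on success, -1 on failure. */
int save_checkpoint(const char *path, const struct sim_state *s)
{
    char tmp[4096];
    int n = snprintf(tmp, sizeof tmp, "%s.tmp", path);
    if (n < 0 || (size_t)n >= sizeof tmp)
        return -1;

    FILE *f = fopen(tmp, "wb");
    if (f == NULL)
        return -1;

    int ok = fwrite(s, sizeof *s, 1, f) == 1;
    if (fclose(f) != 0)
        ok = 0;

    if (!ok || rename(tmp, path) != 0) {
        remove(tmp);
        return -1;
    }
    return 0;
}

/* Read a previously saved state back from `path`.
 * Returns 0 on success, -1 if the file is missing or too short. */
int load_checkpoint(const char *path, struct sim_state *s)
{
    FILE *f = fopen(path, "rb");
    if (f == NULL)
        return -1;

    int ok = fread(s, sizeof *s, 1, f) == 1;
    fclose(f);
    return ok ? 0 : -1;
}
```

A job would call save_checkpoint every so many iterations and, on
startup, try load_checkpoint first to resume from the last saved
iteration. The raw struct dump is only readable on the same
architecture and build.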

On Oct 11, 2006, at 8:40 AM, Alexander L. Belikoff wrote:

> Jeff Squyres wrote:
>>
>> These error messages mean that processes 2-7 tried to do a receive
>> from someone who they later found out was dead, so they aborted.
>>
> What would be a "standard" (that is, portable) way for one of the
> peer processes to get notified about such a death? For example, if
> one of the processes dies, I'd like the rank 0 process to know about
> it so it can change strategy.
>
> Cheers,
> -- Sasha

-- 
Jeff Squyres
Server Virtualization Business Unit
Cisco Systems