Brian, thank you for the feedback. For most of our customers, all of our processes are started using LSF or LoadLeveler, so I am not sure that dynamic processes are applicable here. Are you saying that this cannot work with static processes, or only that the example happens to use dynamic processes?
Brion
Brion Keller | Sr. Architect
P: 607.762.1347 M: 607.427.1334 F: 607.762.1301 www.cadence.com
-----Original Message-----
From: lam-bounces_at_[hidden] [mailto:lam-bounces_at_[hidden]] On Behalf Of Brian W. Barrett
Sent: Tuesday, August 24, 2010 2:44 PM
To: General LAM/MPI mailing list
Subject: Re: LAM: How to continue when a slave process dies
On Tue, 24 Aug 2010, Brion Keller wrote:
> I've been asked to update some parallel processing functions to allow
> them to continue if a slave process dies. There is one overall controller
> process and two types of slaves. Generally, the overall controller and
> remaining slave could continue processing, but currently if a slave
> process dies (e.g. a segmentation fault, or runs out of space in a
> filesystem), all processes are killed, so there is no means to continue
> on. In reading the documentation, it appears that I need to handle
> signals that may kill the slave process and if they are received, send a
> message back to the main process that this slave has died and then call
> MPI_Finalize() before exiting the slave process. Is this correct? Is
> there any other way to deal with this just from the overall controller
> process without having to change the slave processes to handle signals?
This is largely impossible to do portably in MPI, as each MPI implementation has a
slightly different set of rules on what happens in the presence of
failures. Using MPI's dynamic processes, it's possible to create an
application which can survive worker processes dying with LAM/MPI. In
fact, there's an example in the LAM tarball that will do exactly that. It
makes a pretty good template as to how to structure the communication so
that workers can die without taking down the controller. While the
application itself will run on any MPI-2 compliant MPI implementation,
its failure survival is slightly less portable.
Good luck,
Brian
--
Brian Barrett
LAM/MPI Developer
Make today a LAM/MPI day!
_______________________________________________
This list is archived at http://www.lam-mpi.org/MailArchives/lam/