I've observed the following behavior when prototyping a master-slave
program in LAM. When a slave failure is simulated (say, by killing
the lamd process on a slave node), the master catches the ensuing
exception and runs the exception-handling code when communication with
that node is attempted (so far, so good). But the program then hangs
when it reaches MPI::Finalize or MPI::Abort. Adding the -nw flag to
mpirun simply causes the executable to hang quietly in the background.
My question, then: is there (or will there be) a reliable way to quit
an MPI program with an error code, either from within the code or via
a combination of (LAM) compile-time or runtime options, once a node
has been lost?
Thanks!
-Maciek