Hello, I have a four-node cluster and an MPI program that distributes work in a master/slave fashion. I want the master to be able to detect when a slave stops responding (because of a system crash or shutdown) and redistribute that slave's work so that the program can finish and exit cleanly.
I've tried creating an intercommunicator for each slave-master relationship, and I've tried using the lamboot fault tolerance option (-x). My question is: with either of these methods, how do I have the master recognise when a fault has occurred? Is there a command I can use to retrieve "heartbeat" information?
I've seen references to ksignal but am unsure what it is or what it does. I can't find a manual page for it.
Thanks for your help!