I've been asked to update some parallel processing functions so that they can continue if a slave process dies. There is one overall controller process and two types of slaves. In general, the controller and the remaining slaves could continue processing, but currently if a slave process dies (e.g. from a segmentation fault, or from running out of space in a filesystem), all processes are killed, so there is no way to continue. From reading the documentation, it appears that I need to handle the signals that may kill the slave process and, if one is received, send a message back to the main process saying that this slave has died, then call MPI_Finalize() before exiting the slave process. Is this correct? Is there any other way to deal with this from the overall controller process alone, without having to change the slave processes to handle signals?
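To make the question concrete, here is a rough sketch of the slave-side pattern I have in mind. It is only an illustration under assumptions: the names `install_handlers` and `slave_work_loop` are made up for this post, the MPI calls are left as comments so the fragment compiles without MPI, and it only covers signals such as SIGTERM. As far as I understand, a segmentation fault could not be handled by simply setting a flag and returning from the handler.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <stddef.h>

/* Set by the handler, polled by the work loop. */
static volatile sig_atomic_t got_fatal_signal = 0;

/* Handler does only async-signal-safe work: it records the signal number. */
static void on_fatal_signal(int signo)
{
    got_fatal_signal = signo;
}

/* Install the handler for SIGTERM and SIGINT. Returns 0 on success, -1 on error. */
static int install_handlers(void)
{
    struct sigaction sa;
    sa.sa_handler = on_fatal_signal;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0;
    if (sigaction(SIGTERM, &sa, NULL) != 0) return -1;
    if (sigaction(SIGINT, &sa, NULL) != 0) return -1;
    return 0;
}

/* Hypothetical slave loop. Returns the signal that stopped it, or 0 if it
 * processed all max_items without being interrupted. */
static int slave_work_loop(int max_items)
{
    for (int i = 0; i < max_items; i++) {
        if (got_fatal_signal) {
            /* Outside the handler, this is where the slave would send an
             * "I am going away" message to the controller (MPI_Send) and
             * then call MPI_Finalize() before exiting. */
            return got_fatal_signal;
        }
        /* ... process one work item here ... */
    }
    return 0;
}
```

The slave would call `install_handlers()` once after MPI_Init() and then run `slave_work_loop()`. The flag is checked in the loop rather than calling MPI from inside the handler, since I assume MPI calls are not safe to make from a signal handler.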
Brion