(Ahem - Sorry - premature mail-queue injection)
The short question is:
Can I, as a local MPI process, access any LAM status information about the
health of any other nodes so I can die gracefully myself?
The background to the question is as follows:
I have been wondering if it is possible for an MPI process to become aware of
a change in status of any of the LAM nodes. This is really in the extreme
case when some node is killed with no warning (so no signals can be raised to
any of the others). At best, any LAM commands hang rather than timing out in
this situation, which is nearly as bad for automatic detection as having none
at all. Even worse, a simple MPI job will also hang: mpirun does not return
when a node dies, only when one of the MPI processes dies, and then only if
the appropriate option is set.
What I would like very much is for an executive process to notice, via LAM
itself, when the LAM nodeset has broken, and then issue
any appropriate cleaning and rebooting commands to reinstate the
multi-computer.
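
To make the idea concrete, here is a minimal sketch of such an executive's
health check, done from outside LAM rather than via LAM itself. It runs a
status command under a timeout and treats a hang as a broken nodeset. The
name `probe`, the 5-second default and the choice of command are assumptions
for illustration only; nothing here is a LAM API.

```python
import subprocess


def probe(cmd, timeout_s=5.0):
    """Run a status command and classify the outcome.

    Returns "ok" on exit status 0, "failed" on a non-zero status or if the
    command cannot be started, and "hung" if it does not finish in time.
    `cmd` is an argument list chosen by the caller (hypothetical here).
    """
    try:
        result = subprocess.run(
            cmd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,  # the child is killed if this expires
        )
    except subprocess.TimeoutExpired:
        # A hang is the symptom described above for a dead node.
        return "hung"
    except OSError:
        # The command does not exist or is not executable.
        return "failed"
    return "ok" if result.returncode == 0 else "failed"
```

An executive could call `probe` periodically with whichever LAM status
command it trusts (for instance `probe(["lamnodes"])`, assuming that command
shows the hang) and start its cleaning and rebooting commands on anything
other than "ok". This only works around the missing timeout; it does not
answer whether LAM exposes node health directly.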
I've had success trapping signals, but none are raised in sudden-death
situations. In fact there doesn't seem to be anything that
relates to the 'heartbeat' mentioned in the brief descriptions of the
fault-tolerant mode that I have found - even in the context of the dynamic
spawning MPI-2 commands.
Again, sorry for the double post.
j.