I'm using LAM 6.5.9 to perform some parallel work on several networked machines
in our office. Some of these machines belong to other users, who need to
boot back into Windows every now and then. I'm trying to make my code aware
of this so that when a node goes down it can take other action.
At present I have the slave processes catch SIGTERM (issued during a shutdown)
and send the master process a message stating that they are going down. This
works when I send a SIGTERM to a slave process myself. However, during an actual
shutdown the master never gets notified. I'm assuming this is because all the
processes receive a SIGTERM during a shutdown, including the LAM daemon, so
the message never makes it out.
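For illustration, here is a minimal sketch in C of the slave-side pattern described above. It is not the actual application code. `notify_master` is a hypothetical stand-in for the real message to the master (for example an MPI_Send to rank 0), and the names `going_down`, `install_sigterm_handler` and `slave_loop` are made up for the example. The handler only sets a flag and the message is sent from the work loop, since library calls such as message sends are generally not safe to make from inside a signal handler.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <stdio.h>

/* Set by the signal handler, polled by the work loop. */
volatile sig_atomic_t going_down = 0;

/* Handler: only sets a flag, which is async-signal-safe. */
void on_sigterm(int signo)
{
    (void)signo;
    going_down = 1;
}

/* Install the SIGTERM handler; returns 0 on success, -1 on error. */
int install_sigterm_handler(void)
{
    struct sigaction sa;
    sa.sa_handler = on_sigterm;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0;
    return sigaction(SIGTERM, &sa, NULL);
}

/* Hypothetical stand-in for the real notification
 * (assumption: in the real code this would be a message to the master). */
void notify_master(int rank)
{
    printf("slave %d: going down\n", rank);
}

/* Do up to max_steps units of work. Once SIGTERM has been seen,
 * notify the master and stop early. Returns the steps completed. */
int slave_loop(int rank, int max_steps)
{
    int step;
    assert(max_steps >= 0);
    for (step = 0; step < max_steps; step++) {
        if (going_down) {
            notify_master(rank);
            break;
        }
        /* ... one unit of real work would go here ... */
    }
    return step;
}
```

A slave would call `install_sigterm_handler()` once at startup and then run `slave_loop()`. The sketch only covers the case that already works (a SIGTERM delivered to the slave alone); it does nothing about the daemon being killed at the same time.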
Is there some way that the LAM daemon can be set up so that, if it is about to
be killed, a signal is sent to all the LAM processes on its node so that they
can perform whatever actions they need to?
Also, lamboot has a -x option for fault tolerance. The man page states:
When a node's heart beats stop, it is declared ``dead'' and all LAM
nodes (and processes) are notified. This allows users to write fault
tolerant applications that can degrade gracefully, or fully recover by
replacing the defunct node with another (see lamgrow(1)). Since this
mode introduces a performance penalty, it is not activated by default.
How are all the LAM nodes/processes notified? I haven't been able to locate
that information yet. Also, what sort of performance penalty is introduced?
Is it just a little more network traffic, which might affect
message-intensive applications but have little impact on applications that
do little message passing?