
LAM/MPI General User's Mailing List Archives


From: Brian W. Barrett (brbarret_at_[hidden])
Date: 2003-05-31 10:39:12


If you are running LAM in fault-tolerant mode, the LAM "signal"
LAM_SIGSHRINK should be sent to all remaining nodes when a node fails. I
think this is what you want. See the lam_ksignal(2) man page for more
information. You may have to build LAM with the --with-trillium configure
flag in order to install all the proper header files and man pages.

Hope this helps,

Brian

On Tue, 27 May 2003, Jim Procter wrote:

>
> (Ahem - Sorry - premature mail-queue injection)
>
> The short question is:
> Can I, as a local MPI process, access any LAM status information about the
> health of the other nodes so that I can die gracefully myself?
>
> The background to the question is as follows:
> I have been wondering whether it is possible for an MPI process to become
> aware of a change in status of any of the LAM nodes. This really concerns the
> extreme case where a node is killed with no warning (so no signals can be
> raised on any of the others). At best, any LAM commands hang rather than
> timing out in this situation, which is nearly as bad for automatic detection
> as having none at all. Even worse, a simple MPI job will also hang: mpirun
> does not return when a node dies, only when one of the MPI processes dies,
> and then only if the appropriate option is set.
>
> What I would like very much is for an executive process to notice, via LAM
> itself, when the LAM nodeset has broken, and then issue the appropriate
> cleaning and rebooting commands to reinstate the multi-computer.
>
> I've had success trapping signals, but they aren't raised during
> sudden-death situations. In fact, there doesn't seem to be anything that
> relates to the 'heartbeat' mentioned in the brief descriptions of the
> fault-tolerant mode that I have found, even in the context of the MPI-2
> dynamic spawning commands.
>
> Again, sorry for the double post.
> j.
>
> _______________________________________________
> This list is archived at http://www.lam-mpi.org/MailArchives/lam/
>
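
On the point in the quoted message that LAM commands hang rather than time
out when a node has died: one possible workaround, independent of LAM's
fault-tolerant mode, is for the executive process to run its probe command
under a timeout and treat a hang as a failure. What follows is only a minimal
sketch under that assumption. The function name probe and the status strings
are hypothetical, and the command passed in is a placeholder for whichever
LAM command you already use to check the nodeset; nothing here is part of
LAM itself.

```python
import subprocess
import sys


def probe(cmd, timeout_s):
    """Run one probe command and classify the result.

    Returns "ok" if the command exits with status 0, "failed" if it exits
    non-zero or cannot be started, and "hung" if it is still running after
    timeout_s seconds (subprocess.run kills the child in that case).
    """
    try:
        result = subprocess.run(
            cmd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return "hung"
    except OSError:
        # Command not found or not executable.
        return "failed"
    return "ok" if result.returncode == 0 else "failed"


if __name__ == "__main__":
    # Placeholder usage: pass the probe command on the command line.
    status = probe(sys.argv[1:], timeout_s=10) if sys.argv[1:] else "failed"
    print(status)
```

An executive process could call probe periodically and, on "hung" or
"failed", go on to issue its cleaning and rebooting commands. The timeout
value is a judgment call: too short and a slow but healthy nodeset is
reported as hung.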

-- 
  Brian Barrett
  LAM/MPI developer and all around nice guy
  Have a LAM/MPI day: http://www.lam-mpi.org/