LAM/MPI General User's Mailing List Archives

From: Brian W. Barrett (bbarrett_at_[hidden])
Date: 2001-02-27 09:17:07


On Sun, 25 Feb 2001, Ruben Valdez Escobedo wrote:

> I'm trying to send signal SIGUDIE to a certain process, and the
> process dies and returns some value, causing all processes on
> the cluster to die. How can I send a signal to a process and achieve
> its death in the same way that tkill on the machines running the
> process does?

As you noted, sending SIGUDIE to a LAM process will cause it to die.
The local lamd notices that the process died and tells all the other
lamds, which then kill their processes in that job and clean up after the
mess. So you are seeing the expected behavior for how the lamds are set up.

The quickest way that I can think of to achieve what you want is to
deliver a signal other than SIGUDIE and have a LAM signal handler
installed to catch it. When the signal arrives, fork and exec tkill to
kill the lamd, then kill yourself. That's about the best I can think of.

> Where can I find more information about the fault tolerance
> characteristics of LAM?

Right now, there is next to nothing in terms of documentation, only the
small bits and pieces on this mailing list and the code. Probably the
best places to start code diving are /otb/sys/dli/ and /otb/sys/dlo/.

Hope that helps,

Brian

--
 Brian Barrett
 http://www.nd.edu/~bbarrett
 University of Notre Dame Class of 2001
 Department of Computer Science and Engineering
_______________________________________________
This list is archived at http://www.lam-mpi.org/MailArchives/lam/