Thank you, Brian.
I was debugging an MPI/LAM program with DDD and noticed that
when a lamd tells a process to die, DDD catches a SIGUSR2 and the
program then dies. The question is: if I catch SIGUSR2, will I be able
to ignore it and continue processing?
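
To make the question concrete, here is a minimal sketch of what "catch and continue" means at the POSIX level. It is written in Python only for brevity; a C MPI program would use sigaction() instead. It assumes a POSIX system, and it says nothing about whether LAM's own use of SIGUSR2 tolerates being overridden, which is exactly the open question. The names `received` and `on_sigusr2` are made up for illustration.

```python
import os
import signal

received = []  # signal numbers seen by our handler (illustrative name)


def on_sigusr2(signum, frame):
    # Record the signal instead of letting the default action
    # (process termination) take place.
    received.append(signum)


# Replace the default disposition of SIGUSR2 with our handler.
# To discard the signal entirely, signal.SIG_IGN could be passed instead.
signal.signal(signal.SIGUSR2, on_sigusr2)

# Deliver SIGUSR2 to ourselves, standing in for an external sender.
os.kill(os.getpid(), signal.SIGUSR2)

# Execution reaches this point because the handler returned normally.
still_running = True
```

Under a debugger the signal is intercepted first, so the debugger also has to be told to pass it through to the program rather than stop on it.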
Thank you
"Brian W. Barrett" wrote:
> On Sun, 25 Feb 2001, Ruben Valdez Escobedo wrote:
>
> > I'm trying to send the signal SIGUDIE to a certain process. The
> > process dies and returns some value, causing all processes on
> > the cluster to die. How can I send a signal to a process and achieve
> > its death in the same way that tkill does on the machines running
> > the process?
>
> As you noted, sending a SIGUDIE to a LAM process will cause it to
> die. The lamd notices that a process has died and tells all the lamds,
> which then kill their processes in that job and clean up the mess.
> So you are seeing the expected behavior for how the lamds are set up.
>
> The quickest way that I can think of to achieve what you want is to
> deliver a signal other than SIGUDIE, and have a LAM signal handler
> installed to catch the signal. In the handler, fork and exec tkill to
> kill the lamd, then kill yourself. That's about the best I can think of.
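
The steps described above (catch a different signal, fork and exec tkill, then exit) can be sketched roughly as follows. This is only an illustration of the fork/exec pattern in Python, not LAM's API: the choice of SIGUSR1, the names `CLEANUP_CMD`, `spawn_cleanup`, `on_shutdown_signal` and `install_handler`, the exit code, and the decision to wait for the child before exiting are all assumptions. It also assumes `tkill` is on the PATH of the node.

```python
import os
import signal
import sys

# Assumption: tkill is on PATH, as on a node with LAM installed.
CLEANUP_CMD = ["tkill"]


def spawn_cleanup(cmd):
    """Fork, exec `cmd` in the child, and return the child's exit code."""
    pid = os.fork()
    if pid == 0:
        # Child: replace this process image with the cleanup command.
        try:
            os.execvp(cmd[0], cmd)
        except OSError:
            # exec failed; leave at once so the child never runs parent code.
            os._exit(127)
    # Parent: wait so the cleanup has finished before we go on.
    _, status = os.waitpid(pid, 0)
    return os.WEXITSTATUS(status) if os.WIFEXITED(status) else -1


def on_shutdown_signal(signum, frame):
    # Run the cleanup command first, then terminate this process.
    spawn_cleanup(CLEANUP_CMD)
    sys.exit(1)


def install_handler(signum=signal.SIGUSR1):
    # SIGUSR1 is an arbitrary choice of "a signal other than SIGUDIE".
    signal.signal(signum, on_shutdown_signal)
```

A program would call `install_handler()` once at startup; sending it the chosen signal then triggers the cleanup and the exit.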
>
> > Where can I find more information about the fault tolerance
> > characteristics of LAM?
>
> Right now, there is next to nothing in terms of documentation, only the
> small bits and pieces on this mailing list and the code. Probably the
> best place to start code diving is /otb/sys/dli/ and /otb/sys/dlo/.
>
> Hope that helps,
>
> Brian
>
> --
> Brian Barrett
> http://www.nd.edu/~bbarrett
> University of Notre Dame Class of 2001
> Department of Computer Science and Engineering
>
> _______________________________________________
> This list is archived at http://www.lam-mpi.org/MailArchives/lam/
--
Ruben Jesus Valdez Escobedo - CCSI / MCT
ITESM, Campus Monterrey
CETEC, Torre Norte, 6o. Piso +52 01 (8) 358-1400 ext. 5014