Hello again,
I am still trying to get my application to work when more than one
slave crashes.
I went back to the example I started from, taken from the LAM
distribution: lam-7.0.4/examples/main/fault. I added "sleep(1)" near
line 88 of slave.c to get a chance to interrupt the slave. Starting 4
slaves and killing two of them did not give the fault-tolerant behavior
described in the README (see below), although I do get that behavior
when only one slave is killed.
Can anybody confirm this behavior?
Does anybody have an idea why the death of the second slave is not
detected? What else could I try?
I started LAM with "lamboot -x" to enable the heartbeat, and ran the
program with "mpirun n0 -ssi rpi tcp ./master" to make sure tcp is used
(which can detect broken connections). I tried both setting the error
handler to MPI_ERRORS_RETURN and writing my own error handler.
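
To be concrete about what I expect once errors are returned instead of
aborting the job, here is a minimal plain-C sketch of the master-side
bookkeeping. It contains no MPI calls: try_send is a hypothetical
stand-in for a send that returns an error code (as I would expect under
MPI_ERRORS_RETURN), and all names are my own illustration, not code
from the fault example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SEND_OK     0  /* stand-in for MPI_SUCCESS */
#define SEND_FAILED 1  /* stand-in for any MPI error code */

/* Hypothetical stand-in for a send to one slave: fails if that slave is dead. */
int try_send(const bool *slave_dead, size_t slave)
{
    return slave_dead[slave] ? SEND_FAILED : SEND_OK;
}

/*
 * Hand out `tasks` work items round-robin over `n` slaves.
 * A slave whose send fails is marked in `lost` and never used again.
 * Returns the number of tasks handed out; this is less than `tasks`
 * only if every slave has been lost.
 */
size_t distribute(const bool *slave_dead, bool *lost, size_t n, size_t tasks)
{
    size_t done = 0, alive = n, next = 0;

    assert(n == 0 || (slave_dead != NULL && lost != NULL));

    for (size_t i = 0; i < n; i++)
        lost[i] = false;

    while (done < tasks && alive > 0) {
        size_t s = next;
        next = (next + 1) % n;

        if (lost[s])
            continue;               /* skip slaves already known to be dead */

        if (try_send(slave_dead, s) == SEND_OK) {
            done++;                 /* work item handed out */
        } else {
            lost[s] = true;         /* drop this slave, keep going */
            alive--;
        }
    }
    return done;
}
```

With 4 slaves of which two are dead, distribute() should still hand out
every task using the two survivors. That is the behavior I expected from
the example, and the one I do not see once the second slave is killed.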
Are there any other options related to fault tolerance that I am not
aware of?
Your help is very much appreciated; I am really stuck!
- Michael
___
From the README of the fault example:
>This application contains some degree of fault tolerance. Slave
>*nodes* can die and the application will continue with less slaves,
>as long as one slave is alive.