Hello,
We are trying to build a manager/worker application which is fault
tolerant towards the death of workers.
From an earlier posting to this list
(http://www.lam-mpi.org/MailArchives/lam/msg04980.php) we took the
advice to use a separate communicator for each worker, as also
exemplified in the "fault" application included with LAM.
The critical section of the code is the MPI_Waitany(..) call, which is
given a list with one pending request per worker communicator and
returns an error if the worker behind one of them fails.
If MPI_Waitany(..) returns with an error and an index w >= 0, then
worker w has died and its entry in the list is set to
MPI_REQUEST_NULL.
This kind of fault tolerance works fine for the death of a single
worker. If two or more workers die,
MPI_Waitany does not return and waits forever. (See program output
below).
Has anybody experienced something similar? Any hints on where to
look for advice?
Your help is appreciated!
- Michael
__
In case only one worker dies, the other slaves finish all the jobs.
Once all workers (except for the dead one) have their communicator
set to MPI_REQUEST_NULL, waitAny detects the death of slave 1:
communicators before waitAny
slaveCommunicator[0] = MPI_REQUEST_NULL
slaveCommunicator[1] = not null
slaveCommunicator[2] = MPI_REQUEST_NULL
slaveCommunicator[3] = MPI_REQUEST_NULL
waiting for slaves to complete requests
waitAny: err: 22 Source: -32766 Tag: -32766 Error: 32022
waitAny: MPI_Error_string: process in remote group is dead
In case two workers die, the others finish, but the death of the two
workers is not detected.
communicators before waitAny:
slaveCommunicator[0] = MPI_REQUEST_NULL
slaveCommunicator[1] = not null
slaveCommunicator[2] = MPI_REQUEST_NULL
slaveCommunicator[3] = not null
MasterInterleaved::run : waiting for slaves to complete requests
- waits forever -