Hi,
We know that MPI-2 does not fully support fault tolerance. But since the
FAULT program in the lam-7.1.1 examples shows a way to support fault
tolerance, even partially, I think a master/slave architecture is a good
choice for fault tolerance.
So I reused the main architecture of the FAULT program and changed almost
nothing in the setup code for the master and the slaves. The README says
that "executing the 'tkill' program on a slave node" shows that LAM/MPI
can continue. Yes, it does. It works well.
But if I execute tkill on two nodes, the master program hangs at the end
of the job queue until I press CTRL+C to stop it. In another experiment I
unplugged the network cable of a slave node. The same thing happened, but
the master program continued once I plugged the cable back in.
I debugged the code and found that it is MPI_Waitany that hangs. So I
really want to know why this happens and how to resolve it, so that the
program can survive "tkill" being executed on more than one node.
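
One workaround I am considering, in case it helps the discussion, is to
poll with a deadline instead of blocking in MPI_Waitany. Below is only a
minimal sketch of that idea, written without the MPI headers so that it
compiles on its own. The names poll_fn, wait_any_with_deadline, fake_poll
and dead_poll are hypothetical and not part of the FAULT example. In the
real master I assume the callback would wrap MPI_Testany, and I have not
verified that this is enough for LAM to let the master carry on.

```c
#include <time.h>

/* Hypothetical callback type: returns 1 and sets *index when some
 * outstanding request has completed, 0 when nothing is ready yet.
 * Assumption: in the real master this would wrap
 * MPI_Testany(count, requests, index, &flag, &status). */
typedef int (*poll_fn)(void *ctx, int *index);

/* Busy-poll until a request completes or timeout_s seconds have passed.
 * Returns 1 with *index set on completion, 0 on timeout.
 * The poll is always tried at least once, even with timeout_s == 0. */
int wait_any_with_deadline(poll_fn poll, void *ctx, int timeout_s,
                           int *index)
{
    time_t start = time(NULL);

    do {
        if (poll(ctx, index))
            return 1;   /* some request finished */
    } while (difftime(time(NULL), start) < (double)timeout_s);

    return 0;           /* nothing finished in time: treat slaves as lost */
}

/* Stand-in for a healthy slave: "completes" request 2 on the third poll. */
struct fake_ctx { int calls; };

int fake_poll(void *ctx, int *index)
{
    struct fake_ctx *f = ctx;

    if (++f->calls >= 3) {
        *index = 2;
        return 1;
    }
    return 0;
}

/* Stand-in for a killed slave: its request never completes. */
int dead_poll(void *ctx, int *index)
{
    (void)ctx;
    (void)index;
    return 0;
}
```

With fake_poll the call returns 1 on the third poll; with dead_poll it
returns 0 once the timeout expires, and at that point the master could
requeue the jobs of the slaves that did not answer instead of hanging.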
Thank you all!
Jason