
LAM/MPI General User's Mailing List Archives


From: zhang zhenxin (zhangzhenxin_at_[hidden])
Date: 2005-07-04 01:21:12


Hi,
   We know that MPI-2 does not fully support fault tolerance. But since the
FAULT program in the lam-7.1.1 examples shows a way to support fault
tolerance, even partially, I think a master/slave architecture is a good
choice for fault tolerance.
   So I reused the main architecture of the FAULT program and changed almost
nothing in the setup code for the master and the slaves. The README says
that "executing the 'tkill' program on a slave node" shows that LAM/MPI can
continue. Yes, it does. It works well.
   But if I execute tkill on two nodes, the master program hangs at the end
of the job queue until I press CTRL+C to stop it. In another experiment, I
unplugged the network cable of a slave node. The same thing happened, but
the master program continued once I plugged the cable back in.
   I debugged the code and found that it is MPI_Waitany that hangs. So I
really want to know why this happens and how to resolve it, so that "tkill"
can be executed on more than one node.
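
   To make the question concrete, here is a minimal sketch of a master loop
that would not block forever, assuming the blocking MPI_Waitany is replaced by
a nonblocking check such as MPI_Testany plus a give-up limit. This is not the
FAULT program's code. The names slave_t, poll_fn, poll_simulated, run_master
and max_idle_polls are hypothetical, and the poll callback only simulates the
completion check, so the sketch runs without MPI. Whether LAM/MPI behaves this
way after a second tkill is exactly what is unknown here.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_SLAVES 16

/* Hypothetical per-slave bookkeeping; not taken from the FAULT example. */
typedef struct {
    bool alive;       /* false once the master has given up on this slave */
    int  job;         /* job currently assigned, or -1 if idle */
    int  idle_polls;  /* consecutive polls without a reply while busy */
} slave_t;

/* Stand-in for a nonblocking completion check such as MPI_Testany:
   returns the index of a slave whose reply arrived, or -1 if none did. */
typedef int (*poll_fn)(const slave_t *slaves, int n, void *ctx);

/* Simulated poll: ctx is a bool array, true meaning "this slave never replies". */
int poll_simulated(const slave_t *slaves, int n, void *ctx)
{
    const bool *failed = ctx;
    for (int i = 0; i < n; i++)
        if (slaves[i].alive && slaves[i].job >= 0 && !failed[i])
            return i;
    return -1;
}

/* Runs the job queue and returns the number of jobs completed. A busy slave
   that stays silent for more than max_idle_polls empty polls is presumed
   dead and its job is put back on the queue. */
int run_master(int njobs, int nslaves, poll_fn poll, void *ctx,
               int max_idle_polls)
{
    slave_t slaves[MAX_SLAVES];
    int requeue[MAX_SLAVES];  /* each slave holds at most one job */
    int nrequeue = 0, next_job = 0, done = 0, nalive = nslaves;

    assert(nslaves > 0 && nslaves <= MAX_SLAVES);
    for (int i = 0; i < nslaves; i++)
        slaves[i] = (slave_t){ .alive = true, .job = -1, .idle_polls = 0 };

    while (done < njobs && nalive > 0) {
        /* Hand a job to every idle, live slave; recovered jobs go first. */
        for (int i = 0; i < nslaves; i++) {
            if (!slaves[i].alive || slaves[i].job >= 0)
                continue;
            if (nrequeue > 0)
                slaves[i].job = requeue[--nrequeue];
            else if (next_job < njobs)
                slaves[i].job = next_job++;
            slaves[i].idle_polls = 0;
        }

        int idx = poll(slaves, nslaves, ctx);
        if (idx >= 0) {           /* one reply arrived */
            slaves[idx].job = -1;
            slaves[idx].idle_polls = 0;
            done++;
            continue;
        }

        /* Nothing arrived: age every busy slave and drop the silent ones. */
        for (int i = 0; i < nslaves; i++) {
            if (!slaves[i].alive || slaves[i].job < 0)
                continue;
            if (++slaves[i].idle_polls > max_idle_polls) {
                requeue[nrequeue++] = slaves[i].job;
                slaves[i].job = -1;
                slaves[i].alive = false;
                nalive--;
            }
        }
    }
    return done;
}
```

   With four slaves of which two never reply, run_master(10, 4,
poll_simulated, failed, 3) finishes all ten jobs on the two survivors instead
of waiting at the end of the queue. A real master would presumably use a
wall-clock limit (for example via MPI_Wtime) rather than a poll count, and
would need to cancel or free the requests of the slaves it gives up on.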

Thank you all!

Jason
