LAM/MPI General User's Mailing List Archives

From: Liu Xuezhao (lxz_at_[hidden])
Date: 2005-12-13 22:14:12


Hi,
  Lately I have been experimenting with fault tolerance using LAM/MPI and BLCR. It usually works well: I can kill the MPI program with Ctrl+C and then restart it with "cr_restart context.xxxx".
  However, I found that an MPI application cannot be restarted if the LAM RTE (Run-Time Environment) has also been restarted. I tested it like this:
  node01:> lamboot
  node01:> mpirun -np 2 xhpl
  node01:> cr_checkpoint xxxx
  node01:> Ctrl+C (to terminate the MPI program)
  node01:> lamhalt
  node01:> lamboot
  node01:> cr_restart context.xxxx
  It printed the following messages:
cri_syscall(CR_OP_RSTRT_REAP): No such file or directory
cri_syscall(CR_OP_RSTRT_REAP): No such file or directory
-----------------------------------------------------------------------------
It seems that [at least] one of the processes that was started with
mpirun did not invoke MPI_INIT before quitting (it is possible that
more than one process did not invoke MPI_INIT -- mpirun was only
notified of the first one, which was on node n0).

mpirun can *only* be used with MPI programs (i.e., programs that
invoke MPI_INIT and MPI_FINALIZE). You can use the "lamexec" program
to run non-MPI programs over the lambooted nodes.
-----------------------------------------------------------------------------
  I traced the source code and found the reason:
  When lamhalt and lamboot are executed, the LAM RTE is rebooted and the "/tmp/lam-xxx_at_node01" directory is re-created. When cr_restart is then executed, BLCR needs to reopen the file "/tmp/lam-xxx_at_node01/lam-crtcp-rank-1.txt"; it cannot find the file, so it fails to restart the MPI program.
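  To make the failure easier to spot before calling cr_restart, a pre-check along these lines may help. This is only a sketch: check_restart_files is a hypothetical helper, not part of LAM/MPI or BLCR, and it only tests whether the files passed to it exist.

```shell
#!/bin/sh
# Hypothetical helper (not part of LAM/MPI or BLCR): report which of the
# files that a restart is expected to reopen are missing.
# Returns 0 if every file exists, 1 if at least one is missing.
check_restart_files() {
    missing=0
    for f in "$@"; do
        if [ ! -e "$f" ]; then
            echo "missing: $f"
            missing=1
        fi
    done
    return "$missing"
}
```

  For the case above, one could run, for example, check_restart_files /tmp/lam-xxx_at_node01/lam-crtcp-rank-1.txt && cr_restart context.xxxx, so that cr_restart is attempted only if the file is still present.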
  My problem is this: in a real fault-tolerance scenario, a node fails and is rebooted. How can I restart the execution without restarting the LAM RTE? Can I add the rebooted node to the remaining LAM universe without running lamhalt and lamboot?
  Thanks.

Liu Xuezhao
2005-12-14