Hi,
I have recently been experimenting with fault tolerance using LAM/MPI and BLCR. It usually works well: I can kill the MPI program with Ctrl+C and then restart it with "cr_restart context.xxxx".
However, I found that an MPI application cannot be restarted if the LAM RTE (Run-Time Environment) has also been restarted. I tested it like this:
node01:> lamboot
node01:> mpirun -np 2 xhpl
node01:> cr_checkpoint xxxx
node01:> Ctrl+C (to terminate the MPI program)
node01:> lamhalt
node01:> lamboot
node01:> cr_restart context.xxxx
It printed the following messages:
cri_syscall(CR_OP_RSTRT_REAP): No such file or directory
cri_syscall(CR_OP_RSTRT_REAP): No such file or directory
-----------------------------------------------------------------------------
It seems that [at least] one of the processes that was started with
mpirun did not invoke MPI_INIT before quitting (it is possible that
more than one process did not invoke MPI_INIT -- mpirun was only
notified of the first one, which was on node n0).
mpirun can *only* be used with MPI programs (i.e., programs that
invoke MPI_INIT and MPI_FINALIZE). You can use the "lamexec" program
to run non-MPI programs over the lambooted nodes.
-----------------------------------------------------------------------------
I traced the source code and found the reason:
When lamhalt and lamboot are executed, the LAM RTE is rebooted and the "/tmp/lam-xxx_at_node01" directory is re-created. When cr_restart is then executed, BLCR needs to reopen the file "/tmp/lam-xxx_at_node01/lam-crtcp-rank-1.txt", cannot find it, and so fails to restart the MPI program.
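To make this failure easier to spot before calling cr_restart, one could first check whether the files the checkpoint expects are still present. The following is only a minimal sketch: the function name check_restart_files is hypothetical and is not part of LAM/MPI or BLCR, and the list of paths to check has to be supplied by hand (I only know of the one file named above).

```shell
#!/bin/sh
# Hypothetical helper, not part of LAM/MPI or BLCR.
# Takes full paths as arguments and reports every one that does not exist.
check_restart_files() {
    missing=0
    for f in "$@"; do
        if [ ! -e "$f" ]; then
            echo "missing: $f"
            missing=$((missing + 1))
        fi
    done
    # Return success only when every listed file is present.
    [ "$missing" -eq 0 ]
}
```

It could be used as "check_restart_files /tmp/lam-xxx_at_node01/lam-crtcp-rank-1.txt && cr_restart context.xxxx", with the xxx placeholders filled in as in the session above. This only detects the missing file; it does not solve the restart problem.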
My question is this: in a real fault-tolerance scenario, where one node fails and is rebooted, how can I restart the execution without restarting the LAM RTE? Can I add the rebooted node to the remaining LAM universe without running lamhalt and lamboot?
Thanks.
Liu Xuezhao
2005-12-14