Hi,
I am working with a four-node P3 cluster. I have installed LAM 7.1.3 with
BLCR support (blcr-0.5.0) and tried to checkpoint one of the example MPI
applications that ships with LAM. Checkpointing and restarting (using the
blcr module) works without problems on a single node. I then lamboot'ed
with two nodes, and checkpoint/restart still worked from n0 when each
mpirun used only one node (i.e. "mpirun n0 cpi" and "mpirun n1 cpi"). When
I run across two nodes, however, I am not able to restart. The checkpoint
itself works: context.mpirun and two more context files, corresponding to
the cpi processes on the individual nodes, get created. But when I try to
restart, I get some LAM-specific errors. It would be great if someone could
help me get LAM + BLCR checkpoint/restart working on multiple nodes. Here
is the sequence of operations I performed:
$ mpirun n0-1 cpi
$ lamcheckpoint -ssi cr blcr -pid 2674
$ lamrestart -ssi cr blcr -ssi cr_blcr_context_file context.mpirun.2674
MPI_Recv: process in local group is dead (rank 0, MPI_COMM_WORLD)
Rank (0, MPI_COMM_WORLD): Call stack within LAM:
Rank (0, MPI_COMM_WORLD): - MPI_Recv()
Rank (0, MPI_COMM_WORLD): - MPI_Reduce()
Rank (0, MPI_COMM_WORLD): - main()
-----------------------------------------------------------------------------
It seems that [at least] one of the processes that was started with
mpirun did not invoke MPI_INIT before quitting (it is possible that
more than one process did not invoke MPI_INIT -- mpirun was only
notified of the first one, which was on node n0).
I get this error even though there is a context file for mpirun and a
separate context file in $HOME for each of the two cpi processes on nodes
n0 and n1. Sometimes the same error is reported for rank 1 instead of rank
0, and sometimes for both. After trying lamrestart about 5-10 times, I see
a lot of cr_restart processes in top on the other node (n1), and all of
them are zombies. After that I also get these errors:
fork(): Resource temporarily unavailable
mpirun can *only* be used with MPI programs (i.e., programs that
invoke MPI_INIT and MPI_FINALIZE). You can use the "lamexec" program
to run non-MPI programs over the lambooted nodes.
I don't know how to proceed to get lamrestart working on my cluster. Could
somebody help me get LAM + BLCR up and running? Thanks in advance.