Can you try running your application with the CR SSI parameters set:
shell$ mpirun n0-1 -ssi cr blcr -ssi rpi crtcp cpi
shell$ lamcheckpoint -ssi cr blcr -pid 2674
With LAM/MPI you need to make sure you explicitly use the
'crtcp' (versus the 'tcp') RPI, since it contains the distributed
coordination protocol.
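
Put together, the whole cycle might look like the sketch below. It only
prints the three commands for a given mpirun PID instead of running them.
The helper name lam_cr_commands is made up for illustration, the node range
and the cpi binary are the ones from this thread, and the lamrestart line is
taken from the original message rather than re-verified here.

```shell
#!/bin/sh
# Hypothetical helper: print the launch, checkpoint, and restart commands
# for one mpirun PID. Nothing is executed; this is a dry-run sketch.
lam_cr_commands() {
    pid=$1
    # Launch with the BLCR checkpoint module and the crtcp RPI.
    echo "mpirun n0-1 -ssi cr blcr -ssi rpi crtcp cpi"
    # Checkpoint the running job, addressed by mpirun's PID.
    echo "lamcheckpoint -ssi cr blcr -pid $pid"
    # Restart from the mpirun context file written by the checkpoint.
    echo "lamrestart -ssi cr blcr -ssi cr_blcr_context_file context.mpirun.$pid"
}

lam_cr_commands 2674
```

The point to check is the first command: if the job was not started with
'-ssi rpi crtcp', the later two steps are operating on a job that was
launched without the coordination protocol.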
Let me know if this helps,
Josh
On Mar 23, 2007, at 12:44 PM, Crazy Fox wrote:
> Hi,
>
> I am working with a four-node P3 cluster. I have installed LAM
> 7.1.3 with BLCR support (blcr-0.5.0). I tried to checkpoint one of
> the example MPI applications that come with LAM. I had no
> problem checkpointing and restarting (using the blcr module) on a
> single node. I lamboot'ed with two nodes, and checkpoint/restart
> worked from n0 when mpirun used only one node (i.e.,
> mpirun n0 cpi & mpirun n1 cpi). When I run with two nodes I am not
> able to restart. Checkpoint works: context.mpirun and two
> more context files corresponding to the individual nodes' cpi
> processes get created. But when I try to restart I get some
> LAM-specific errors. It would be great if someone could help me
> get LAM + BLCR checkpoint/restart working on multiple
> nodes. Here is the sequence of operations I performed:
>
> $ mpirun n0-1 cpi
>
> $ lamcheckpoint -ssi cr blcr -pid 2674
>
> $ lamrestart -ssi cr blcr -ssi cr_blcr_context_file context.mpirun.2674
> MPI_Recv: process in local group is dead (rank 0, MPI_COMM_WORLD)
> Rank (0, MPI_COMM_WORLD): Call stack within LAM:
> Rank (0, MPI_COMM_WORLD): - MPI_Recv()
> Rank (0, MPI_COMM_WORLD): - MPI_Reduce()
> Rank (0, MPI_COMM_WORLD): - main()
> -----------------------------------------------------------------------------
> It seems that [at least] one of the processes that was started with
> mpirun did not invoke MPI_INIT before quitting (it is possible that
> more than one process did not invoke MPI_INIT -- mpirun was only
> notified of the first one, which was on node n0).
>
> I get this error even though there are context files for
> mpirun and separate context files in $HOME for the two cpi processes
> on nodes n0 and n1. I sometimes get the same error with rank 1
> instead of rank 0, and sometimes with both. After trying lamrestart
> about 5-10 times, I find a lot of cr_restart processes in top (on
> the other node, n1), and all of them are zombies. After that I get
> these errors too:
>
> fork(): Resource temporarily unavailable
> mpirun can *only* be used with MPI programs (i.e., programs that
> invoke MPI_INIT and MPI_FINALIZE). You can use the "lamexec" program
> to run non-MPI programs over the lambooted nodes.
>
>
> I don't know how to proceed to get lamrestart working on my
> cluster. Could somebody help me get LAM + BLCR up? Thanks in advance.
>
>
> _______________________________________________
> This list is archived at http://www.lam-mpi.org/MailArchives/lam/
----
Josh Hursey
jjhursey_at_[hidden]
http://www.open-mpi.org/