Thanks a lot Josh. I will try lamrestart with a different node in the way
you said, once I get access to the cluster.
On 3/29/07, Josh Hursey <jjhursey_at_[hidden]> wrote:
>
>
> On Mar 28, 2007, at 3:30 PM, Nannan Ayya wrote:
>
> > Hi josh,
> > That did help, THANKS. With the crtcp module, lamcheckpoint
> > and lamrestart worked.
>
> Awesome :)
>
> > I have another question. If a node fails, can I use another idle
> > node to fill its position? Would copying the context file
> > context.31219-n0-31220 from node n0 and then placing it in $HOME of
> > n3 with an appropriate file name help?
>
> LAM/MPI currently does not support the automatic migration of
> processes from a failed machine to another machine in the allocation
> within the same execution of mpirun. LAM/MPI requires that, upon
> failure of a node, the MPI job is fully terminated. You can (or
> should be able to -- I haven't verified this) restart within the
> same LAM universe using the "lamshrink" command to remove the failed
> node from the universe.
>
> If you do not have a shared file system then you will need to
> manually move the checkpoint file from the failed machine to the
> machine on which it is restarted. LAM/MPI relies on the assumption
> that checkpoint files are stored on a globally mounted directory.
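>
> For example, to move a checkpoint taken on n0 over to n3 you could run
> something like this on n0 (the file name is the one from your example,
> and this assumes password-less ssh between the nodes; I have not
> verified whether the file also needs to be renamed for the new node):
>
> shell$ scp $HOME/context.31219-n0-31220 n3:$HOME/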
>
> > I guess it won't (and I tried it as well), but I am very interested
> > to know what modifications, and where, would be needed to make such
> > migration possible. I want to be able to copy the checkpoint file
> > from one node to another in case of failure (even manually is fine),
> > then do the same lamrestart and get the MPI job running with the
> > new node replacing the failed node. I would be grateful if you
> > could give me some clue as to where the schema (specifying the new
> > node) has to be mentioned when doing lamrestart to support this
> > kind of migration.
>
> In the current version of LAM/MPI you will need to (see the sketch
> after this list):
> 1. Kill the mpirun command if it has not already terminated
> 2. Clean the LAM environment (using lamwipe or such commands)
> 3. Use 'lamshrink' to remove the failed machine from the LAM universe
> 4. Optionally "lamgrow" the LAM universe adding a replacement machine
> 5. lamrestart your application from checkpoint.
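>
> A rough sketch of that sequence (untested; the PID, node ID, host
> name, and context file name below are only examples):
>
> shell$ kill <pid-of-mpirun>              # step 1, if mpirun is still running
> shell$ lamclean                          # step 2 (or lamwipe, as above)
> shell$ lamshrink n1                      # step 3: drop the failed node
> shell$ lamgrow replacement.example.com   # step 4: add a spare node
> shell$ lamrestart -ssi cr blcr \
>            -ssi cr_blcr_context_file context.mpirun.2674   # step 5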
>
> This should work, but I have not tried this specific scenario in
> quite a long time.
>
> If you are interested in doing automatic migration of processes
> within a single run of mpirun, then this will unfortunately require
> quite a few changes to the internals of LAM/MPI. Recently a group
> from NCSU and ORNL published a paper at IPDPS 2007 in which they
> appear to have this working with LAM/MPI; you may be interested in it:
> A Job Pause Service under LAM/MPI+BLCR for Transparent Fault Tolerance
> http://www.ipdps.org/ipdps2007/2007_advance_program.html
>
> I hope this helps a bit,
> -- Josh
>
> > Thanks
> >
> > On 3/28/07, Josh Hursey <jjhursey_at_[hidden]> wrote:
> > Can you try running your application with the CR SSI parameters set:
> > shell$ mpirun n0-1 -ssi cr blcr -ssi rpi crtcp cpi
> > shell$ lamcheckpoint -ssi cr blcr -pid 2674
> >
> > With LAM/MPI you need to make sure you explicitly use the
> > 'crtcp' (versus the 'tcp') RPI, since it contains the distributed
> > coordination protocol.
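> >
> > If you want to double-check which RPI modules your build includes,
> > laminfo should list them; I believe something along these lines will
> > show whether crtcp is available:
> > shell$ laminfo | grep rpi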
> >
> > Let me know if this helps,
> > Josh
> >
> > On Mar 23, 2007, at 12:44 PM, Crazy Fox wrote:
> >
> > > Hi,
> > >
> > > I am working with a four-node P3 cluster. I have installed LAM
> > > 7.1.3 with BLCR support (blcr-0.5.0). I tried to checkpoint one of
> > > the example MPI applications that come along with LAM. I had no
> > > problem checkpointing and restarting (using the blcr module) on a
> > > single node. I lamboot'ed with two nodes, and checkpoint/restart
> > > worked with mpirun from n0 when using mpirun on a single node
> > > (i.e., mpirun n0 cpi and mpirun n1 cpi). When I run on two nodes I
> > > am not able to restart. Checkpointing works, and context.mpirun
> > > plus two more context files corresponding to the individual nodes'
> > > cpi processes get created. But when I try to restart I get some
> > > LAM-specific errors. It would be great if someone could help me
> > > get LAM + BLCR checkpoint/restart working on multiple nodes. Here
> > > is the sequence of operations I did:
> > >
> > > $ mpirun n0-1 cpi
> > >
> > > $ lamcheckpoint -ssi cr blcr -pid 2674
> > >
> > > $ lamrestart -ssi cr blcr -ssi cr_blcr_context_file context.mpirun.2674
> > > MPI_Recv: process in local group is dead (rank 0, MPI_COMM_WORLD)
> > > Rank (0, MPI_COMM_WORLD): Call stack within LAM:
> > > Rank (0, MPI_COMM_WORLD): - MPI_Recv()
> > > Rank (0, MPI_COMM_WORLD): - MPI_Reduce()
> > > Rank (0, MPI_COMM_WORLD): - main()
> > >
> > > -----------------------------------------------------------------------------
> > > It seems that [at least] one of the processes that was started with
> > > mpirun did not invoke MPI_INIT before quitting (it is possible that
> > > more than one process did not invoke MPI_INIT -- mpirun was only
> > > notified of the first one, which was on node n0).
> > >
> > > I was getting this error even though there are context files for
> > > mpirun and separate context files in $HOME for the two cpi
> > > processes on nodes n0 and n1. I sometimes get the same error with
> > > Rank 1 instead of Rank 0, and sometimes with both. After trying
> > > lamrestart about 5-10 times, I find a lot of cr_restart processes
> > > in top (on the other node, n1), and all of them are zombies. After
> > > that I get these errors too:
> > >
> > > fork(): Resource temporarily unavailable
> > > mpirun can *only* be used with MPI programs (i.e., programs that
> > > invoke MPI_INIT and MPI_FINALIZE). You can use the "lamexec" program
> > > to run non-MPI programs over the lambooted nodes.
> > >
> > >
> > > I don't know how to proceed to get lamrestart working on my
> > > cluster. Could somebody help me get LAM + BLCR up? Thanks in
> > > advance.
> > >
> > >
> > > _______________________________________________
> > > This list is archived at http://www.lam-mpi.org/MailArchives/lam/
> >
> > ----
> > Josh Hursey
> > jjhursey_at_[hidden]
> > http://www.open-mpi.org/
> >
> > _______________________________________________
> > This list is archived at http://www.lam-mpi.org/MailArchives/lam/
> >
>
> ----
> Josh Hursey
> jjhursey_at_[hidden]
> http://www.open-mpi.org/
>
> _______________________________________________
> This list is archived at http://www.lam-mpi.org/MailArchives/lam/
>