LAM/MPI General User's Mailing List Archives

From: Josh Hursey (jjhursey_at_[hidden])
Date: 2007-03-29 13:52:55


On Mar 28, 2007, at 3:30 PM, Nannan Ayya wrote:

> Hi Josh,
> That did help, THANKS. With the crtcp module, lamcheckpoint
> and lamrestart worked.

Awesome :)

> I have another question. If a node fails, can I use another idle
> node to fill its position? Would copying the context file
> context.31219-n0-31220 from node n0 and placing it in $HOME of n3
> with an appropriate file name help?

LAM/MPI currently does not support automatic migration of processes
from a failed machine to another machine in the allocation within
the same execution of mpirun. LAM/MPI requires that, upon failure of
a node, the MPI job be fully terminated. You can (or should be able
to -- I haven't verified this) restart within the same LAM universe
by using the "lamshrink" command to remove the failed node from the
universe.
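For example (node n3 here is just a placeholder -- substitute
whichever node actually failed; I have not verified this recently):

shell$ lamnodes          # see which nodes are currently in the universe
shell$ lamshrink n3      # remove the failed node (n3) from the universe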

If you do not have a shared file system then you will need to
manually move the checkpoint file from the failed machine to the
machine on which it is restarted. LAM/MPI relies on the assumption
that checkpoint files are stored on a globally mounted directory.
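As a sketch, using the file name from your earlier message (the
target node and path are placeholders, and I have not verified that
a hand-copied context file restarts cleanly):

shell$ scp $HOME/context.31219-n0-31220 n3:$HOME/   # run on n0, or wherever the file survives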

> I guess it won't (and I tried it also), but I am very interested to
> know what modifications, and where, would be needed to make such
> migration possible. I want to be able to copy the checkpoint file
> from one node to another in case of failure (even manually is fine)
> and then do the same lamrestart and get the MPI job running with the
> new node replacing the old failed node. I would be grateful if you
> can give me some clue as to where the schema (specifying the new
> node) has to be mentioned when doing lamrestart to support this
> kind of migration.

In the current version of LAM/MPI you will need to:
  1. Kill the mpirun command if it has not already terminated
  2. Clean the LAM environment (using lamwipe or such commands)
  3. Use 'lamshrink' to remove the failed machine from the LAM universe
  4. Optionally "lamgrow" the LAM universe adding a replacement machine
  5. lamrestart your application from checkpoint.

This should work, but I have not tried this specific scenario in
quite a long time.
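In terms of commands, the sequence would look roughly like this (the
node name, hostname, pid, and context file name are placeholders; I
have written lamclean for step 2, since lamwipe tears down the whole
universe, which is not what you want if you plan to lamshrink/lamgrow
within it):

shell$ kill <mpirun-pid>    # step 1, only if mpirun is still running
shell$ lamclean             # step 2: clean leftover MPI job state
shell$ lamshrink n2         # step 3: drop the failed node (here n2)
shell$ lamgrow newhost      # step 4 (optional): add a replacement machine
shell$ lamrestart -ssi cr blcr -ssi cr_blcr_context_file context.mpirun.<pid>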

If you are interested in doing automatic migration of processes
within a single run of mpirun, then this will unfortunately require
quite a few changes to the internals of LAM/MPI. A group from NCSU
and ORNL recently published a paper at IPDPS 2007 in which they seem
to have this working with LAM/MPI; you may be interested in it:
A Job Pause Service under LAM/MPI+BLCR for Transparent Fault Tolerance
http://www.ipdps.org/ipdps2007/2007_advance_program.html

I hope this helps a bit,
-- Josh

> Thanks
>
> On 3/28/07, Josh Hursey <jjhursey_at_[hidden]> wrote:
> Can you try running your application with the CR SSI parameters set:
> shell$ mpirun n0-1 -ssi cr blcr -ssi rpi crtcp cpi
> shell$ lamcheckpoint -ssi cr blcr -pid 2674
>
> With LAM/MPI you need to make sure you explicitly use the
> 'crtcp' (versus the 'tcp') RPI, since it contains the distributed
> coordination protocol.
>
> Let me know if this helps,
> Josh
>
> On Mar 23, 2007, at 12:44 PM, Crazy Fox wrote:
>
> > Hi,
> >
> > I am working with a four-node P3 cluster. I have installed LAM
> > 7.1.3 with BLCR support (blcr-0.5.0). I tried to checkpoint one of
> > the example MPI applications that comes along with LAM. I had no
> > problem checkpointing and restarting (using the blcr module) on a
> > single node. I lamboot'ed with two nodes, and checkpoint/restart
> > worked with mpirun from n0 when using mpirun on one node at a time
> > (i.e., mpirun n0 cpi and mpirun n1 cpi). When I run with two nodes
> > I am not able to restart. Checkpoint works, and context.mpirun plus
> > two more context files corresponding to the cpi processes on the
> > individual nodes get created. But when I try to restart I get some
> > LAM-specific errors. It would be great if someone could help me get
> > LAM + BLCR checkpoint/restart working on multiple nodes. Here is
> > the sequence of operations I did:
> >
> > $mpirun n0-1 cpi
> >
> > $lamcheckpoint -ssi cr blcr -pid 2674
> >
> > $ lamrestart -ssi cr blcr -ssi cr_blcr_context_file context.mpirun.2674
> > MPI_Recv: process in local group is dead (rank 0, MPI_COMM_WORLD)
> > Rank (0, MPI_COMM_WORLD): Call stack within LAM:
> > Rank (0, MPI_COMM_WORLD): - MPI_Recv()
> > Rank (0, MPI_COMM_WORLD): - MPI_Reduce()
> > Rank (0, MPI_COMM_WORLD): - main()
> >
> > ----------------------------------------------------------------------------
> > It seems that [at least] one of the processes that was started with
> > mpirun did not invoke MPI_INIT before quitting (it is possible that
> > more than one process did not invoke MPI_INIT -- mpirun was only
> > notified of the first one, which was on node n0).
> >
> > I was getting this error even though there are context files for
> > mpirun and separate context files in $HOME for the two cpi
> > processes on nodes n0 and n1. I sometimes get the same error with
> > Rank 1 instead of Rank 0, and sometimes both. And after trying
> > lamrestart about 5-10 times I find a lot of cr_restart processes in
> > top (on the other node, n1) and all of them are zombies. After that
> > I get these errors too:
> >
> > fork(): Resource temporarily unavailable
> > mpirun can *only* be used with MPI programs (i.e., programs that
> > invoke MPI_INIT and MPI_FINALIZE). You can use the "lamexec" program
> > to run non-MPI programs over the lambooted nodes.
> >
> >
> > I don't know how to proceed to get lamrestart working on my
> > cluster. Could somebody help me get LAM + BLCR up? Thanks in
> > advance.
> >
> >
> > _______________________________________________
> > This list is archived at http://www.lam-mpi.org/MailArchives/lam/
>
> ----
> Josh Hursey
> jjhursey_at_[hidden]
> http://www.open-mpi.org/
>
> _______________________________________________
> This list is archived at http://www.lam-mpi.org/MailArchives/lam/

----
Josh Hursey
jjhursey_at_[hidden]
http://www.open-mpi.org/