
LAM/MPI General User's Mailing List Archives


From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2005-07-19 08:47:28


On Jul 18, 2005, at 4:57 PM, Pradeep Padala wrote:

> I am trying to understand how LAM/MPI with BLCR works with Torque +
> Maui.
> I want to submit an MPI job through Maui+Torque, checkpoint it, kill it
> and restart it later. This is what I am doing.
>
> I have set up two pretty much identical FC4 machines with Torque+Maui
> and LAM/MPI combined with BLCR.
> 1. I am using a submission script similar to the following:
> #PBS -l nodes=2:ppn=1
> lamboot $PBS_NODEFILE
> mpirun -ssi rpi crtcp -np 2 hello
> lamhalt
> 2. I run lamcheckpoint on the node where mpirun runs and I get a single
> context file in my home directory. According to the documentation,
> context files for each mpi job should be generated.

Correct.

You might want to try this interactively (e.g., use "qsub -I ...") and
see if you can get the same results.

Also double check that your LD_LIBRARY_PATH is set properly on all nodes
such that the BLCR library can be found.
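
One way to check this is sketched below. It is only an illustration: find_libcr is a hypothetical helper name, and it assumes the BLCR shared library is installed under a name starting with libcr.so, which may differ on your system.

```sh
#!/bin/sh
# find_libcr (hypothetical helper): print the first directory in a
# colon-separated search path that contains a libcr shared object.
# Returns 1 if no directory in the path contains one.
# Assumption: the BLCR library file name starts with "libcr.so".
find_libcr() {
    search_path=$1
    old_ifs=$IFS
    IFS=:
    for dir in $search_path; do
        IFS=$old_ifs
        [ -n "$dir" ] || continue        # skip empty path entries
        for f in "$dir"/libcr.so*; do
            if [ -e "$f" ]; then
                echo "$dir"
                return 0
            fi
        done
    done
    IFS=$old_ifs
    return 1
}

# Possible use on each node of the job (not run here):
#   find_libcr "$LD_LIBRARY_PATH" || echo "libcr not found on $(hostname)"
```

Keep in mind that the environment of a non-interactive shell or of a Torque job can differ from your login shell, so it is worth running the check from inside the job script itself.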

> 3. Now, the tasks just continue. How do I induce faults here? If I kill
> any of the tasks (the hello programs in this case), lamd thinks that
> the job is done and exits, cleaning up the context file as well.

None of the LAM infrastructure should ever remove checkpoint files,
except when replacing them with new ones (e.g., when you checkpoint a
parallel job a second time).

> My question is, how do I induce faults? Should I kill lamd? Is there
> any way I can kill the process with lamcheckpoint, something like
> lamcheckpoint --kill <mpirun pid>?

LAM doesn't currently support such an option. Another way of doing it,
however, would be to run lamcheckpoint and then lamclean.
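
As a rough sketch, the two steps could be wrapped as follows. checkpoint_then_kill is a hypothetical function name, and the lamcheckpoint arguments shown (-ssi cr blcr -pid) are what I would expect for the BLCR module; check them against the lamcheckpoint man page for your installation.

```sh
#!/bin/sh
# checkpoint_then_kill (hypothetical helper): checkpoint the job that
# belongs to the given mpirun PID and, only if that succeeds, remove the
# user's processes from the LAM universe with lamclean.
# Assumption: lamcheckpoint accepts "-ssi cr blcr -pid <PID>".
checkpoint_then_kill() {
    mpirun_pid=$1
    if lamcheckpoint -ssi cr blcr -pid "$mpirun_pid"; then
        lamclean
    else
        echo "checkpoint failed; leaving the job running" >&2
        return 1
    fi
}

# Possible use (not run here), on the node where mpirun is running:
#   checkpoint_then_kill <mpirun pid>
```

Running lamclean only after a successful checkpoint avoids killing a job that has no usable context files.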

> 4. How is lamrestart supposed to be used? When I run lamrestart on
> the context file generated, nothing happens. No new jobs in Torque,
> no new processes on the machine. Can lamrestart submit the
> (restarted) job back to Torque?

No. lamrestart will only restart the LAM job. Hence, you must
manually obtain a new Torque job and set up a new LAM universe (or
re-use the old one) and then you can lamrestart the old job.
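
Inside the new Torque job, the sequence might look roughly like the sketch below. restart_lam_job is a hypothetical name, the lamrestart arguments (-ssi cr blcr -ssi cr_blcr_context_file) should be checked against the lamrestart man page, and the sketch assumes lamrestart does not return until the restarted job has finished. If it returns earlier on your system, drop the lamhalt.

```sh
#!/bin/sh
# restart_lam_job (hypothetical helper): boot a LAM universe on the nodes
# listed in a node file, restart a checkpointed job from mpirun's context
# file, then shut the universe down. Returns lamrestart's exit status.
# Assumption: lamrestart accepts
#   -ssi cr blcr -ssi cr_blcr_context_file <file>
restart_lam_job() {
    context_file=$1
    nodefile=$2
    if [ ! -r "$context_file" ]; then
        echo "cannot read context file: $context_file" >&2
        return 1
    fi
    lamboot "$nodefile" || return 1
    status=0
    lamrestart -ssi cr blcr -ssi cr_blcr_context_file "$context_file" \
        || status=$?
    lamhalt
    return "$status"
}

# Possible use in a Torque script (not run here); the context file path
# is whatever lamcheckpoint wrote for mpirun:
#   restart_lam_job <mpirun context file> "$PBS_NODEFILE"
```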

> 5. If I try cr_restart on the context file, I get a seg fault.

I suspect that the problem is that you're not getting context files for
all the parallel processes, and this leads to badness during the
restart process. However, this could be due to BLCR itself failing.

Are you able to checkpoint / restart serial processes?
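
A quick serial test, with LAM out of the picture entirely, could look something like this sketch. serial_cr_test is a hypothetical name. It assumes a BLCR release that provides cr_run (older releases may need the program linked against or preloaded with libcr instead) and that cr_checkpoint writes context.<pid> in the current directory by default.

```sh
#!/bin/sh
# serial_cr_test (hypothetical helper): start a non-MPI program under
# BLCR, checkpoint it, kill it, and restart it from the context file.
# Assumptions: cr_run is available and cr_checkpoint <pid> writes
# ./context.<pid>.
serial_cr_test() {
    cr_run "$@" &                        # start the program under BLCR
    pid=$!
    sleep 2                              # give it a moment to start
    if ! cr_checkpoint "$pid"; then
        kill "$pid" 2>/dev/null || true
        return 1
    fi
    kill "$pid" 2>/dev/null || true      # simulate the failure
    wait "$pid" 2>/dev/null || true
    cr_restart "context.$pid"            # resume from the checkpoint
}

# Possible use (not run here), with any long-running serial program:
#   serial_cr_test ./my_serial_program
```

If this fails as well, the problem is in the BLCR installation rather than in LAM.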

-- 
{+} Jeff Squyres
{+} jsquyres_at_[hidden]
{+} http://www.lam-mpi.org/