LAM/MPI General User's Mailing List Archives

From: Jeffrey Squyres (jsquyres_at_[hidden])
Date: 2005-07-20 17:38:17


On Jul 19, 2005, at 2:27 PM, Pradeep Padala wrote:

>> It probably is. Sorry; I really should have brought that up in the
>> first mail. :-\
>
> That's ok. I tried the beta version and I still have the same problem.
> I see only one context file. Again, to be clear this is what I am
> doing.

If you're only seeing one context file, then lamrestart will definitely
not work. Some followup questions:

1. When you run laminfo, do you see the blcr module listed? I.e., was
the new beta built with BLCR support properly?

2. When you run MPI apps, do they have LD_LIBRARY_PATH set properly
such that the BLCR library can be found? You might want to try
something like:

        lamexec N env | grep LD_LIBRARY_PATH

Remember that the MPI processes will inherit the environment of the
lamd, which, in a PBS/Torque environment, *should* be the same as the
environment of the process that did lamboot (which, therefore, should
have the Right LD_LIBRARY_PATH -- but it's worth checking).

3. Is BLCR installed in the same directory on all machines in the
cluster (such that PATH and LD_LIBRARY_PATH can be the same to find all
the relevant BLCR parts on all nodes)?
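
The first two checks above can be wrapped in a small script. This is
only a sketch: it assumes a POSIX shell, the function names
(has_blcr_module, show_node_library_paths) are made up for illustration,
and the second function needs a booted LAM universe, so it is defined
but not called automatically.

```shell
#!/bin/sh
# Sketch only: the function names below are hypothetical, not part of LAM.

# Check 1: read laminfo output on stdin; succeed if a blcr module is mentioned.
has_blcr_module() {
    grep -qi 'blcr'
}

# Check 2: print LD_LIBRARY_PATH as seen on every node (same command as above).
# Needs a running LAM universe, so it is only defined here, not invoked.
show_node_library_paths() {
    lamexec N env | grep LD_LIBRARY_PATH
}

# Run check 1 only where laminfo is actually installed.
if command -v laminfo >/dev/null 2>&1; then
    if laminfo | has_blcr_module; then
        echo "laminfo lists a blcr module"
    else
        echo "no blcr module found in laminfo output"
    fi
fi
```

After lamboot, calling show_node_library_paths and comparing the lines
it prints is also a quick way to spot a node where the BLCR directory
differs from the others.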

> 1. I submit a job using the following script
> #PBS -l nodes=2:ppn=1
> setenv PATH ${PATH}:/usr/local/bin
> lamboot $PBS_NODEFILE

Random note: you don't need the $PBS_NODEFILE here; LAM will get the
list of nodes directly from PBS/Torque.

> mpirun -ssi rpi crtcp -np 2 /home/ppadala/kabru/progs/hello
> lamhalt
>
> The program is running on the two machines and I can see them with
> ps
>
> 2. I run lamcheckpoint on the machine that's running mpirun. I get one
> context file in my home dir.
>
> 3. I kill the processes (on both machines), Torque and Maui clean up
> their data structures
> 4. I run lamrestart on the context file on the machine where mpirun was
> run. At this stage nothing happens, no new processes or anything.
>
> Am I missing something here? Do I need to lamboot before lamrestart?
> Shouldn't I have lamhalt in the original script? Do I need to kill lamd
> instead of the process?

The real problem is that you're only getting one checkpoint file -- you
should get N+1 (i.e., one for each MPI process and one for mpirun).
Without that, you won't be able to restart properly.
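
As a rough way to verify the N+1 count before killing anything, something
like the following could be used. It is a sketch under assumptions: both
function names are hypothetical, the "context.*" file name pattern is a
guess that should be adjusted to whatever names actually show up on your
system, and it only counts files in one directory, so it presumes all
context files land on a shared filesystem.

```shell
#!/bin/sh
# Sketch only: function names and the default "context.*" pattern are assumptions.

# Files expected for a job with $1 MPI processes: one per process plus one for mpirun.
expected_context_files() {
    echo $(( $1 + 1 ))
}

# Count regular files in directory $1 whose names match glob $2 (default: context.*).
count_context_files() {
    dir=$1
    pattern=${2:-context.*}
    n=0
    # $pattern is left unquoted on purpose so the shell expands the glob.
    for f in "$dir"/$pattern; do
        if [ -f "$f" ]; then
            n=$((n + 1))
        fi
    done
    echo "$n"
}
```

For the "-np 2" run quoted below, expected_context_files 2 gives 3, so
a count_context_files "$HOME" result of 1 matches the symptom described
here.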

But to answer your questions:

1. No
2. If you don't have an existing LAM universe, yes, you will need to
run lamboot again. But you can certainly restart in the same universe
that you originally ran in, too.
3. lamhalt brings down the LAM universe, so I'm not sure what you're
asking here. In general, it is social to have a lamhalt at the end of
your job submission script.
4. No, killing the process is a good way to simulate faults -- killing
the lamd is an option, but it makes it harder to clean up your LAM
environment properly.

-- 
{+} Jeff Squyres
{+} jsquyres_at_[hidden]
{+} http://www.lam-mpi.org/