On Jul 19, 2005, at 11:38 AM, Pradeep Padala wrote:
> There seems to be some problem with blcr as I get only one context
> file.
> Does LAM/MPI depend on a particular version of blcr?
No.
>> No. lamrestart will only restart the LAM job. Hence, you must
>> manually obtain a new Torque job and setup a new LAM universe (or
>> re-use the old one) and then you can lamrestart the old job.
>
> Let me get this straight. I lamcheckpoint the job, collect the context
> files, run lamclean and run a new job that contains a lamrestart. Am I
> right?
Correct.
Also note that we only support restarting in the same topology that you
were using before. Specifically, the MPI_COMM_WORLD ranks must be on
the same relative nodes as before and must support the same RPIs. So if
you had a topology like this:
  n0: MPI_COMM_WORLD (MCW) ranks 0, 1, 2, 3
  n1: MCW ranks 4, 5
  n2: MCW rank 6
  n3: MCW rank 7
You have to use the same topology on the restart.
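To make the constraint concrete, here is a minimal sketch of a pre-restart sanity check. It is not part of LAM/MPI: the function name and the dict layout (node name mapped to a list of MCW ranks) are assumptions made for illustration, and it only compares rank placement, not RPI support.

```python
# Illustrative helper only; not a LAM/MPI tool.
# A layout maps a node name (e.g. "n0") to the MCW ranks placed on it.
def same_topology(before, after):
    """Return True if every node hosts exactly the same MCW ranks."""
    def normalize(layout):
        # Ignore rank ordering and nodes that host no ranks.
        return {node: sorted(ranks) for node, ranks in layout.items() if ranks}
    return normalize(before) == normalize(after)


# The layout from the example above.
original = {"n0": [0, 1, 2, 3], "n1": [4, 5], "n2": [6], "n3": [7]}
# Same ranks overall, but rank 3 has moved from n0 to n1.
shifted = {"n0": [0, 1, 2], "n1": [3, 4, 5], "n2": [6], "n3": [7]}

print(same_topology(original, original))  # True
print(same_topology(original, shifted))   # False
```

A restart onto the "shifted" layout would not be supported, even though the total number of ranks and nodes is unchanged.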
>> Are you able to checkpoint / restart serial processes?
>
> Yes. Actually, I have integrated Torque+Maui with blcr as I am working
> on some fault tolerance research. It would be great if I can checkpoint
> LAM/MPI jobs as well.
What version of LAM/MPI are you using? (I should have asked this in my
prior mail -- sorry)
There was a problem with BLCR support in 7.1 and 7.1.1 -- we fixed it a
while ago in the 7.1.2 betas (see http://www.lam-mpi.org/beta/). This
might well be your problem: BLCR support was effectively ignored in the
MPI processes, so you only got the context file for mpirun.
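A quick way to spot this symptom is to count the context files after a checkpoint. The sketch below assumes one context file for mpirun plus one per MPI process, and assumes the files match the glob "context.*" in a single directory; both the helper and the naming pattern are illustrative assumptions, so adjust them to whatever your installation actually writes.

```python
import glob
import os


# Illustrative helper only; the "context.*" pattern is an assumption.
def missing_context_files(directory, n_ranks, pattern="context.*"):
    """Return how many context files are missing from `directory`.

    Expects one file per MPI process plus one for mpirun.
    """
    found = len(glob.glob(os.path.join(directory, pattern)))
    expected = n_ranks + 1
    return max(expected - found, 0)
```

For the 8-rank job above, a directory holding only the mpirun context file would report 8 missing files.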
--
{+} Jeff Squyres
{+} jsquyres_at_[hidden]
{+} http://www.lam-mpi.org/