LAM/MPI General User's Mailing List Archives

From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2005-07-19 10:55:07


On Jul 19, 2005, at 11:38 AM, Pradeep Padala wrote:

> There seems to be some problem with blcr as I get only one context
> file.
> Does LAM/MPI depend on a particular version of blcr?

No.

>> No. lamrestart will only restart the LAM job. Hence, you must
>> manually obtain a new Torque job and setup a new LAM universe (or
>> re-use the old one) and then you can lamrestart the old job.
>
> Let me get this straight. I lamcheckpoint the job, collect the context
> files, run lamclean and run a new job that contains a lamrestart. Am I
> right?

Correct.

Also note that we only support restarting in the same topology that you
were using before. Specifically, the MPI_COMM_WORLD ranks must be on
the same relative nodes that they were on before, and must support the
same RPIs. So if you had a topology like this:

        n0: MPI_COMM_WORLD ranks 0, 1, 2, 3
        n1: MCW ranks 4, 5
        n2: MCW rank 6
        n3: MCW rank 7

You have to use the same topology on the restart.
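
If you want to sanity-check a planned restart layout against the original
one, something like the following rough sketch might help. It is not part
of LAM/MPI; the helper names (parse_topology, moved_ranks) are made up for
illustration, and it assumes the layout is written in the "nN: ... ranks
..." form shown above. It only compares rank-to-node placement and does
not check RPIs.

```python
import re

def parse_topology(text):
    """Map each MPI_COMM_WORLD rank to its relative node index.

    Expects lines such as 'n1: MCW ranks 4, 5' (an assumed format,
    taken from the listing in this mail).
    """
    rank_to_node = {}
    for line in text.strip().splitlines():
        node, _, rest = line.partition(":")
        node_index = int(node.strip().lstrip("n"))
        # Every number after the colon is treated as a rank on this node.
        for rank in re.findall(r"\d+", rest):
            rank_to_node[int(rank)] = node_index
    return rank_to_node

def moved_ranks(before, after):
    """Return the sorted ranks whose node differs, or that are missing."""
    all_ranks = set(before) | set(after)
    return sorted(r for r in all_ranks if before.get(r) != after.get(r))

ORIGINAL = """
n0: MPI_COMM_WORLD ranks 0, 1, 2, 3
n1: MCW ranks 4, 5
n2: MCW rank 6
n3: MCW rank 7
"""

before = parse_topology(ORIGINAL)
# Hypothetical bad restart layout: rank 7 lands on n2 instead of n3.
after = parse_topology(ORIGINAL.replace("n3: MCW rank 7", "n2: MCW rank 7"))

print(moved_ranks(before, before))  # [] means the layouts match
print(moved_ranks(before, after))   # [7]
```

An empty list means every rank sits on the same relative node as before;
any rank listed would need to be moved back before running lamrestart.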

>> Are you able to checkpoint / restart serial processes?
>
> Yes. Actually, I have integrated Torque+Maui with blcr as I am working
> on some fault tolerance research. It would be great if I can checkpoint
> LAM/MPI jobs as well.

What version of LAM/MPI are you using? (I should have asked this in my
prior mail -- sorry)

There was a problem with BLCR support in 7.1 and 7.1.1 -- we fixed it a
while ago in the 7.1.2 betas (see http://www.lam-mpi.org/beta/). This
might well be your problem: BLCR support was effectively ignored in the
MPI processes, so you got a context file only for mpirun.

-- 
{+} Jeff Squyres
{+} jsquyres_at_[hidden]
{+} http://www.lam-mpi.org/