LAM/MPI Development Mailing List Archives

From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2006-01-05 16:21:21


On Jan 4, 2006, at 1:39 AM, Liu Xuezhao wrote:

> I ran into this problem too. I think it is a bug in lam-7.1.1, but
> it has not been confirmed by the LAM developers.
> You can use the previous version, LAM 7.0.6, but I have found that
> 7.0.6 has another bug: it produces the file
> "/tmp/lam-xxx_at_node01/lam-crtcp-rank-0.txt", which prevents the
> execution from being restarted after lamhalt and lamboot are run.

Sorry for the delay in replying to this -- the holidays and other
fires prevented me from replying before now.

Yes, as Brian mentioned, 7.0.6 inadvertently created the logfile in
the LAM session directory, preventing you from restarting in a new
LAM universe. The problem was fixed somewhere along the way -- I
don't remember offhand if it was in 7.1 or 7.1.1, but I know that
7.1.1 does not have that problem.

> It seems that fault tolerance is not an essential feature of
> LAM/MPI? Can LAM's developers tell us something about the project
> plan for fault tolerance in LAM/MPI?

LAM/MPI is pretty much in a maintenance mode -- we are spending the
vast majority of our time on Open MPI these days (see the notice on
the front page of the LAM web site). LAM is certainly not going away
-- we will continue to provide bug fixes, etc. But little new work
is happening in LAM.

In Open MPI, we plan to continue our FT work as well as branch off in
several new directions of FT. In short, Open MPI is shaping up to be
a much better environment for FT experimentation and research than
LAM was (not that there is anything wrong with LAM -- it's just that
Open MPI was designed with all the experience gained from LAM and
several other systems, and therefore we did it "better").
Eventually, we'll support a variety of FT mechanisms in Open MPI.

Specifically, now that Open MPI is in a fairly stable state, BLCR
support is slated to be added to Open MPI this upcoming spring. Work
for this support is underway, but there's nothing interesting to
report to users yet -- the initial required infrastructure for FT is
being added right now.

>> BTW: The LAM/MPI version I use is 7.1.1
>>
>> And the problem seem to be when I invoke the
>>
>>> cr_checkpoint --term $PID_of_mpirun

Are you sure that blcr support is correctly installed in your LAM
installation? Check the output of laminfo:

shell$ laminfo | grep blcr
   SSI cr: blcr (API v1.0, Module v1.1)

If you see that "SSI" line, then blcr support is properly included in
your LAM installation. The question then becomes why the images were
not properly created when you cr_checkpointed mpirun.
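
If you want to script that check, something like the following would
do. It is only a sketch: the has_blcr function name is made up for
illustration, and the pattern assumes laminfo prints the "SSI cr: blcr"
line in the form shown above.

```shell
#!/bin/sh
# has_blcr (hypothetical helper): reads laminfo-style output on stdin
# and succeeds if a blcr checkpoint/restart (cr) module is listed.
has_blcr() {
    grep -Eq 'SSI cr:[[:space:]]*blcr'
}

# Only run the real check when laminfo is actually on the PATH.
if command -v laminfo >/dev/null 2>&1; then
    if laminfo | has_blcr; then
        echo "blcr support is present in this LAM installation"
    else
        echo "blcr support is NOT present in this LAM installation"
    fi
fi
```

If the check reports that blcr is missing, the problem is in how LAM
was built or installed rather than in the checkpoint step itself.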

The best way to find out is to turn up the verbosity of the cr system
and check that each step is happening properly. For example:

shell$ mpirun -ssi cr_verbose level:1000,stderr -ssi rpi crtcp \
        -ssi cr blcr -np 2 your_application

You initially should see a bunch of output to stderr indicating that
blcr was selected. When you invoke cr_checkpoint, you should see all
the steps that LAM goes through to checkpoint.

Does that happen for you?

--
{+} Jeff Squyres
{+} The Open MPI Project
{+} http://www.open-mpi.org/