LAM/MPI Development Mailing List Archives

From: Brian Barrett (brbarret_at_[hidden])
Date: 2006-06-08 00:11:26


On May 29, 2006, at 6:27 AM, J J wrote:

> I am facing two problems with LAM on my cluster. I am currently
> using an 8-node cluster running OSCAR.
> 1: The first problem is that if a node goes down due to a
> failure, LAM does not identify it after it reboots. That is,
> lamnodes still shows the old output even though the node was down
> and the rebooted node is not running lamd. Is this a problem with
> OSCAR or with LAM? I am using LAM 7.1.2.

The default behavior in LAM is to assume nodes never fail. If you
want to run LAM in a situation where the lamds detect the failure,
you can add the -x option to lamboot. This will activate some code
to allow the lam daemons to detect node failures and automatically
shrink the universe.
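
A minimal sketch of what that might look like, assuming a boot schema file named lamhosts.example and node names node1 and node2 (both are hypothetical placeholders, not taken from the original report). The -v flag only adds verbose output, and the lamboot call is skipped if LAM is not installed on the machine where this runs:

```shell
# Hypothetical boot schema; replace the names with your own nodes.
HOSTFILE="${HOSTFILE:-./lamhosts.example}"
printf '%s\n' node1 node2 > "$HOSTFILE"

if command -v lamboot >/dev/null 2>&1; then
    # -x boots the universe in fault-tolerant mode; -v is verbose.
    lamboot -x -v "$HOSTFILE"
    # List the nodes currently in the universe.
    lamnodes
else
    echo "lamboot not found in PATH; skipping boot" >&2
fi
```

After a node failure, running lamnodes again should show whether the failed node has been dropped from the universe.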

> 2: The second problem is a little off-topic for this list.
> Restarting a checkpointed job fails if the LAM runtime environment
> has changed, i.e., if lamboot is invoked after checkpointing but
> before restarting the application, the restart fails.
> Hope somebody can help me out.

How does it fail, exactly?

Brian

-- 
   Brian Barrett
   LAM/MPI developer and all around nice guy
   Have a LAM/MPI day: http://www.lam-mpi.org/