On May 29, 2006, at 6:27 AM, J J wrote:
> I am facing two problems with LAM on my cluster. I am presently
> using an 8-node cluster running OSCAR.
> 1: The first problem is that if a node goes down due to a
> failure, LAM does not recognize it after it reboots. That is,
> lamnodes still shows the old output even though the node was down
> and the rebooted node is not running lamd. Is this a problem with
> OSCAR or with LAM? I am using LAM 7.1.2.
The default behavior in LAM is to assume nodes never fail. If you
want to run LAM in a situation where the lamds detect the failure,
you can add the -x option to lamboot. This will activate some code
to allow the LAM daemons to detect node failures and automatically
shrink the universe.
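
As a rough sketch of what that looks like on the command line (the
boot schema file name "lamhosts" is only a placeholder, and the check
for lamboot is there so the snippet does nothing harmful on a machine
without LAM installed):

```shell
#!/bin/sh
# Placeholder boot schema (one host per line); substitute your own file.
HOSTFILE="${HOSTFILE:-lamhosts}"

if command -v lamboot >/dev/null 2>&1; then
    # -x boots the universe with failure detection enabled;
    # lamnodes then lists the nodes currently in the universe.
    lamboot -x "$HOSTFILE" && lamnodes
else
    echo "lamboot not found in PATH" >&2
fi
```

Running lamnodes again after a node has failed is one way to check
whether the universe has actually shrunk.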
> 2: The second problem is a little off-topic for this group.
> Restarting a checkpointed job fails if the LAM runtime environment
> has changed, i.e., if lamboot is invoked after checkpointing but
> before restarting the application.
> Hope somebody can help me out.
How does it fail, exactly?
Brian
--
Brian Barrett
LAM/MPI developer and all around nice guy
Have a LAM/MPI day: http://www.lam-mpi.org/