Hi All,
I am facing two problems with LAM on my cluster, which is an 8-node cluster running OSCAR.
1: The first problem is that if a node goes down because of a failure, LAM does not recognize it after it reboots. lamnodes still shows the old output, even though the node was down, and the rebooted node is not running lamd. Is this a problem with OSCAR or with LAM? I am using LAM 7.1.2.
2: The second problem is somewhat off topic for this group. Restarting a checkpointed job fails if the LAM runtime environment has changed, i.e. if lamboot is invoked after checkpointing but before restarting the application, the restart fails.
I hope somebody can help me out.
Thanks for your time.
Jane