Hi,
Sorry for the delayed reply. It is possible to recover the lost lamds
without shutting down and restarting the whole system. To do this, always
boot up in fault-tolerant mode:
lamboot -x <host-file>
This ensures that heartbeats are exchanged between the lamds periodically
to make sure that all the nodes are up and running. If a failure is
detected, the surviving lamds and processes are notified. This gives
graceful degradation, and recovery is then possible using lamgrow. Please
see the man page for lamboot for more details. Note that it does take some
time for crashes to be detected.
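Since detection is not immediate, one way to wait for it is to poll
lamnodes until a node is flagged. The sketch below assumes the lamnodes
output format shown in the example further down ("nN host:cpus:flags",
with failed nodes marked "invalid node"); the polling interval is an
arbitrary choice, and the demonstration runs against a captured sample
line rather than a live lamnodes call:

```shell
#!/bin/sh
# has_failure reads lamnodes-style output on stdin and succeeds if any
# node is flagged "invalid node".
has_failure() {
    grep -q 'invalid node'
}

# In a live session you would poll, e.g. every 30 seconds (arbitrary):
#   until lamnodes | has_failure; do sleep 30; done

# Demonstration against a captured sample line:
if printf 'n2 dummy3.x.edu:1:invalid node\n' | has_failure; then
    echo "failure detected"
fi
```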
Once the crashes have been detected, you can use lamnodes to see which of
the nodes have crashed and then use lamgrow to add them back to the LAM
universe. For example:
#cat hostfile
dummy1.x.edu
dummy2.x.edu
dummy3.x.edu
dummy4.x.edu
#lamboot -x hostfile
#lamnodes
n0 dummy1.x.edu:1:origin, this_node
n1 dummy2.x.edu:1:
n2 dummy3.x.edu:1:
n3 dummy4.x.edu:1:
Now suppose the lamd on dummy3 fails. After some time, the lamnodes
command will give you the following output:
#lamnodes
n0 dummy1.x.edu:1:origin, this_node
n1 dummy2.x.edu:1:
n2 dummy3.x.edu:1:invalid node
n3 dummy4.x.edu:1:
Then you can restart only the failed nodes using the lamgrow command:
#lamgrow dummy3.x.edu
#lamnodes
n0 dummy1.x.edu:1:origin, this_node
n1 dummy2.x.edu:1:
n2 dummy3.x.edu:1:invalid node
n3 dummy4.x.edu:1:
n4 dummy3.x.edu:1:
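The two steps above (lamnodes to find the invalid entries, lamgrow to
re-add them) can be scripted. Below is a minimal sketch that assumes the
lamnodes output format shown above; to stay self-contained it parses a
captured sample instead of calling lamnodes, and only echoes the lamgrow
command it would run in a live session:

```shell
#!/bin/sh
# Captured lamnodes output (same format as the example above).
sample='n0 dummy1.x.edu:1:origin, this_node
n1 dummy2.x.edu:1:
n2 dummy3.x.edu:1:invalid node
n3 dummy4.x.edu:1:'

# Extract the hostname (second whitespace-separated word before the
# first colon) of every node flagged "invalid node".
failed=$(printf '%s\n' "$sample" |
    awk -F: '/invalid node/ { split($1, a, " "); print a[2] }')

for host in $failed; do
    # In a live session this would be:  lamgrow "$host"
    echo "would run: lamgrow $host"
done
```

Note that, as the example output shows, the re-added host gets a new node
id (n4) while the old entry stays marked invalid.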
Hope this helps,
Anju
On Wed, 16 Jun 2004, Lily Li wrote:
>
> We use LAM on Linux clusters in production mode. LAM is booted at the
> beginning of production, and stays on for weeks to months until the
> next reboot.
> Many big (>100 tasks) and small (1 task) jobs run concurrently on this
> cluster, sharing the same LAM.
>
> Our problem is that sometimes the lamd crashes on some nodes while
> other nodes are still OK; we may have jobs running on those other
> nodes and can't reboot the LAM.
>
> Is it possible to recover these lost lamds on the nodes?
>
> lamshrink will not return if the lamds are already lost.
>
> Thanks in advance for all your help.
>
> Lily
>