We use LAM on Linux clusters in production mode. LAM is booted at the
beginning of production and stays up for weeks to months until the next
reboot.
Many big (>100 tasks) and small (1 task) jobs run concurrently on this
cluster, sharing the same LAM.
Our problem is that lamd sometimes crashes on some nodes while the other
nodes are still OK. Because jobs may still be running on those other nodes,
we can't reboot LAM.
Is it possible to recover the lost lamds on those nodes?
lamshrink will not return if the lamds are already lost.
Thanks in advance for all your help.
Lily