Hi
We've been using LAM for a little over a year.
I've just updated the nodes we run LAM on to Fedora Core 2 and LAM to
7.1.1, and we are now seeing a problem that was not present before. We
run these diskless nodes with an NFS-mounted root and /tmp. To provoke
the problem:
1) lamboot a universe with, say, 4 nodes
2) lamhalt
3) Each node in the LAM universe now has tkill running at 100% CPU
I can run MPI jobs between 1) and 2) without problems.
After 2), lamd appears to shut down cleanly.
I've attached strace to one of the tkill processes and it's in a tight
loop trying to unlink e.g. .nfs006a081000000345 in the /tmp/lam-rmj_at_blah
directory.
This file has the same inode number as the file lam-kernel-socket had
before I ran tkill.
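
For anyone wanting to check which process still holds the socket, here is a
minimal sketch, assuming a Linux /proc layout in which /proc/net/unix lists
bound Unix sockets as "Num RefCount Protocol Flags Type St Inode Path". The
function names are hypothetical and nothing here is part of LAM.

```python
import os

def socket_inodes_for_path(proc_net_unix_text, sock_path):
    """Return socket inode numbers of Unix sockets bound to sock_path."""
    inodes = set()
    for line in proc_net_unix_text.splitlines()[1:]:  # skip the header row
        # Assumed columns: Num RefCount Protocol Flags Type St Inode [Path]
        fields = line.strip().split(None, 7)
        if len(fields) == 8 and fields[7] == sock_path:
            inodes.add(int(fields[6]))
    return inodes

def pids_holding_sockets(inodes, proc_root="/proc"):
    """Return sorted PIDs with an open fd that refers to one of the inodes."""
    wanted = {"socket:[%d]" % i for i in inodes}
    pids = set()
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a process directory
        fd_dir = os.path.join(proc_root, entry, "fd")
        try:
            fds = os.listdir(fd_dir)
        except OSError:
            continue  # process exited, or we lack permission
        for fd in fds:
            try:
                if os.readlink(os.path.join(fd_dir, fd)) in wanted:
                    pids.add(int(entry))
            except OSError:
                continue  # fd closed while we were scanning
    return sorted(pids)
```

To use it, read /proc/net/unix on the affected node, pass its text and the
full path of lam-kernel-socket to socket_inodes_for_path, then pass the
result to pids_holding_sockets. Run it as root to see other users' processes.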
I've also tried the following:
1) lamboot a universe with, say, 4 nodes
2) run tkill on each node by hand
Under these circumstances tkill just works.
It looks like something is keeping the lam-kernel-socket open when tkill
is run from lamhalt, so that the tmp directory can't be emptied and
tkill goes into an infinite loop.
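
To illustrate the failure mode, here is a minimal sketch of a cleanup loop
that gives up after a fixed number of passes instead of spinning forever. It
is not LAM's code (tkill is not written in Python); empty_directory,
max_passes and delay are hypothetical names, and the sketch assumes a flat
directory whose entries may survive an unlink attempt, as the .nfs file does
here.

```python
import os
import time

def empty_directory(path, max_passes=5, delay=0.0):
    """Try to unlink every entry in path.

    Returns the names still present after at most max_passes attempts;
    an empty list means the directory was emptied.
    """
    for _ in range(max_passes):
        remaining = os.listdir(path)
        if not remaining:
            return []
        for name in remaining:
            try:
                os.unlink(os.path.join(path, name))
            except OSError:
                pass  # entry could not be removed; retry on the next pass
        time.sleep(delay)  # optional pause before looking again
    return os.listdir(path)
```

A caller can report the leftover names and exit rather than looping at
100% CPU when an entry can never be removed.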
Can anyone suggest a fix, please?
Thanks
Roderick
PS: I've checked the mailing list archives a bit and the problem looks a
little like this one from 2003:
http://www.lam-mpi.org/MailArchives/lam/msg05793.php
but we are running a version well past the fixes for that issue.