On Dec 8, 2004, at 5:45 AM, Roderick Johnstone wrote:
> We've been using LAM for a little over a year.
>
> I've just updated the nodes we run lam on to Fedora Core 2 and lam to
> 7.1.1, and we are now seeing a problem which was not present before.
> We run these diskless nodes with an NFS-mounted root and /tmp. To
> provoke the problem:
>
> 1) lamboot a universe with say 4 nodes
> 2) lamhalt
> 3) Each node in the lam universe now has tkill running at 100% cpu
>
> I can run MPI jobs between 1) and 2) fine.
> After 2, the lamd seems to be shut down fine.
>
> I've attached an strace to one of the tkill processes and it's in a
> tight loop trying to unlink e.g. .nfs006a081000000345 in the
> /tmp/lam-rmj_at_blah directory.
Hi -
I think we found the problem - tkill was inheriting that file
descriptor from the lamd that started it. I've added a fix to SVN.
Can you try our nightly tarball and see if the problem persists?
http://www.lam-mpi.org/svn/
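
For anyone curious about the mechanism: on NFS, unlinking a file that some process still has open makes the client rename it to a .nfsXXXX file rather than remove it, and if the process doing the unlink is the one holding the descriptor, the file never goes away. The general way to avoid handing a descriptor to a child program is to mark it close-on-exec before the exec. The following is only a rough sketch of that technique, not the actual change that went into SVN; the helper names set_cloexec and is_cloexec are made up for illustration.

```c
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper (not LAM code): mark fd close-on-exec so that a
 * program started with exec() does not inherit it.
 * Returns 0 on success, -1 on error (errno is set by fcntl). */
int set_cloexec(int fd)
{
    int flags = fcntl(fd, F_GETFD);      /* read current descriptor flags */
    if (flags == -1)
        return -1;
    if (fcntl(fd, F_SETFD, flags | FD_CLOEXEC) == -1)
        return -1;
    return 0;
}

/* Hypothetical helper: report whether fd is close-on-exec.
 * Returns 1 if set, 0 if not set, -1 on error. */
int is_cloexec(int fd)
{
    int flags = fcntl(fd, F_GETFD);
    if (flags == -1)
        return -1;
    return (flags & FD_CLOEXEC) ? 1 : 0;
}
```

A daemon would call set_cloexec(fd) right after opening any file it does not want its children to see. The alternative is for the child to close unwanted descriptors itself before doing any work; either way the helper program no longer keeps the file open while trying to remove it.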
Thanks,
Brian
--
Brian Barrett
LAM/MPI developer and all around nice guy
Have a LAM/MPI day: http://www.lam-mpi.org/