LAM/MPI General User's Mailing List Archives

From: Brian Barrett (brbarret_at_[hidden])
Date: 2004-12-09 09:42:34


On Dec 8, 2004, at 5:45 AM, Roderick Johnstone wrote:

> We've been using LAM for a little over a year.
>
> I've just updated the nodes we run LAM on to Fedora Core 2 and LAM to
> 7.1.1, and we are now seeing a problem which was not present before.
> We run these diskless nodes with an NFS-mounted root and /tmp. To
> provoke the problem:
>
> 1) lamboot a universe with, say, 4 nodes
> 2) lamhalt
> 3) Each node in the lam universe now has tkill running at 100% cpu
>
> I can run MPI jobs between 1) and 2) fine.
> After 2, the lamd seems to be shut down fine.
>
> I've attached strace to one of the tkill processes and it's in a
> tight loop trying to unlink, e.g., .nfs006a081000000345 in the
> /tmp/lam-rmj_at_blah directory.

Hi -

I think we found the problem - tkill was inheriting that file
descriptor from the lamd that started it. I've added a fix to SVN.
Can you try our nightly tarball and see if the problem persists?

   http://www.lam-mpi.org/svn/
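
For readers unfamiliar with this class of bug, here is a minimal sketch of
descriptor inheritance in Python. It is not the LAM/MPI fix itself (LAM is
written in C, and the actual change is in SVN); the helper names
`child_sees_fd` and `demo` are hypothetical, and the sketch assumes a POSIX
system with Python 3.4 or later. It shows that a child process keeps the
parent's file open when the descriptor is inheritable, and does not when
the descriptor is marked close-on-exec. On NFS, a file that is unlinked
while some process still holds it open is typically renamed to a .nfsXXXX
file rather than removed, which is consistent with the strace output
described above.

```python
import os
import subprocess
import sys
import tempfile


def child_sees_fd(fd):
    """Spawn a child and report whether `fd` refers to the same file there."""
    # The child prints the inode behind the descriptor number it was given,
    # or exits non-zero if that descriptor is not open in the child.
    probe = "import os, sys; print(os.fstat(int(sys.argv[1])).st_ino)"
    result = subprocess.run(
        [sys.executable, "-c", probe, str(fd)],
        close_fds=False,       # do not force-close inheritable descriptors
        capture_output=True,
        text=True,
    )
    if result.returncode != 0:
        return False
    return int(result.stdout) == os.fstat(fd).st_ino


def demo():
    """Return (leaked, sealed): child visibility before and after the fix."""
    with tempfile.TemporaryFile() as f:
        fd = f.fileno()
        # Situation analogous to the bug: the descriptor survives exec,
        # so the child holds the file open as well.
        os.set_inheritable(fd, True)
        leaked = child_sees_fd(fd)
        # Analogous to a fix: mark the descriptor close-on-exec so the
        # child never receives it.
        os.set_inheritable(fd, False)
        sealed = child_sees_fd(fd)
    return leaked, sealed


if __name__ == "__main__":
    print(demo())
```

Run as a script on a POSIX system, this should print `(True, False)`: the
child sees the file only while the descriptor is inheritable. In C the
equivalent step would be setting FD_CLOEXEC with fcntl(), or closing the
descriptor in the child before exec.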

Thanks,

Brian

-- 
   Brian Barrett
   LAM/MPI developer and all around nice guy
   Have a LAM/MPI day: http://www.lam-mpi.org/