Brian Barrett wrote:
> On Dec 8, 2004, at 5:45 AM, Roderick Johnstone wrote:
>
>> We've been using LAM for a little over a year.
>>
>> I've just updated the nodes we run lam on to Fedora Core 2 and lam to
>> 7.1.1, and we are now seeing a problem which was not present before.
>> We run these diskless nodes with an nfs mounted root and /tmp. To
>> provoke the problem:
>>
>> 1) lamboot a universe with say 4 nodes
>> 2) lamhalt
>> 3) Each node in the lam universe now has tkill running at 100% cpu
>>
>> I can run MPI jobs between 1) and 2) fine.
>> After 2, the lamd seems to be shut down fine.
>>
>> I've attached strace to one of the tkill processes and it's in a
>> tight loop trying to unlink e.g. .nfs006a081000000345 in the
>> /tmp/lam-rmj_at_blah directory.
>
>
> Hi -
>
> I think we found the problem - tkill was inheriting that file descriptor
> from the lamd that started it. I've added a fix to SVN. Can you try
> our nightly tarball and see if the problem persists?
>
> http://www.lam-mpi.org/svn/
>
> Thanks,
>
> Brian
>
Brian,
I'm afraid it doesn't seem to be fixed. strace shows an endless stream of:
unlink(".nfs006a080c0000005d") = -1 EBUSY (Device or resource busy)
Again, this file has the same inode as lam-kernel-socket.
This was from the tarball lam-7.2b1r9913.tar.gz.
I double/triple checked I'm picking up this new build of lam-mpi.
Could you take another look, please?
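
For reference, here is a minimal sketch of the general mechanism discussed above: a descriptor stays open in a child process unless it is marked close-on-exec before the exec. It is written in Python for brevity; the C equivalent would be fcntl(fd, F_SETFD, FD_CLOEXEC). The helper names set_close_on_exec and is_close_on_exec are made up for illustration and do not come from the LAM source, and this is not a claim about what the SVN fix actually does.

```python
import fcntl
import os
import socket


def set_close_on_exec(fd):
    """Mark fd close-on-exec so a program started via exec does not inherit it."""
    flags = fcntl.fcntl(fd, fcntl.F_GETFD)
    fcntl.fcntl(fd, fcntl.F_SETFD, flags | fcntl.FD_CLOEXEC)


def is_close_on_exec(fd):
    """Return True if fd currently has the close-on-exec flag set."""
    return bool(fcntl.fcntl(fd, fcntl.F_GETFD) & fcntl.FD_CLOEXEC)


if __name__ == "__main__":
    # Stand-in for a daemon's socket; nothing here touches a real LAM socket.
    sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    # Python creates descriptors non-inheritable by default, so make this one
    # inheritable first to mimic the default behaviour of a C program.
    os.set_inheritable(sock.fileno(), True)
    print("before:", is_close_on_exec(sock.fileno()))
    set_close_on_exec(sock.fileno())
    print("after:", is_close_on_exec(sock.fileno()))
    sock.close()
```

If any inherited descriptor keeps the socket alive, NFS would keep the renamed .nfs file busy, which would be consistent with the EBUSY above.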
Thanks
Roderick