LAM/MPI General User's Mailing List Archives

From: Roderick Johnstone (rmj_at_[hidden])
Date: 2004-12-09 11:07:38


Brian Barrett wrote:
> On Dec 8, 2004, at 5:45 AM, Roderick Johnstone wrote:
>
>> We've been using LAM for a little over a year.
>>
>> I've just updated the nodes we run lam on to Fedora Core 2 and lam to
>> 7.1.1, and we are now seeing a problem which was not present before.
>> We run these diskless nodes with an nfs mounted root and /tmp. To
>> provoke the problem:
>>
>> 1) lamboot a universe with say 4 nodes
>> 2) lamhalt
>> 3) Each node in the lam universe now has tkill running at 100% cpu
>>
>> I can run MPI jobs between 1) and 2) fine.
>> After 2, the lamd seems to be shut down fine.
>>
>> I've attached an strace to one of the tkill processes and it's in a
>> tight loop trying to unlink e.g. .nfs006a081000000345 in the
>> /tmp/lam-rmj_at_blah directory.
>
>
> Hi -
>
> I think we found the problem - tkill was inheriting that file descriptor
> from the lamd that started it. I've added a fix to SVN. Can you try
> our nightly tarball and see if the problem persists?
>
> http://www.lam-mpi.org/svn/
>
> Thanks,
>
> Brian
>
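(For anyone following the thread: the descriptor inheritance Brian describes can be illustrated with a minimal sketch. This is not LAM's code. It assumes a POSIX system and Python 3.4 or later, and the helper names child_sees_fd and demo_fd_inheritance are made up for illustration. It shows that a child process started by exec keeps a parent's open descriptor unless that descriptor is marked close-on-exec.)

```python
import os
import subprocess
import sys
import tempfile

# Tiny program run in the child: exit 0 if the given fd number is open, 1 if not.
PROBE = (
    "import os, sys\n"
    "try:\n"
    "    os.fstat(int(sys.argv[1]))\n"
    "except OSError:\n"
    "    sys.exit(1)\n"
    "sys.exit(0)\n"
)


def child_sees_fd(fd):
    """Hypothetical helper: start a child and report whether `fd` is open in it."""
    # close_fds=False, so only the close-on-exec flag decides what the child keeps.
    result = subprocess.run(
        [sys.executable, "-c", PROBE, str(fd)],
        close_fds=False,
    )
    return result.returncode == 0


def demo_fd_inheritance():
    """Return (seen_when_inheritable, seen_when_close_on_exec)."""
    with tempfile.TemporaryFile() as f:
        fd = f.fileno()
        os.set_inheritable(fd, True)    # clear close-on-exec: the descriptor leaks
        leaked = child_sees_fd(fd)
        os.set_inheritable(fd, False)   # set close-on-exec: the child does not get it
        after_fix = child_sees_fd(fd)
    return leaked, after_fix


if __name__ == "__main__":
    print(demo_fd_inheritance())
```

The first value reports whether the child could see the descriptor while it was inheritable, the second whether it could still see it once close-on-exec was set. A child that holds such a descriptor keeps the underlying file open even after the parent has exited.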
Brian

I'm afraid it doesn't seem to be fixed. strace gives me an endless stream of:

unlink(".nfs006a080c0000005d") = -1 EBUSY (Device or resource busy)

Again, this file has the same inode as lam-kernel-socket.
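(The inode comparison can be done along these lines. This is only a sketch: same_inode and demo_same_inode are illustrative names, and the demo uses a hard link in a scratch directory as a stand-in for the .nfs file and lam-kernel-socket, assuming a filesystem that supports hard links.)

```python
import os
import tempfile


def same_inode(path_a, path_b):
    """Return True if both paths refer to the same file (same device and inode)."""
    a = os.stat(path_a)
    b = os.stat(path_b)
    return (a.st_dev, a.st_ino) == (b.st_dev, b.st_ino)


def demo_same_inode():
    """Compare a file with a hard link to it, and with an unrelated file."""
    with tempfile.TemporaryDirectory() as d:
        original = os.path.join(d, "original")
        alias = os.path.join(d, "alias")
        other = os.path.join(d, "other")
        for path in (original, other):
            with open(path, "w") as f:
                f.write("x")
        os.link(original, alias)  # second name for the same inode
        return same_inode(original, alias), same_inode(original, other)


if __name__ == "__main__":
    print(demo_same_inode())
```

Comparing the device number as well as the inode number matters because inode numbers are only unique within one filesystem.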

This was from the tarball lam-7.2b1r9913.tar.gz.

I double- and triple-checked that I'm picking up this new build of lam-mpi.

Could you have another look, please?

Thanks

Roderick