LAM/MPI General User's Mailing List Archives

From: Roderick Johnstone (rmj_at_[hidden])
Date: 2004-12-11 11:56:39


All,

I think it would be fairer to say that Brian fixed it, and did so in a
very timely manner.

Many thanks for the quick response on this issue.

Roderick

On Sat, 11 Dec 2004, Jeff Squyres wrote:

> Brian and Roderick iterated about this off-list and have solved the problem.
>
> I've uploaded a new 7.1.2 beta with the fixes.
>
> http://www.lam-mpi.org/beta/
>
>
> On Dec 9, 2004, at 11:07 AM, Roderick Johnstone wrote:
>
>> Brian Barrett wrote:
>>> On Dec 8, 2004, at 5:45 AM, Roderick Johnstone wrote:
>>>> We've been using LAM for a little over a year.
>>>>
>>>> I've just updated the nodes we run LAM on to Fedora Core 2 and LAM to
>>>> 7.1.1, and we are now seeing a problem which was not present before. We
>>>> run these diskless nodes with an NFS-mounted root and /tmp. To provoke
>>>> the problem:
>>>>
>>>> 1) lamboot a universe with say 4 nodes
>>>> 2) lamhalt
>>>> 3) Each node in the lam universe now has tkill running at 100% cpu
>>>>
>>>> I can run MPI jobs between 1) and 2) fine.
>>>> After 2, the lamd seems to be shut down fine.
>>>>
>>>> I've attached an strace to one of the tkill processes and it's in a tight
>>>> loop trying to unlink e.g. .nfs006a081000000345 in the /tmp/lam-rmj_at_blah
>>>> directory.
>>> Hi -
>>> I think we found the problem - tkill was inheriting that file descriptor
>>> from the lamd that started it. I've added a fix to SVN. Can you try our
>>> nightly tarball and see if the problem persists?
>>> http://www.lam-mpi.org/svn/
>>> Thanks,
>>> Brian
>> Brian
>>
>> I'm afraid it doesn't seem to be fixed. strace gives me infinite numbers of:
>>
>> unlink(".nfs006a080c0000005d") = -1 EBUSY (Device or resource busy)
>>
>> Again, this file has the same inode as lam-kernel-socket.
>>
>> This was from the tarball lam-7.2b1r9913.tar.gz.
>>
>> I double/triple checked I'm picking up this new build of lam-mpi.
>>
>> Can you have another look, please?
>>
>> Thanks
>>
>> Roderick
>> _______________________________________________
>> This list is archived at http://www.lam-mpi.org/MailArchives/lam/
>>
>
> --
> {+} Jeff Squyres
> {+} jsquyres_at_[hidden]
> {+} http://www.lam-mpi.org/
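
A note on the diagnosis quoted above: Brian attributes the spinning tkill to a file descriptor inherited from the lamd that started it. On NFS, a file that is removed while still open is kept around under a .nfsXXXX name, and unlinking that name fails with EBUSY for as long as some process holds it open, which matches the strace output. The thread does not show the actual LAM fix, so the following is only a minimal sketch of the general mechanism, assuming a POSIX system and Python 3. The names CHILD, child_sees_fd and demo are hypothetical and are not part of LAM. It shows a child process inheriting a descriptor, and the same descriptor being withheld once close-on-exec is set; the child confirms identity by comparing device and inode numbers, much as Roderick compared inodes.

```python
import os
import subprocess
import sys
import tempfile

# Hypothetical child program: report whether descriptor argv[1] is open in
# this process and refers to the file with device argv[2] and inode argv[3].
CHILD = """
import os, sys
fd, dev, ino = (int(a) for a in sys.argv[1:4])
try:
    st = os.fstat(fd)
    print("open" if (st.st_dev, st.st_ino) == (dev, ino) else "closed")
except OSError:
    print("closed")
"""


def child_sees_fd(fd):
    """Spawn a child and return True if it inherited descriptor fd."""
    st = os.fstat(fd)
    result = subprocess.run(
        [sys.executable, "-c", CHILD, str(fd), str(st.st_dev), str(st.st_ino)],
        close_fds=False,  # let descriptors follow their inheritable flag
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip() == "open"


def demo():
    """Return (leaked, fixed): inheritance without and with close-on-exec."""
    with tempfile.TemporaryFile() as f:
        fd = f.fileno()
        os.set_inheritable(fd, True)   # clears FD_CLOEXEC: child gets the fd
        leaked = child_sees_fd(fd)
        os.set_inheritable(fd, False)  # sets FD_CLOEXEC: fd closed on exec
        fixed = child_sees_fd(fd)
    return leaked, fixed
```

Calling demo() on a POSIX system should report that the descriptor leaks in the first case and not in the second. In a C daemon the equivalent would be setting FD_CLOEXEC with fcntl, or closing the descriptor in the child before exec; whether LAM's fix took either form is not stated in this thread.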