LAM/MPI General User's Mailing List Archives

From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2004-12-11 07:46:49


Brian and Roderick worked through this off-list and have solved the
problem.

I've uploaded a new 7.1.2 beta with the fixes.

        http://www.lam-mpi.org/beta/

On Dec 9, 2004, at 11:07 AM, Roderick Johnstone wrote:

> Brian Barrett wrote:
>> On Dec 8, 2004, at 5:45 AM, Roderick Johnstone wrote:
>>> We've been using LAM for a little over a year.
>>>
>>> I've just updated the nodes we run lam on to Fedora Core 2 and lam
>>> to 7.1.1, and we are now seeing a problem which was not present
>>> before. We run these diskless nodes with an NFS-mounted root and
>>> /tmp. To provoke the problem:
>>>
>>> 1) lamboot a universe with say 4 nodes
>>> 2) lamhalt
>>> 3) Each node in the lam universe now has tkill running at 100% cpu
>>>
>>> I can run MPI jobs between 1) and 2) fine.
>>> After 2, the lamd seems to be shut down fine.
>>>
>>> I've attached an strace to one of the tkill processes and it's in a
>>> tight loop trying to unlink e.g. .nfs006a081000000345 in the
>>> /tmp/lam-rmj_at_blah directory.
>> Hi -
>> I think we found the problem - tkill was inheriting that file
>> descriptor from the lamd that started it. I've added a fix to SVN.
>> Can you try our nightly tarball and see if the problem persists?
>> http://www.lam-mpi.org/svn/
>> Thanks,
>> Brian
> Brian
>
> I'm afraid it doesn't seem to be fixed. strace gives me an endless
> stream of:
>
> unlink(".nfs006a080c0000005d") = -1 EBUSY (Device or resource busy)
>
> Again, this file has the same inode as lam-kernel-socket.
>
> This was from the tarball lam-7.2b1r9913.tar.gz.
>
> I double/triple checked I'm picking up this new build of lam-mpi.
>
> Can you have another look, please?
>
> Thanks
>
> Roderick
> _______________________________________________
> This list is archived at http://www.lam-mpi.org/MailArchives/lam/
>
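For readers following the thread: Brian's diagnosis was that tkill inherited an open descriptor from the lamd that started it. On an NFS client, a file that is unlinked while some process still holds it open is typically renamed to a .nfsXXXX name, and removing that name can fail with EBUSY until the last holder closes it, which is consistent with the strace output quoted above. The sketch below is not the LAM fix (the thread does not show it); it only illustrates, in Python on a POSIX system, how the close-on-exec flag decides whether a spawned child ends up holding a parent's descriptor. The names `child_sees_fd` and `demo` are hypothetical, and a temporary file stands in for lam-kernel-socket.

```python
import fcntl
import os
import subprocess
import sys
import tempfile

# Child program: report whether the descriptor number given in argv is open.
CHILD = (
    "import os, sys\n"
    "try:\n"
    "    os.fstat(int(sys.argv[1]))\n"
    "    print('open')\n"
    "except OSError:\n"
    "    print('closed')\n"
)


def child_sees_fd(fd, inheritable):
    """Spawn a child process and return True if it inherited descriptor fd."""
    # On POSIX this clears (True) or sets (False) FD_CLOEXEC on the descriptor.
    os.set_inheritable(fd, inheritable)
    result = subprocess.run(
        [sys.executable, "-c", CHILD, str(fd)],
        close_fds=False,  # leave the decision to the close-on-exec flag
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip() == "open"


def demo():
    """Return (seen when inheritable, seen when close-on-exec is set)."""
    with tempfile.TemporaryFile() as f:
        # Duplicate onto a high descriptor number so the child cannot
        # coincidentally have the same number open for its own use.
        fd = fcntl.fcntl(f.fileno(), fcntl.F_DUPFD, 100)
        try:
            return child_sees_fd(fd, True), child_sees_fd(fd, False)
        finally:
            os.close(fd)


if __name__ == "__main__":
    print(demo())
```

In a C daemon the equivalent step would be setting FD_CLOEXEC with fcntl() on the socket, or closing the descriptor in the child before it execs. Either way the helper process no longer keeps the file open, so the unlink can complete.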

-- 
{+} Jeff Squyres
{+} jsquyres_at_[hidden]
{+} http://www.lam-mpi.org/