LAM/MPI General User's Mailing List Archives

From: Brian Barrett (brbarret_at_[hidden])
Date: 2004-12-17 10:31:22


On Dec 16, 2004, at 8:46 PM, sean dettrick wrote:

> >From: Robin Humble <rjh_at_[hidden]>
> >Reply-To: General LAM/MPI mailing list <lam_at_[hidden]>
> >To: General LAM/MPI mailing list <lam_at_[hidden]>
> >Subject: Re: LAM: LAM7.1.1, OSX, usysv/sysv failing on some nodes, working on others
> >Date: Thu, 16 Dec 2004 18:49:49 -0500
> >
> >On Thu, Dec 16, 2004 at 11:20:23PM +0000, sean dettrick wrote:
> > >I have LAM7.1.1 running on a cluster of dual G5 nodes on OSX.
> > >On some nodes LAM is working perfectly with usysv, sysv, and tcp RPI's, but
> > >there are 4 nodes where the usysv and sysv RPI's intermittently fail to
> >
> >sounds like your app is crashing (or lamd is being untidy) and leaving
> >shared memory areas lying around. once you run out of shared mem areas
> >then the app doesn't run.
> >on Linux you can use 'ipcs' to see these, and 'ipcrm' to delete them.
> >I presume there's something similar in OSX.
>
> Thanks, that seems to have cleared it up.
>
> There's no native ipcs/ipcrm, but I downloaded one as recommended by
> the Open Darwin Ports project (http://darwinports.opendarwin.org/).
> It works nicely. I can see now that the presence of stale
> semaphores/shared memory areas is preventing my usysv jobs from
> working.

OS X sets the default limits for System V shared memory and semaphores
basically right at the minimum requirements of LAM's sysv and usysv
devices. So if anything is in use, those devices fail to run.

There are some situations in which LAM will fail to clean up System V
resources properly (a crashing job is the usual cause). On Linux, this
isn't a problem because the SysV limits are high enough that you can
crash multiple times before you start to see failures. LAM still has a
handle to the leaked resources; it just doesn't know it should free
them. At shutdown, any resources that LAM itself leaked on your system
will be freed automatically. You can also use the "lamclean" command
to tell LAM to release all System V resources in use. Note that
lamclean will also kill all MPI processes running under that set of
daemons.

The other frequent users of System V resources are the Apache web
server and the open source database engines. You might want to make
sure that neither is running on the machines in question.

Finally, you can increase the amount of System V resources available.
There is a page about how to do this for various operating systems (OS
X included) on the PostgreSQL web site:

     http://www.postgresql.org/docs/7.4/interactive/kernel-resources.html
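
As a rough illustration of the arithmetic involved, here is a small
Python sketch that derives a consistent pair of shared memory
settings. It assumes, as I recall from that page, that OS X names the
parameters kern.sysv.shmmax (in bytes) and kern.sysv.shmall (in 4096
byte pages) and that shmmax should be a whole number of pages. Please
check the page for your OS version before applying anything. The
helper names are hypothetical, and semaphore limits are not covered
here.

```python
PAGE_SIZE = 4096  # assumed page size in bytes


def sysv_shm_settings(shmmax_bytes, page_size=PAGE_SIZE):
    """Return a shmmax/shmall pair where shmmax is rounded up to whole pages."""
    pages = -(-shmmax_bytes // page_size)  # ceiling division
    return {
        "kern.sysv.shmmax": pages * page_size,  # bytes
        "kern.sysv.shmall": pages,              # pages (assumed unit)
    }


def sysctl_lines(settings):
    """Format the settings as name=value lines, sorted by name."""
    return ["%s=%d" % (name, value) for name, value in sorted(settings.items())]


if __name__ == "__main__":
    # Example: ask for 16 MB of System V shared memory.
    for line in sysctl_lines(sysv_shm_settings(16 * 1024 * 1024)):
        print(line)
```

Where the resulting lines go (a startup script or a sysctl
configuration file) depends on the OS release, which is what the page
above describes.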

Hope this helps,

-- 
   Brian Barrett
   LAM/MPI developer and all around nice guy
   Have a LAM/MPI day: http://www.lam-mpi.org/