I apologize for the slow reply - we've been way behind while working on
releasing LAM/MPI 7.0.
There is an issue in the LAM daemon where the dlo_inet does not properly
back off sending when packets are not acked - this causes a UDP flood that
can overwhelm the lamd and cause it to hang (sometimes even hanging the
machine at the same time, if you have a really fast network). There is
also a subtle file descriptor leak in the LAM daemon for jobs that do not
exit cleanly.
It is possible there is something else going on - 99.999% of our users do
not run the job load you are talking about. But my thought is that a UDP
flood could cause the problems you are seeing.
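
To illustrate the kind of back-off described above, here is a minimal sketch in Python. It is not the actual lamd or dlo_inet code; the function names, the callbacks, and the timing values are all hypothetical. The idea is only that each unacknowledged send waits longer before the next retransmission, and the sender eventually gives up rather than flooding the receiver.

```python
from typing import Callable, Iterator


def backoff_delays(base: float = 0.01, factor: float = 2.0,
                   cap: float = 1.0, attempts: int = 8) -> Iterator[float]:
    """Yield the ack timeout for each attempt, growing geometrically up to cap.

    All defaults are illustrative assumptions, not values used by LAM.
    """
    delay = base
    for _ in range(attempts):
        yield min(delay, cap)
        delay *= factor


def send_with_backoff(send: Callable[[], None],
                      wait_for_ack: Callable[[float], bool],
                      **kwargs) -> bool:
    """Send a packet and retransmit with progressively longer waits.

    send() transmits one packet; wait_for_ack(timeout) returns True if an
    ack arrived within `timeout` seconds. Both are hypothetical callbacks
    supplied by the caller.
    """
    for timeout in backoff_delays(**kwargs):
        send()
        if wait_for_ack(timeout):
            return True
    # Give up instead of retransmitting forever at full rate.
    return False
```

A sender without the growing timeout retransmits at a constant (fast) rate for as long as acks are missing, which is the flood pattern described above.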
Brian
On Fri, 4 Apr 2003, Phil Ehrens wrote:
> Thank you Manish, your reply gave me warm fuzzies.
>
> Could you be more specific about the bugs? I am sorry
> to be pushy, but we have a release scheduled at about
> the same time as your 7.0 release, and we may need to
> make our own temporary patch (or provide patches for
> you) before the 7.0 release is available. We are
> constrained to only release code and packages that we
> have tested on a live system for at least 2 weeks.
>
> Any details would be very much appreciated. We have a
> large community of gravitational wave researchers that
> we are serving now and they are eager to get at several
> terabytes of data that were recently produced by our
> instruments.
>
> Phil
>
> Manish Chablani wrote:
> > Hi,
> >
> > I apologize for the delay in reply.
> >
> > We are aware of a couple of bugs (one being a file descriptor leak) in lamd
> > and are working to fix them. Both of these bugs should be fixed before the 7.0
> > release of LAM (scheduled in approx. 1 month's time). I have a strong
> > feeling that these bugs might be the reason for problems you are seeing.
> >
> > Sometimes, in these cases, using the '-x' option to lamboot can offer
> > improved reliability under high load - you might want to try that.
> >
> > hope this helps,
> > Manish Chablani
> > ------------------------------------------------------
> > Graduate Student, CS Department, Indiana University.
> > http://www.cs.indiana.edu/~mchablan
> >
> > LAM/MPI Developer
> > Make today a LAM/MPI day !!!
> > http://www.lam-mpi.org
> > ------------------------------------------------------
> >
> >
> > On Tue, 1 Apr 2003, Phil Ehrens wrote:
> >
> > > Hi All,
> > >
> > > I am observing occasional hung lamd's on my system.
> > >
> > > I *think* that they are caused by calling 'lamhalt'
> > > or 'state' while mpirun is still running... but I
> > > am not certain.
> > >
> > > There are typically many users on the cluster at the
> > > same time, and the lockups *seem* to have a higher
> > > frequency of occurrence when more than one user is
> > > running lamboot/lamhalt/state at a single point in
> > > time.
> > >
> > > This problem results in a lockup at a rate of about
> > > once in 30,000 jobs. About once every two weeks on
> > > my sandbox system.
> > >
> > > My questions are:
> > >
> > > 1.) has anyone else observed lam utilities inducing
> > > lockups of lamd's?
> > >
> > > 2.) am I asking for the impossible by expecting to
> > > be able to automate the execution of 100's of
> > > thousands of mpi jobs without having to manually
> > > interact with the system?
> > >
> > > I have managed to run more than 50,000 lam mpi jobs
> > > at a stretch without manual intervention, but I need
> > > to get another order of magnitude to satisfy the
> > > requirements on my system.
> > >
> > > Phil
> > > --
> > > Phil Ehrens <pehrens_at_[hidden]>| Fun stuff:
> > > The LIGO Laboratory, MS 18-34 | http://www.ralphmag.org
> > > California Institute of Technology | http://www.yellow5.com
> > > 1200 East California Blvd. | http://www.total.net/~fishnet/
> > > Pasadena, CA 91125 USA | http://slashdot.org
> > > Phone:(626)395-8518 Fax:(626)793-9744 | http://kame56.homepage.com
> > > _______________________________________________
> > > This list is archived at http://www.lam-mpi.org/MailArchives/lam/
> > >
>
>
--
Brian Barrett
LAM/MPI developer and all around nice guy
Have a LAM/MPI day: http://www.lam-mpi.org/