Hi, Jeff,
Just a follow-up on this lamd case.
We reconfigured our LAM 7.0 and installed it on our production systems.
It helps a lot: the hang rate dropped dramatically. However, the lamd
still crashes sometimes (when my MPI tasks receive a signal and exit or
are killed, the lamd sometimes crashes).
We now have a new problem with LAM. Our production group has decided to
use CentOS 4 with kernel 2.6.9 instead of Red Hat 9. Can LAM 7.0
compiled on Red Hat 9 (kernel 2.4) be run on CentOS 4 with kernel
2.6.9?
The test runs show that LAM can work for a couple of days, but then,
for some reason, although the lamd is up and running, lamnodes says the
current node is not in the list. It seems that the lamd accepts the
lamnodes command, but it has marked itself as invalid.
Any suggestions?
Regards,
Lily Li
On Wed, 2005-05-11 at 17:44, Lily Li wrote:
> Hi, Jeff,
>
> I am so glad to get such a quick response. We've decided to
> reconfigure our LAM 7.0 with a timeout of 1 second and will let you
> know if it solves our problem.
> (It may take weeks to get a response from production, though.)
>
> Thanks again.
>
> Lily Li
>
>
> On Wed, 2005-05-11 at 15:37, Jeff Squyres wrote:
>
> > On May 11, 2005, at 11:51 AM, Lily Li wrote:
> >
> > > We do plan to use LAM 7.1.1 for our next release. But the current
> > > production systems are using LAM 7.0, and it is not easy for us to
> > > switch quickly across many processing centers.
> >
> > Ok.
> >
> > > For the last week or so, we've been watching the behavior of lamd.
> > > It seems the fault-tolerance mode of lamboot makes things worse. I'd
> > > really appreciate it if you could give me more information about
> > > lamd so we can find the root of the problem. We badly need a
> > > workaround for now.
> >
> > Note that there was a critical bug fix in 7.0.3 for the lamd's
> > timeouts in -x mode -- we accidentally had the timeouts specified in
> > microseconds instead of seconds. This could lead
> > to flooding the network with UDP packets unnecessarily, and *could*
> > actually be what you are seeing...? (especially with large numbers of
> > nodes)
> >
> > It's a trivial fix: if you want to stay with 7.0, you can re-configure
> > and recompile LAM with the following configure parameter (there is
> > unfortunately no run-time way to change this behavior in the lamd):
> >
> > --with-lamd-ack=500000
> >
> > (the current default is 1, as in 1 microsecond -- a mistake in units;
> > it was supposed to be 1 second)
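> >
> > A minimal rebuild sketch (the source directory and install prefix
> > below are hypothetical -- adjust them to your own layout):
> >
> >   # re-run configure with the longer ack timeout, then rebuild and reinstall
> >   cd lam-7.0
> >   ./configure --prefix=/opt/lam-7.0 --with-lamd-ack=500000
> >   make all install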
> >
> > I'm guessing that increasing the fault timeout from 1 microsecond to
> > half a second will likely help things a lot, and may resolve your
> > problem. This would definitely account for "in -x mode, it's worse" --
> > this timeout is a problem in both fault and non-fault modes, but there
> > are more messages flowing in the fault mode, and therefore you are more
> > likely to violate the timeout (and therefore cause lots more messages
> > because of the too-short timeout, which will cause more UDP droppage...
> > it's a downward spiral from there).
> >
> > > (our lamboot command is: lamboot -d -x host_list_file, and we are
> > > using the TCP SSI RPI)
> > >
> > > 1. For every lamnodes command, does the lamd do an up-to-date
> > > sync/check on all nodes in LAM, or does it just report the current
> > > routing table in memory?
> >
> > It just queries the local lamd and reports that lamd's current routing
> > table.
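> >
> > If you want to compare what each lamd currently believes, one quick,
> > purely illustrative check is to run lamnodes on every host in your
> > boot schema and eyeball the differences -- this assumes ssh access and
> > that LAM is in the PATH on each host:
> >
> >   # ask each host's local lamd for its view of the universe
> >   for h in $(cat host_list_file); do
> >       echo "== $h =="
> >       ssh $h lamnodes
> >   done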
> >
> > > So if a lamd died after the last heartbeat check from the head node,
> > > will lamnodes on the head node show this node as invalid?
> >
> > Correct.
> >
> > > 2. When the lamd does heartbeat checks on other nodes, will it still
> > > be able to respond to client requests such as lamnodes? I mean: will
> > > the lamd be in blocking mode?
> >
> > Yes and no. The lamd is a single-threaded beast, but it has a definite
> > event loop and never blocks.
> >
> > > 3. When a lamd does find an invalid node, will it share the
> > > information with other live nodes, so that the result from lamnodes
> > > on all live nodes is the same?
> >
> > It's been a while, but I'm pretty sure that each lamd will just find
> > out on its own that a given lamd is down.
> >
> > > 4. We do find that when mpirun starts a job and the tasks in this
> > > job crash due to an I/O problem before they call MPI_Init(), the
> > > lamd on the mpirun node will mark lots of other nodes as invalid.
> > > Does this have anything to do with the fault-tolerance mode?
> > > Our mpirun uses options like "-f -w -ssi rpi tcp"; sometimes we
> > > need to export $LD_PRELOAD.
> >
> > This *shouldn't* be related. But keep in mind that an mpirun will
> > cause at least one (UDP) message to each lamd, and the timeout problems
> > may come into play here.
> >
> > > 5. From the documentation, the default mpirun uses the "-c2c"
> > > option. So after the job starts, are there still connections between
> > > the task and the lamd on the same node? If for some reason the lamd
> > > crashes on some nodes in a job, will the job continue to run to
> > > completion?
> >
> > -c2c is obsolete -- it is now simply the distinction between the lamd
> > RPI and non-lamd RPIs. So the default is now that MPI jobs will use
> > the "best" RPI module (e.g., gm, if it's available).
> >
> > But yes, all individual MPI processes will maintain a connection to
> > their local lamd from MPI_INIT to MPI_FINALIZE. If the lamd dies,
> > that connection is lost and the job will not be able to run to
> > completion.
> >
> > > 6. The last thing we can do is to use the LAM_MPI_SESSION_SUFFIX
> > > environment variable to have each job boot its own LAM.
> > > Are there any drawbacks to this option regarding performance or
> > > stability?
> >
> > If you are running into this timeout problem (and I really think you
> > are), having multiple universes will definitely increase the UDP
> > traffic on your network (particularly in -x mode), and potentially lead
> > to more dropped packets / network flooding.
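> >
> > For reference, a rough sketch of the per-job-universe approach you
> > describe (the suffix, host list, and program name are placeholders):
> >
> >   # give this job its own LAM universe, run, then tear it down
> >   export LAM_MPI_SESSION_SUFFIX=job_$$
> >   lamboot -x host_list_file
> >   # "C" means one process per available CPU in LAM's mpirun syntax
> >   mpirun -ssi rpi tcp C ./my_app
> >   lamhalt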
> >
> > > Hopefully these are not hard questions for you, and I'd like to say
> > > that you guys are great at supporting LAM.
> > > Will the new Open MPI have a similar level of support to LAM?
> >
> > If you mean "answering questions on a public mailing list", then yes.
> > :-)
> >
> > The lists already exist (see http://www.open-mpi.org), but since we
> > haven't released the software yet, there's very little traffic. ;-)