Hi, Jeff,
I am so glad to get such a quick response. We've decided to reconfigure our
LAM 7.0 with a timeout of 1 second and will let you know if it solves our
problem. (It may take weeks to get a response from production, though.)
Thanks again.
Lily Li
On Wed, 2005-05-11 at 15:37, Jeff Squyres wrote:
> On May 11, 2005, at 11:51 AM, Lily Li wrote:
>
> > We do plan to use LAM 7.1.1 for our next release. But our current
> > production systems are using LAM 7.0, and it is not easy for us to
> > switch quickly across many processing centers.
>
> Ok.
>
> > For the last week or so, we've been watching the behavior of lamd. It
> > seems the fault tolerance mode of lamboot makes things worse. I'd
> > really appreciate it if you could give me more information about lamd
> > to find the root of the problem. We badly need some workaround
> > solutions for now.
>
> Note that there was a critical bug fix in the lamd in 7.0.3 for the
> timeouts in -x mode -- we accidentally had the timeouts specified in
> microseconds instead of seconds. This could lead to flooding the
> network with UDP packets unnecessarily, and *could* actually be what
> you are seeing...? (especially with large numbers of nodes)
>
> It's a trivial fix: if you want to stay with 7.0, you can re-configure
> and recompile LAM with the following configure parameter (there is
> unfortunately no run-time way to change this behavior in the lamd):
>
> --with-lamd-ack=500000
>
> (the current default is 1, as in 1 microsecond -- a mistake in units;
> it was supposed to be 1 second)
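>
> For example, a minimal rebuild sketch (the source directory and
> install prefix here are placeholders -- substitute your own):
>
>   cd lam-7.0.x
>   ./configure --with-lamd-ack=500000 --prefix=/opt/lam
>   make
>   make install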
>
> I'm guessing that increasing the fault timeout from 1 microsecond to
> half a second will likely help things a lot, and may resolve your
> problem. This would definitely account for "in -x mode, it's worse" --
> this timeout is a problem in both fault and non-fault modes, but there
> are more messages flowing in the fault mode, and therefore you are more
> likely to violate the timeout (and therefore cause lots more messages
> because of the too-short timeout, which will cause more UDP droppage...
> it's a downward spiral from there).
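>
> (Back-of-envelope: a 1 microsecond timeout permits up to ~1,000,000
> retransmissions per second for each peer a lamd is waiting on; at
> 500,000 microseconds, it's at most 2 per second -- about a 500,000x
> reduction in worst-case UDP traffic.)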
>
> > (Our lamboot command is "lamboot -d -x host_list_file", and we are
> > using the tcp RPI SSI module.)
> >
> > 1. For every lamnodes command, does lamd do an up-to-date sync/check
> > on all nodes in LAM, or just report the current routing table in
> > memory?
>
> It just queries the local lamd and reports that lamd's current routing
> table.
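>
> For example, output along these lines (hostnames are placeholders, and
> the exact format may vary by version) comes straight from the local
> lamd's in-memory table, with no remote queries:
>
>   $ lamnodes
>   n0      head.example.com:1:origin,this_node
>   n1      node1.example.com:1: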
>
> > So if a lamd died after the last heart beat check from the head
> > node, will lamnodes on the head node still show this node as valid
> > until the next heartbeat?
>
> Correct.
>
> > 2. When lamd does the heartbeat checking on other nodes, will it
> > still be able to respond to client requests such as lamnodes? I
> > mean: is the lamd in blocking mode?
>
> Yes and no. The lamd is a single-threaded beast, but it has a definite
> event loop and never blocks, so it can service requests like lamnodes
> between heartbeat steps.
>
> > 3. When a lamd does find an invalid node, will it share the
> > information with other live nodes, so that the results from
> > lamnodes on all live nodes are the same?
>
> It's been a while, but I'm pretty sure that each lamd will just find
> out on its own that a given lamd is down.
>
> > 4. We do find that when mpirun starts a job, and the tasks in this
> > job crash due to an I/O problem before they call MPI_Init(), the
> > lamd on the mpirun node will mark lots of other nodes as invalid.
> > Does this have anything to do with the fault tolerance mode?
> > Our mpirun uses options like "-f -w -ssi rpi tcp"; sometimes we
> > need to export $LD_PRELOAD.
>
> This *shouldn't* be related. But keep in mind that an mpirun will
> cause at least one (UDP) message to each lamd, and the timeout problems
> may come into play here.
>
> > 5. From the documentation, the default mpirun uses the "-c2c"
> > option. So after the job starts, are there still connections
> > between the tasks and the lamd on the same node? If for some
> > reason the lamd crashes on some nodes in a job, will the job
> > continue to run to completion?
>
> -c2c is obsolete -- it is now simply the distinction between the lamd
> RPI and non-lamd RPIs. So the default is now that MPI jobs will use
> the "best" RPI module (e.g., gm, if it's available).
>
> But yes, all individual MPI processes will maintain a connection to
> their local lamd from MPI_INIT to MPI_FINALIZE. If the lamd dies, the
> MPI processes on that node lose that connection, and the job should
> not be expected to run to completion.
>
> > 6. The last thing we can do is to use the LAM_MPI_SESSION_SUFFIX
> > environment variable to have each job boot its own LAM.
> > Are there any drawbacks to this option regarding performance or
> > stability?
>
> If you are running into this timeout problem (and I really think you
> are), having multiple universes will definitely increase the UDP
> traffic on your network (particularly in -x mode), and potentially lead
> to more dropped packets / network flooding.
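>
> If you do go that route, a per-job sketch might look like this (the
> suffix value and application name are placeholders):
>
>   export LAM_MPI_SESSION_SUFFIX=job1234   # name this job's universe
>   lamboot -x host_list_file               # boot a private LAM
>   mpirun -ssi rpi tcp C ./my_app
>   lamhalt                                 # halt only this universe
>
> All LAM commands run with the same suffix in the environment operate
> on the same private universe.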
>
> > Hopefully, these are not hard questions for you, and I'd like to say
> > that you guys are great at supporting LAM.
> > Will the new Open MPI have a similar support level to LAM?
>
> If you mean "answering questions on a public mailing list", then yes.
> :-)
>
> The lists already exist (see http://www.open-mpi.org), but since we
> haven't released the software yet, there's very little traffic. ;-)