On May 11, 2005, at 11:51 AM, Lily Li wrote:
> We do plan to use LAM 7.1.1 for our next release. But the current
> production are using
> LAM 7.0, and it is not easy for us to have a quick switch for many
> processing centers.
Ok.
> For the last week or so, we've been watching the behavior of lamd. It
> seems the fault tolerance
> mode of lamboot make things worse. I'd really appreciate if you could
> give me more information
> regarding lamd to find the root of the problem. We need some
> get-around solutions badly for now.
Note that there was a critical bug fix in the lamd in 7.0.3 for the
timeouts in the -x mode for the lamd -- we accidentally had the
timeouts specified in microseconds instead of seconds. This could lead
to flooding the network with UDP packets unnecessarily, and *could*
actually be what you are seeing (especially with large numbers of
nodes).
It's a trivial fix: if you want to stay with 7.0, you can re-configure
and recompile LAM with the following configure parameter (there is
unfortunately no run-time way to change this behavior in the lamd):
--with-lamd-ack=500000
(the current default is 1, as in 1 microsecond -- a mistake in units;
it was supposed to be 1 second)
I'm guessing that increasing the fault timeout from 1 microsecond to
half a second will help things a lot, and may resolve your problem.
This would definitely account for "in -x mode, it's worse" -- this
timeout is a problem in both fault and non-fault modes, but there are
more messages flowing in the fault mode, so you are more likely to
violate the timeout. Each violation causes more messages to be sent
because of the too-short timeout, which causes more UDP droppage...
it's a downward spiral from there.
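To put rough numbers on the timeout, here is a small illustrative
calculation in Python. It assumes the --with-lamd-ack value is read as
microseconds (as described above) and that a timer expiry happens once
per timeout interval while waiting for an ack. The 200-microsecond
round trip is an assumed figure for a LAN, not a measurement, and the
function names are hypothetical, not part of LAM.

```python
# Illustrative arithmetic only; none of this is LAM code.
US_PER_SECOND = 1_000_000

def ack_timeout_seconds(value_us):
    """Convert a --with-lamd-ack value (assumed microseconds) to seconds."""
    return value_us / US_PER_SECOND

def expiries_before_ack(timeout_us, round_trip_us):
    """Count how many times the ack timer expires before an ack can arrive."""
    return round_trip_us // timeout_us

ASSUMED_ROUND_TRIP_US = 200  # assumption: typical LAN round trip

print(ack_timeout_seconds(500000))                         # 0.5
print(expiries_before_ack(1, ASSUMED_ROUND_TRIP_US))       # 200
print(expiries_before_ack(500000, ASSUMED_ROUND_TRIP_US))  # 0
```

Under these assumptions, a 1-microsecond timer expires hundreds of
times before any ack could physically return, while a half-second
timer does not expire at all on a healthy network.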
> ( our lamboot command is: lamboot -d -x host_list_file, and using
> TCP ssi rpi)
>
> 1. For every lamnodes command, does lamd do a up-to-date synch/check
> on all nodes in LAM,
> or just report the current routing table in memory ?
It just queries the local lamd and reports that lamd's current routing
table.
> So if a lamd died after the last heart beat checking from the head
> node, will the lamnodes on the head node
> show this node as invalid ?
Correct.
> 2. When lamd does the heart beat checking on other nodes, will it
> still be able to respond to client requests
> such as lamnodes ? I mean: will the lamd is in blocking mode ?
Yes and no. The lamd is a single threaded beast, but it has a definite
event loop and never blocks.
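As a rough picture of what "single threaded, but never blocks" means,
here is a toy event loop in Python. It is only a sketch of the general
pattern, not the lamd's actual implementation: the tick-based clock,
the timer list, and the request names are all made up for
illustration.

```python
import heapq
from collections import deque

def run_event_loop(timers, requests, ticks):
    """Single-threaded loop: each tick fires any due heartbeat timers,
    then serves at most one queued client request. It never waits on
    either, so heartbeats and requests interleave."""
    log = []
    heapq.heapify(timers)        # (due_tick, node_name) pairs
    pending = deque(requests)    # client requests, e.g. "lamnodes"
    for now in range(ticks):
        # Fire every heartbeat timer that is due at this tick.
        while timers and timers[0][0] <= now:
            _, name = heapq.heappop(timers)
            log.append((now, "heartbeat:" + name))
        # Serve one client request, if any is waiting.
        if pending:
            log.append((now, "request:" + pending.popleft()))
    return log

log = run_event_loop([(1, "n1"), (0, "n0")], ["lamnodes"], ticks=3)
print(log)
```

In this sketch the "lamnodes" request is answered in the same tick as
the first heartbeat check, rather than waiting for all heartbeats to
finish.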
> 3. When a lamd does find a invalid node, will it share the
> information with other live nodes ? so that the result from
> lamnodes on all live nodes are the same ?
It's been a while, but I'm pretty sure that each lamd will just find
out on its own that a given lamd is down.
> 4. We do find that when mpirun starts a job, and the tasks in this
> job crash due to I/O problem before it calls MPI_Init(), the lamd
> on the mpirun node will mark lots of other nodes as invalid. Does
> this have anything to do with the fault tolerance mode.
> Our mpirun uses options like "-f -w -ssi rpi tcp", sometime, we
> need to export $LD_PRELOAD.
This *shouldn't* be related. But keep in mind that an mpirun will
cause at least one (UDP) message to each lamd, and the timeout problems
may come into play here.
> 5. From the documentation, the default mpirun uses "-c2c" option. So
> after the job starts, is there still connections
> between the task and the lamd on the same node ? If for some
> reason lamd crashes on some nodes in a job,
> will the job continue to run to finish ?
-c2c is obsolete -- it is now simply the distinction between the lamd
RPI and non-lamd RPIs. So the default is now that MPI jobs will use
the "best" RPI module (e.g., gm, if it's available).
But yes, all individual MPI processes will maintain a connection to
their local lamd from MPI_INIT to MPI_FINALIZE. If the lamd dies,
> 6. The last thing we can do is to use the LAM_MPI_SESSION_SUFFIX env
> to have each job boot its own LAM.
> Is there any drawbacks with this option regarding performance,
> stability ?
If you are running into this timeout problem (and I really think you
are), having multiple universes will definitely increase the UDP
traffic on your network (particularly in -x mode), and potentially lead
to more dropped packets / network flooding.
> Hopefully, these are not hard questions for you, and I'd like to say
> that you guys are great in supporting LAM.
> Will the new Open MPI have the similar support level as LAM ?
If you mean "answering questions on a public mailing list", then yes.
:-)
The lists already exist (see http://www.open-mpi.org), but since we
haven't released the software yet, there's very little traffic. ;-)
--
{+} Jeff Squyres
{+} jsquyres_at_[hidden]
{+} http://www.lam-mpi.org/