On Oct 22, 2004, at 6:53 PM, Warner Yuen wrote:
> I'm currently using lam-7.2b1r9913 because our customer needs the
> Myrinet support. However, when trying to lamboot 128 machines it takes
> about 1/2 hour to complete. Is this normal or is it an issue with the
> particular SVN version that I'm using?
Hmm. This is somewhat odd -- that's about 14 seconds per node, and a
bit slower than I would expect. The speed is a function of a few
things:
1. ssh itself is just slow
2. LAM actually does *2* ssh's out to each node -- the first one is to
determine which shell is being used on the remote node (so that we can
make sure that .profile gets run if necessary)
3. LAM's rsh booting process is a blocking linear process -- it
launches one lamd, waits for it to call back to lamboot, and then
moves on to the next node.
We can't really do anything about #1. #2 can be reduced to a single
ssh, however, if you use the "-b" option to lamboot (or set the SSI
parameter "boot_rsh_fast" to 1, either on the lamboot command line or
via environment variable). In this case, LAM will assume that you have
the same shell on the remote machine as you do on the local machine.
Since that drops one of the two ssh's per node, it should pretty much
cut your lamboot time in half.
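A rough back-of-the-envelope sketch of the numbers above, assuming the nodes are booted strictly one after another and that both ssh's to a node cost about the same (the function name and the equal-cost assumption are illustrative, not measured):

```python
def estimate_lamboot_seconds(nodes, seconds_per_ssh, ssh_per_node=2):
    """Estimate total lamboot time for a blocking, one-node-at-a-time boot."""
    # Linear boot: all ssh calls for one node finish before the next node starts.
    return nodes * ssh_per_node * seconds_per_ssh


# Reported case: 128 nodes in about half an hour (1800 seconds).
observed_per_node = 1800 / 128      # roughly 14 seconds per node
per_ssh = observed_per_node / 2     # assumption: two equally slow ssh's per node

default_boot = estimate_lamboot_seconds(128, per_ssh)                   # two ssh's per node
fast_boot = estimate_lamboot_seconds(128, per_ssh, ssh_per_node=1)      # one ssh, as with "-b"

print(f"per node: {observed_per_node:.1f} s")
print(f"default:  {default_boot / 60:.0f} min")
print(f"with -b:  {fast_boot / 60:.0f} min")
```

If the shell-detection ssh is actually cheaper than the one that starts the lamd, the savings from "-b" will be smaller than this estimate.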
> Currently LAM is installed on an NFS mount. I tested about 16
> machines with LAM mounted locally but it didn't really make boot
> performance any faster.
LAM's lamboot time in the rsh case should be pretty close to linear in
the number of nodes.
> Lastly, I like LAM cause I don't seem to hit the ssh limits like I do
> with MPICH-GM, but is there a limit on number of processes that can be
> launched?
In principle, you're only limited by the resources that LAM will
consume. With the gm RPI, LAM will open a single port in each MPI
process (i.e., a single endpoint). For TCP, each MPI process will open
a TCP socket for each other MPI process -- so the number of sockets per
process grows with the size of the job.
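To put numbers on that, here is a small sketch that counts only the endpoints described above; it ignores any other descriptors a process may hold, and the function name is made up for illustration:

```python
def endpoints_per_process(num_procs, rpi):
    """Communication endpoints opened by one MPI process, per the description above."""
    if rpi == "gm":
        return 1                # a single Myrinet port, regardless of job size
    if rpi == "tcp":
        return num_procs - 1    # one socket to every other MPI process
    raise ValueError(f"unknown RPI: {rpi}")


# Compare how the two RPIs scale with the number of MPI processes.
for n in (16, 128, 1024):
    gm = endpoints_per_process(n, "gm")
    tcp = endpoints_per_process(n, "tcp")
    print(f"{n} processes: gm={gm} endpoint, tcp={tcp} sockets per process")
```

With TCP, the per-process file descriptor limit is the resource most likely to run out first as the job gets larger.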
--
{+} Jeff Squyres
{+} jsquyres_at_[hidden]
{+} http://www.lam-mpi.org/