On Oct 22, 2004, at 6:53 PM, Warner Yuen wrote:
> I'm currently using lam-7.2b1r9913 because our customer needs the
> Myrinet support. However, when trying to lamboot 128 machines it takes
> about 1/2 hour to complete. Is this normal or is it an issue with the
> particular SVN version that I'm using?
This is probably normal, but we can also probably do better (there's
hope -- keep reading). The speed is a function of a few things:
1. ssh itself is just slow
2. LAM actually does *2* ssh's out to each node -- the first one is to
determine which shell is being used on the remote node (so that we can
be sure to run .profile if necessary)
3. LAM's rsh booting process is a blocking linear process -- it
launches one lamd, waits for it to call back to lamboot, and then
continues on to the next node.
We can't really do anything about #1. However, #2 can be reduced to a
single ssh if you use the "-b" option to lamboot (or set the SSI
parameter "boot_rsh_fast" to 1 -- either on the lamboot command line or
via environment variable). In this case, LAM will assume that you have
the same shell on the remote machine as you do on the local machine.
Since that drops one of the two ssh's per node, it should pretty much
cut your lamboot time in half.
So at least that will bring you down to 15 minutes. :-\
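For what it's worth, here is a rough back-of-the-envelope sketch of that
arithmetic in Python. It is only a model, not anything LAM itself does:
the function name is made up for illustration, and it assumes that all
of the observed half hour is ssh latency, spread evenly over two ssh's
per node on 128 nodes.

```python
def estimated_boot_seconds(nodes, ssh_per_node, seconds_per_ssh):
    """Blocking linear boot: nodes are started strictly one after another,
    so the total time is just the sum of every ssh that gets issued."""
    return nodes * ssh_per_node * seconds_per_ssh


NODES = 128
OBSERVED_SECONDS = 30 * 60  # "about 1/2 hour", as reported above

# Assumption: the whole observed time is ssh cost, 2 ssh's per node.
seconds_per_ssh = OBSERVED_SECONDS / (NODES * 2)

default_boot = estimated_boot_seconds(NODES, 2, seconds_per_ssh)
fast_boot = estimated_boot_seconds(NODES, 1, seconds_per_ssh)  # one ssh per node

print(f"per-ssh cost:    {seconds_per_ssh:.1f} s")
print(f"default lamboot: {default_boot / 60:.0f} min")
print(f"single-ssh boot: {fast_boot / 60:.0f} min")
```

Under those assumptions each ssh costs roughly 7 seconds, which is why
halving the number of ssh's is the only big win available without
changing the linear boot itself.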
You might want to use the "-d" option to lamboot (which will output
copious status messages during the lamboot) and verify that it's
progressing steadily through the nodes and is just being linear/slow.
> Currently LAM is installed on an NFS mount. I tested about 16
> machines with LAM mounted locally but it didn't really make boot
> performance any faster.
LAM's lamboot time in the rsh case should be pretty close to linear in
the number of nodes.
--
{+} Jeff Squyres
{+} jsquyres_at_[hidden]
{+} http://www.lam-mpi.org/