LAM/MPI General User's Mailing List Archives

From: Brian Barrett (brbarret_at_[hidden])
Date: 2006-04-22 13:13:15


On Apr 21, 2006, at 1:12 PM, Christopher Porter wrote:

> I'm working on an application integration between LAM-MPI, LSF, and
> an EDA application called "Encounter" from Cadence.
>
> The customer for this integration insists on preventing his users
> from accessing computational hosts (in an effort to try and force
> the users into using LSF so that usage metrics pulled from LSF are
> accurate). This of course causes headaches with the required
> lamboot process. I have been able to configure a build of LAM MPI
> v7.1.2 which I can redirect using the environment variable
> LAMRSH="lsgrun -m" and a few of the "-ssi" boot parameters:
>
> lamboot -v -ssi boot_rsh_no_n 1 -ssi boot_rsh_fast 1 -ssi
> boot_rsh_no_profile 1 ~/lam.schema
>
> While perusing the documentation (the v7.1.2 installation guide
> specifically) I ran across an intriguing note about "promiscuous
> mode". Pages 31,32 of the guide refer to building the library with
> the flag "--with-boot-promisc" for the cases where LAM can't know
> which hosts will be connecting during the boot process. I've
> searched the rest of the install guide, the user guide, and the
> archives of this mailing list to try and learn more about what
> "promiscuous mode" really does and if it would be helpful to me in
> my situation.
>
> I built a version of the library with this switch (the config.log
> file included) enabled to see if I could discern a difference. I
> tried lambooting daemons as one user and running an mpi application
> as another user (in hopes promiscuity would allow the daemon to
> connect to my mpirun process) but that fails the same as it does
> without the --with-boot-promisc enabled.
>
> So if anyone in the community has suggestions for:
> 1) Getting LAM booted (preferably v6.5.9 rather than 7.1.2 but
> we'll take what we can get) in a no-login environment

I think you're stuck with LAM/MPI 7.0 or later -- 6.5.9 is pretty
much rsh/ssh only. We didn't start looking at batch system support
until the 7.0 release.

> 2) Getting additional information about what promiscuous mode is
> and does

During the lamboot process, lamboot starts lam daemons on a set list
of hosts, then waits for the daemons to "call back" with their
contact information (hostname and a UDP port), then sends the full
contact information list back to all the daemons. By default, the
lamboot process will not accept contact information from any host not
in that initial startup list, which can cause problems in
environments where you might not know the proper IP of the nodes that
will be calling in until the daemons are actually up and running.
Promiscuous mode means that the lamboot process will accept the N
connections from lam daemons, regardless of host. So if the LSF
starter isn't going to know where the daemons end up until they are
actually running, promiscuous mode is what you want.
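
To make the difference concrete, here is a toy model of the callback
check in Python. It is only a sketch of the behavior described above,
not LAM's actual code: the function name, the (host, port) tuples, and
the error handling are all made up for illustration.

```python
def collect_callbacks(boot_hosts, callbacks, promiscuous=False):
    """Toy model of lamboot waiting for N daemons to call back.

    boot_hosts  -- hosts lamboot tried to start daemons on (N = its length)
    callbacks   -- iterable of (host, udp_port) pairs, in arrival order
    promiscuous -- if True, accept a callback from any host
    Returns the contact list that would be sent back to every daemon.
    """
    allowed = set(boot_hosts)
    needed = len(boot_hosts)
    contacts = []
    for host, udp_port in callbacks:
        # Default mode: ignore callers that were not in the boot list.
        if not promiscuous and host not in allowed:
            continue
        contacts.append((host, udp_port))
        if len(contacts) == needed:
            return contacts
    raise RuntimeError("only %d of %d daemons called back"
                       % (len(contacts), needed))
```

With a boot list of two names and a daemon that calls in from an
address the boot list never mentioned, the default mode skips that
caller and keeps waiting, while promiscuous=True counts it as one of
the N.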

Users are a different issue - it's always assumed that the same user
will start the lam universe and run jobs within the universe (indeed
- jobs started by lam commands will always run as the user that
executed lamboot). There isn't much of a way around this issue, as
it's assumed throughout the code base.

One thing that can frequently cause problems for batch schedulers with
our rsh/ssh starter is that we intentionally do some things to escape
the session ssh starts for us on the remote node. The exact sequence
is something like:

   1) lamboot forks / execs "ssh hostname hboot <options>"
   2) hboot finds lamd executable, does some minor error checking
   3) hboot calls setsid() to exit session
   4) hboot forks
   5) hboot child process closes stdin, stderr, stdout and calls setsid()
   6) hboot child process execs lamd
   7) hboot parent process exits
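
Steps 3 through 7 can be sketched in Python roughly as follows. This
is an illustration of the sequence only, not the hboot source: the
function name is hypothetical, and tolerating a failed first setsid()
(it fails if the caller is already a process group leader) is an
assumption made here so the sketch runs anywhere.

```python
import os


def hboot_like_launch(daemon_argv):
    """Start daemon_argv detached from the caller's session.

    Returns the child's pid in the parent; the real hboot parent
    would simply exit at that point (step 7).
    """
    # Step 3: try to leave the session ssh created for us.
    try:
        os.setsid()
    except PermissionError:
        pass  # already a process group leader; assumption: carry on

    pid = os.fork()  # step 4
    if pid == 0:
        # Step 5: drop stdin, stdout, stderr and start a new session.
        for fd in (0, 1, 2):
            try:
                os.close(fd)
            except OSError:
                pass
        os.setsid()
        # Step 6: become the daemon. If exec fails, exit immediately
        # without running any of the parent's cleanup handlers.
        try:
            os.execvp(daemon_argv[0], daemon_argv)
        finally:
            os._exit(127)
    return pid
```

After this runs, the child is the leader of its own session and has no
open stdio, and the parent is free to exit, which is exactly the point
where a scheduler that tracks the original process tree can lose sight
of the daemon.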

Depending on the batch scheduler, the parent hboot exiting or the
call to setsid() can be problematic. For PBS, for example, we don't
use hboot - lamboot calls tm_spawn to start the lamd processes
directly on the remote nodes, so that the batch scheduler doesn't
"lose" the process tree.

If you have any questions about any of this, feel free to ask.

Hope this helps,

Brian