
LAM FAQ: Booting LAM

Table of contents:
  1. Can I run LAM as root?
  2. What does "booting LAM" mean?
  3. Are there any tutorials available on getting started with LAM/MPI?
  4. Can I run LAM/MPI jobs under Globus?
  5. Can I run LAM/MPI jobs on a BProc cluster?
  6. Can I run LAM/MPI jobs under PBS?
  7. What conditions have to be met for LAM to be booted successfully?
  8. How do I add the LAM executables to my $PATH?
  9. I have more than one NIC on a host. Which IP name/address do I list in the boot schema?
  10. What is the recon tool? What do I use it for?
  11. recon succeeded, but lamboot failed. Why?
  12. What is a .rhosts file? Why do I need it?
  13. Should I use "+" in my .rhosts file?
  14. Can I use ssh with LAM (instead of rsh)?
  15. How do I make ssh not ask me for my password?
  16. recon/lamboot claims that it cannot find LAM executables on the remote node. What does that mean?
  17. Does LAM use static port numbers?
  18. Can I lamboot to hosts outside of my firewall?
  19. lamboot seems to hang -- why? And what do I do?
  20. Can I issue multiple lamboot's on a single machine?
  21. How do I lamboot multi-processor machines?



1. Can I run LAM as root?

No. It is a Very Bad Idea to run LAM as root. LAM explicitly disallows root from running any executable except recon (recon is allowed so that sysadmins who are installing LAM can test basic functionality).

The reasons why root should not run LAM executables are almost identical to those listed in the question "Should I run LAM as a root-level service for all my users to access?" in the "Typical setup of LAM" section.



2. What does "booting LAM" mean?

The LAM/MPI environment needs to be "booted" before any user MPI applications can be run.

LAM uses a daemon on each node for process control, meta environment control, and, in some cases, message passing. "Booting LAM" refers to the act of launching these daemons on each node. The lamboot command is used to boot LAM; after a successful lamboot, user programs can be run in the LAM/MPI environment.

Once the user is finished with LAM, the lamhalt command is used to shut down the LAM/MPI environment and remove the daemons from each node. Once the lamhalt command has been successfully run, no more LAM/MPI programs can be invoked until another lamboot is successfully issued.
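
The boot, run, halt cycle described above can be sketched as a small shell function. This is only an illustration: lam_session is a hypothetical helper name and the file names in the usage line are placeholders; lamboot, mpirun, and lamhalt are the LAM commands named in this FAQ.

```shell
#!/bin/sh
# Sketch of a full LAM session: boot the daemons, run one MPI program, halt.
# lam_session is a hypothetical helper name, not part of LAM.
lam_session() {
    hostfile=$1
    program=$2

    # Launch a LAM daemon on every node listed in the boot schema.
    lamboot "$hostfile" || return 1

    # Run the program on every CPU in the booted environment and
    # remember its exit status.
    mpirun C "$program"
    status=$?

    # Always shut the daemons down, even if the program failed.
    lamhalt
    return "$status"
}
```

A call such as "lam_session myhostfile my_mpi_program" would then run one program inside a freshly booted environment. In practice you would usually keep the environment booted and run several programs before calling lamhalt.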



3. Are there any tutorials available on getting started with LAM/MPI?

Yes, there are several. Click on the "tutorials" link in the left-hand navigation.

Here are a few of the tutorials available:



4. Can I run LAM/MPI jobs under Globus?
Applies to LAM 7.0 and above

Yes, but in limited scenarios.

LAM/MPI can boot LAM across a Globus grid using the fork scheduler only. Notes about the globus boot SSI module:

  • The globus boot module will never be selected automatically. You must either select it manually (by setting the boot SSI parameter to "globus") or elevate its priority so that it is selected over the other available boot SSI modules.
  • LAM must be able to find globus-job-run in the local path where lamboot is launched.
  • Since Globus does not run users' "dot" files to set up the environment on the remote nodes, you must specify the location of the LAM executables by setting the lam_install_path attribute for each host in the boot schema (see example below).
  • If the target Globus host is not local, you must specify the contact name in the boot schema. If the contact name contains spaces, the entire contact name must be enclosed in quotes.

The following is an example boot schema for the globus boot module:

"inky:12853:/O=My/OU=Com/CN=HPC Grp" lam_install_path=/opt/lam cpu=2
"pinky:3245:/O=My/OU=Com/CN=HPC Grp" lam_install_path=/opt/lam cpu=4
"blinky:2345:/O=My/OU=Com/CN=HPC Grp" lam_install_path=/opt/lam cpu=4
"clyde:82342:/O=My/OU=Com/CN=HPC Grp" lam_install_path=/software/lam

Be sure to see the LAM/MPI User's Guide for more details about the globus module.



5. Can I run LAM/MPI jobs on a BProc cluster?
Applies to LAM 7.0 and above

Yes.

Ensure that LAM/MPI was compiled and installed with support for the bproc boot SSI module (you can run the laminfo command to see if the bproc boot module is included in your installation). When on a BProc head node, lamboot (etc.) should automatically choose to use the bproc boot module and launch the LAM daemons using native BProc mechanisms.

Notes about the bproc boot SSI module:

  • You still need to have a boot schema (hostfile) to specify what nodes to run on. Although IP hostnames are allowed, it is preferred to use the native BProc node nomenclature of integers to specify nodes. -1 is the head node, and 0, 1, 2, ... are the compute nodes.
  • You must include the head node in the boot schema (-1). Note that LAM will automatically not schedule MPI jobs on the head node when running with "mpirun C my_mpi_program" or "mpirun N my_mpi_program".
  • LAM will refuse to boot on nodes that you do not have permission to run on.

More details about the bproc boot module are available in the LAM/MPI Installation Guide and LAM/MPI User's Guide.
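
As an illustration of the node numbering described above, the following sketch prints a boot schema containing the head node followed by a given number of compute nodes. bproc_schema is a hypothetical helper name, and the output file name in the usage note is a placeholder.

```shell
#!/bin/sh
# Print a boot schema for a BProc cluster: the head node (-1) followed by
# compute nodes 0 .. count-1. bproc_schema is a hypothetical helper name.
bproc_schema() {
    count=$1
    printf '%s\n' -1          # the head node must always be listed
    i=0
    while [ "$i" -lt "$count" ]; do
        printf '%s\n' "$i"    # compute nodes use BProc's integer node numbers
        i=$((i + 1))
    done
}
```

For example, "bproc_schema 4 > bproc_hosts" would write a schema for the head node plus compute nodes 0 through 3, which could then be passed to lamboot.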



6. Can I run LAM/MPI jobs under PBS?
Applies to LAM 7.0 and above

Yes, LAM/MPI can be booted natively in PBS batch jobs (both OpenPBS and PBS Pro).

When used from within a PBS job, lamboot (etc.) will use the native PBS Task Management (TM) interface to launch the LAM daemons on the nodes that were allocated to the job. The tm boot SSI module does this task; use the laminfo command to see if support for the tm module is included in your LAM/MPI installation. Some notes about the tm boot SSI module:

  • Using the tm boot SSI module, PBS is aware of all processes in a job, and therefore provides both accurate accounting and guaranteed cleanup on all nodes in the job.
  • The tm boot module should be selected by default when run in PBS jobs.
  • There is no need to specify a boot schema to the lamboot command (etc.); the PBS TM interface tells LAM what nodes were allocated and how many VCPUs were allocated on each. If a boot schema is supplied, it is ignored.

Be sure to see the LAM/MPI User's Guide for more information about the tm boot SSI module.



7. What conditions have to be met for LAM to be booted successfully?

For each machine that LAM is to be booted on, all of the following conditions must be met:

  • The machine must be reachable and operational.
  • The user must have an account on the machine.
  • The user must be able to rsh(1) to the machine (or use whatever remote shell program was defined at configure time or set in the LAMRSH environment variable). For rsh, permissions typically must be set in the .rhosts file on the machine.
  • The LAM executables must be locatable on that machine, using the shell's search path and possibly the LAMHOME environment variable.
  • The user must be able to write to /tmp.
  • The shell's start-up script must not print anything on standard error. The user can take advantage of the fact that rsh/ssh/whatever will start the shell non-interactively. The start-up script can exit early in this case, before executing many commands relevant only to interactive sessions and likely to generate output.
  • Every machine must be able to resolve the fully-qualified domain name (FQDN) of all the machines being booted (including its own).

All of these prerequisites must be met before LAM can be booted properly.
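
The requirement that the shell start-up script print nothing on standard error can be checked with a small helper like the one below. This is a sketch: stderr_is_quiet is a hypothetical name, and the host name in the usage note is a placeholder.

```shell
#!/bin/sh
# Return success only if the given command writes nothing to standard error.
# stderr_is_quiet is a hypothetical helper name, not a LAM tool.
stderr_is_quiet() {
    # Send stderr to the capture, then discard stdout.
    err=$("$@" 2>&1 >/dev/null)
    [ -z "$err" ]
}
```

Running "stderr_is_quiet rsh node1 true" (or the same with ssh) starts a non-interactive shell on node1 and fails if anything, including output from a start-up script, arrives on standard error.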

NOTE: OSCAR users should already have all of these conditions met. If you are having a problem with lamboot, check that a simple ssh between nodes works properly.



8. How do I add the LAM executables to my $PATH?

LAM must be able to find the LAM executables in your $PATH on every node. As such, your configuration/initialization files need to add the LAM executables to your $PATH properly.

How to do this may be highly dependent upon your local configuration, so you may need to consult with your local system administrator. Some system administrators take care of these details for you, some don't. YMMV. Some common examples are included below, however.

You must have at least a minimal understanding of how your shell works to get the LAM executables into your $PATH properly. Note that the LAM executables must be added to your $PATH in two situations: (1) when you log in to an interactive shell, and (2) when you log in to non-interactive shells on remote nodes.

  • If (1) is not configured properly, executables like mpicc will not be found, and it is typically obvious what is wrong. The LAM executable directory can be added to the $PATH manually, or the user's startup files can be modified so that the LAM executables are added to the $PATH at every login. The latter approach is preferred.

    All shells have some kind of script file that is executed at login time to set things like $PATH and perform other environmental setup tasks. This startup file is the one that needs to be edited to add the LAM executables to the $PATH. Consult the manual page for your shell for specific details (some shells are picky about the permissions of the startup file, for example). The table below lists some common shells and the startup files that they read/execute upon login:

    Shell                                   Interactive login startup file
    sh (Bourne shell, or bash named "sh")   .profile
    csh                                     .cshrc followed by .login
    tcsh                                    .tcshrc if it exists, .cshrc if it does not, followed by .login
    bash                                    .bash_profile if it exists, or .bash_login if it exists, or .profile if it exists (in that order). Note that some Linux distributions automatically come with .bash_profile scripts for users that automatically execute .bashrc as well. Consult the bash man page for more information.

  • If (2) is not configured properly, executables like lamboot will not function properly, and it can be somewhat confusing to figure out (particularly for bash users).

    The startup files in question here are the ones that are automatically executed for a non-interactive login on a remote node (e.g., "rsh othernode ps"). Note that not all shells support this, and that some shells use different files for this than listed in (1). Some shells will supersede (2) with (1). That is, fulfilling (2) may automatically fulfill (1). The following table lists some common shells and the startup file that is automatically executed, either by LAM or by the shell itself:

    Shell                            Non-interactive login startup file
    sh (Bourne or bash named sh)     This shell does not execute any file automatically, so LAM will execute the .profile script before invoking LAM executables on remote nodes
    csh                              .cshrc
    tcsh                             .tcshrc if it exists, or .cshrc if it does not
    bash                             .bashrc if it exists
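
For Bourne-style shells, a startup-file fragment along the following lines is one way to add the LAM executables to the $PATH. It is a sketch only: /opt/lam/bin is an assumed install location and prepend_path is a hypothetical helper name. The fragment belongs in whichever file your shell reads for non-interactive logins, and before any early exit for non-interactive shells.

```shell
# Fragment for a Bourne-style startup file (.profile or .bashrc).
# /opt/lam/bin is an assumed install location; substitute your own.
prepend_path() {
    case ":$PATH:" in
        *":$1:"*) ;;                 # already present, leave PATH alone
        *) PATH="$1:$PATH" ;;        # otherwise put it at the front
    esac
}

prepend_path /opt/lam/bin
export PATH
```

Checking for the directory first keeps the $PATH from growing each time the startup file is read again by a nested shell.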

NOTE: OSCAR users should already have this step taken care of. OSCAR uses a package called switcher to set up the $PATH for users. You may need to set your personal default to use LAM/MPI if it is not already the system default. Consult the OSCAR User's Manual for more details.



9. I have more than one NIC on a host. Which IP name/address do I list in the boot schema?

Two common configurations for setting up clusters for parallel computing are:

  • All the nodes are on a "private" network such that they cannot communicate [directly] with outside networks. One node -- designated as a "master" node -- has two network interface cards (NICs), one of which is connected to the "private" network, and the other is connected to the "public" network. All of the other nodes only have one NIC, which is connected to the private network. So each node has a single IP address, but the master node has two IP addresses.
  • All of the nodes are connected to more than one TCP/IP network. For example, each node has a 10Mbps ethernet NIC and a 100Mbps NIC. Both have a TCP/IP stack. Hence, each node has two IP addresses.

In each case, there's at least one node that has two IP addresses (and potentially two IP names) -- which one should be used in the LAM boot schema?

The answer is to use the IP name/address that refers to the NIC that you want LAM to use for TCP/IP communication (both LAM "meta" information and MPI message passing). LAM will use the NIC associated with the name/address used in the boot schema file. For example, in the first scenario above, the master node should be represented in the boot schema file with the IP address/name of its NIC on the private network. In the second scenario, the IP address/name of each node's 100Mbps NIC should be used to get maximum bandwidth for message passing.

Note that in the first scenario LAM can also work if you specify the IP name/address of the NIC on the public network, provided that the networking on the master node is configured to route traffic from the private network to the public network (usually behind Network Address Translation, or NAT). This is usually not a good idea, however, because it effectively causes extra network hops for traffic from the slave nodes to the master node, and therefore adds latency to message passing. In most cases, the IP name/address for the NIC on the private network should be used.

Also note that LAM will resolve all IP names only on the node where lamboot is executed. Hence, the local name resolution setup only matters on that node; name resolution does not occur on any other node. Internally, LAM only uses IP addresses.

For non-TCP/IP communication mechanisms, LAM will only use these IP addresses for "meta" information.



10. What is the recon tool? What do I use it for?

recon is used to verify that a user has the correct setup to boot LAM properly. It checks to see if LAM can be started on all the nodes in a given boot schema.

Users use recon to check/verify that their shell startup scripts (e.g., .cshrc, .profile, .bashrc, etc.) set the environment properly to ensure that LAM can be started on the local and remote nodes properly.

recon does this by attempting a "fake" boot process on each node in the boot schema. recon will attempt to launch "tkill -N" on each node (the -N option indicates that tkill should not do anything).

If "tkill -N" can be executed successfully on each node, the following has been verified:

  • The user can execute commands on a remote machine
  • The LAM executables can be found on the remote node
  • The LAM executables can be executed on the remote node

Note that this does not guarantee that lamboot will function properly; it only gives a pretty good indication that it will. lamboot can still fail for other reasons.
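
A common pattern is to run recon first and only attempt lamboot when recon succeeds. The sketch below assumes that recon exits with a nonzero status on failure; boot_checked is a hypothetical helper name.

```shell
#!/bin/sh
# Run recon first and only attempt lamboot if recon succeeds.
# boot_checked is a hypothetical helper name; it assumes that recon
# reports failure through a nonzero exit status.
boot_checked() {
    hostfile=$1
    if ! recon "$hostfile"; then
        echo "recon failed: fix remote access or PATH before booting" >&2
        return 1
    fi
    # A passing recon is a good sign, not a guarantee; lamboot can still fail.
    lamboot "$hostfile"
}
```

Calling "boot_checked myhostfile" then reports setup problems before any daemons are launched.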



11. recon succeeded, but lamboot failed. Why?

There can be many reasons.

Note that recon does not do everything that lamboot does, which is why it is only a pretty good test, not a conclusive test. lamboot can sometimes fail with not-particularly-helpful error messages (particularly in LAM versions prior to 7.0).

A common cause for lamboot failure is that one of the hostnames in the boot schema resolved to the address 127.0.0.1. This is fine when there is only one hostname involved (i.e., lambooting on a single machine). However, when the LAM universe consists of more than one machine, none of the hostnames can resolve to the address 127.0.0.1. This is because 127.0.0.1 is a "special" IP address that always maps back to the local machine -- it's the localhost address. So if a node in the LAM universe tries to use the 127.0.0.1 address to try to contact another node in the LAM universe, it will actually be opening a socket to itself, not the intended destination node. And the connection will therefore fail.

Hence, all hostnames in the boot schema must resolve to the IP address of the network interface card (NIC) that you wish LAM to use.

You can tell if "the 127.0.0.1 problem" is happening to you if you lamboot with the -d switch -- see if any of the hboot lines in the debugging output show 127.0.0.1.

Unfortunately, some Linux distributions automatically put the hostname of the machine on the same line as localhost in /etc/hosts. For example, consider the following /etc/hosts file that is on the machine blinky, which is the "master" node in a cluster. blinky has a single NIC, with IP address 192.168.1.10:

127.0.0.1     localhost blinky
192.168.1.10  masternode.example.com masternode
192.168.1.11  node1.example.com node1
192.168.1.12  node2.example.com node2

If the name "blinky" is used in a boot schema with other hosts, the lamboot will fail. The following solutions are available:

  • Use the name "masternode" in the LAM boot schema instead of "blinky". This is probably the easiest and safest solution.
  • Move the name "blinky" to the same line as "masternode".
  • Use all IP addresses instead of names in the boot schema file.

NOTE: Starting with LAM 7.0, LAM will detect this situation and give an error immediately rather than trying to boot and failing. Versions prior to 7.0 will try to boot and abort with amorphous, undescriptive error messages.
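
To spot "the 127.0.0.1 problem" before booting, a hosts file can be scanned for names that share a line with the loopback address. The following is a minimal sketch; loopback_names is a hypothetical helper name, and it treats any name beginning with "localhost" as harmless.

```shell
#!/bin/sh
# List every name, other than localhost aliases, that a hosts file maps to
# 127.0.0.1. loopback_names is a hypothetical helper name.
loopback_names() {
    awk '
        { sub(/#.*/, "") }                 # drop comments
        $1 == "127.0.0.1" {
            for (i = 2; i <= NF; i++)
                if ($i !~ /^localhost/) print $i
        }
    ' "$1"
}
```

Run against the example file above, "loopback_names /etc/hosts" would print blinky, which is exactly the name that must not appear in a multi-host boot schema.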



12. What is a .rhosts file? Why do I need it?

If you are using rsh to launch processes on remote nodes (either by setting this at configure time, letting configure use the default value of "rsh", or by setting the LAMRSH environment variable when you invoke recon or lamboot), you will probably need to have a $HOME/.rhosts file.

This file allows you to execute commands on remote nodes without being prompted for a password. The permissions on this file usually must be 0644 (rw-r--r--). It must exist in your home directory on every node that you plan to use LAM with.

Each line in the .rhosts file indicates a machine and user that programs may be launched from. For example, if the user steve wishes to launch programs from the machine stevemachine to the machines alpha, beta, and gamma, there must be a .rhosts file on each of the three remote machines (alpha, beta, and gamma) with at least the following line in it:

stevemachine steve

The first field indicates the name of the machine where jobs may originate from; the second field indicates the user ID who may originate jobs from that machine. It is better to supply a fully-qualified domain name for the machine name (for security reasons -- there may be many machines named stevemachine on the internet). So the above example should be:

stevemachine.example.com steve

The LAM Team strongly discourages the use of "+" in the .rhosts file. This is always a huge security hole.

If rsh does not find a matching line in the $HOME/.rhosts file, it will prompt you for a password. LAM requires the password-less execution of commands; if rsh prompts for a password, lamboot and recon will fail.

NOTE: Some implementations of rsh are very picky about the format of text in the .rhosts file. In particular, some do not allow leading white space on each line in the .rhosts file, and will give a misleading "permission denied" error if you have white space before the machine name.

NOTE: It should be noted that rsh is not considered "secure" or "safe" -- .rhosts authentication is considered fairly weak. The LAM Team recommends that you use ssh ("Secure Shell") to launch remote programs, as it uses a much stronger authentication system.

NOTE: OSCAR users should not need .rhosts files. OSCAR is configured to automatically use user-level passwordless-ssh between all nodes in the cluster.
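
Two of the .rhosts pitfalls mentioned in this answer, a "+" wildcard and leading white space, are easy to check mechanically. The sketch below is illustrative only; check_rhosts is a hypothetical helper name and it does not validate host names or file permissions.

```shell
#!/bin/sh
# Report two common .rhosts problems: a "+" wildcard in either field and
# leading white space. check_rhosts is a hypothetical helper name.
# The exit status is nonzero if any problem was found.
check_rhosts() {
    awk '
        /^[ \t]+[^ \t]/ { printf "line %d: leading white space\n", NR; bad = 1 }
        $1 == "+" || $2 == "+" { printf "line %d: \"+\" wildcard\n", NR; bad = 1 }
        END { exit bad + 0 }
    ' "$1"
}
```

For example, "check_rhosts $HOME/.rhosts" prints one line per problem and is silent for a clean file.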



13. Should I use "+" in my .rhosts file?

No!

While there are a very small number of cases where using "+" in your .rhosts file may be acceptable, the LAM Team highly recommends that you do not.

Using a "+" in your .rhosts file indicates that you will allow any machine and/or any user to connect as you. This is extremely dangerous, especially on machines that are connected to the internet. Consider the fact that anyone on the internet can connect to your machine (as you) -- it should strike fear into your heart.

The + should not be used for either field of the .rhosts file.

Instead, you should use the full and proper hostname and username of accounts that are authorized to remotely login as you to that machine (or machines). This is usually just a list of your own username on a list of machines that you wish to run LAM over. See the "What is a .rhosts file? Why do I need it?" question for further explanation, as well as your local rsh documentation.

Additionally, the LAM Team strongly recommends that rsh not be used -- it is considered weak remote authentication. Instead, we recommend the use of ssh -- the secure remote shell. See the questions "Can I use ssh with LAM (instead of rsh)?" and "How do I make ssh not ask me for my password?" for more details.



14. Can I use ssh with LAM (instead of rsh)?

Yes, you can change the remote transport agent that LAM uses to spawn the LAM daemons. While rsh is the default, it can be changed to other agents, such as ssh. ssh is a popular choice because of the added security that it provides over the .rhosts security provided by rsh. And since ssh can pass AFS tokens, it presents an attractive, highly secure, yet fully-AFS-authenticated method, for invoking LAM.

If you choose to use ssh, the 1.x series of ssh may require the use of the "-x" command line flag. "-x" prevents X forwarding, which may prevent an xauth status message from being printed on stderr. lamboot, recon, etc. interpret output on stderr to mean that a remote invocation has failed; ssh's "-x" may prevent this. The "-q" option may also be useful for suppressing stderr output; see the ssh documentation.

You can specify to use ssh at configure time with the --with-rsh flag:

% ./configure --with-rsh="ssh -x"

Additionally, in LAM 7.1.1, you can override the remote shell agent that was specified at configure time with the LAMRSH environment variable. Setting this environment variable before invoking recon, lamboot, or any other LAM executable will force LAM to use that remote shell program instead. For example, using a Bourne shell (or some other sh derivative):

% LAMRSH="ssh -x"
% export LAMRSH
% recon myhostfile

Or, using the C shell (or some csh derivative):

% setenv LAMRSH "ssh -x"
% recon myhostfile

NOTE: OSCAR users typically already have LAM setup to use ssh by default.



15. How do I make ssh not ask me for my password?

There are multiple ways.

Note that there are two mainstream versions of ssh. One is the freeware package OpenSSH; the other is SSH, a commercial package from SSH Communications Security Corp.

This documentation provides an overview for using user keys and the OpenSSH 2.x key management agent (if your OpenSSH only supports 1.x key management, you should upgrade). See the OpenSSH documentation for more details and a more thorough description. The process is essentially the same for the commercial SSH, but the command names and filenames are slightly different. Consult the SSH documentation for more details.

References to ssh in this text refer to OpenSSH.

Normally, when you use ssh to connect to a remote host, it will prompt you for your password. However, in order for lamboot and recon to work properly, you need to be able to execute jobs on remote nodes without typing in a password. In order to do this, you will need to set up RSA (ssh 1.x and 2.x) or DSA (ssh 2.x) authentication. We recommend using DSA authentication as it is generally "better" (i.e., more secure) than RSA authentication. As such, this text will describe the process for DSA setup -- RSA setup is analogous, but takes slightly different commands and filenames.

This text will briefly show you the steps involved in doing this, but the ssh documentation is authoritative on these matters and should be consulted for more information.

The first thing that you need to do is generate a DSA key pair with ssh-keygen:

% ssh-keygen -t dsa

Accept the default value for the file in which to store the key ($HOME/.ssh/id_dsa) and enter a passphrase for your keypair. You may choose to not enter a passphrase and therefore obviate the need for using the ssh-agent. However, this weakens the authentication that is possible, because your secret key is [potentially] vulnerable to compromise because it is unencrypted. See the ssh documentation.

Next, copy the $HOME/.ssh/id_dsa.pub file generated by ssh-keygen to $HOME/.ssh/authorized_keys:

% cd $HOME/.ssh
% cp id_dsa.pub authorized_keys

In order for DSA authentication to work, you need to have the $HOME/.ssh directory in your home directory on all the machines you are running LAM on. If your home directory is on a common filesystem, this is already taken care of. If not, you will need to copy the $HOME/.ssh directory to your home directory on all LAM nodes (be sure to do this in a secure manner, perhaps using the scp command, particularly if your secret key is not encrypted).

ssh is very particular about file permissions. Ensure that your home directory on all your machines is set to mode 755, your $HOME/.ssh directory is also set to mode 755, and that the following files inside $HOME/.ssh have the following permissions:

-rw-r--r--  authorized_keys
-rw-------  id_dsa
-rw-r--r--  id_dsa.pub
-rw-r--r--  known_hosts
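
The permissions listed above can be applied in one step with a helper such as the following. This is a sketch under the assumption that the files use the default DSA names shown in this answer; set_ssh_perms is a hypothetical helper name.

```shell
#!/bin/sh
# Apply the permissions listed above to an ssh directory.
# set_ssh_perms is a hypothetical helper name; pass the directory to fix.
set_ssh_perms() {
    dir=$1
    chmod 755 "$dir" || return 1
    # The private key must be readable and writable by its owner only.
    if [ -f "$dir/id_dsa" ]; then
        chmod 600 "$dir/id_dsa"
    fi
    # The remaining files are public and may be world-readable.
    for f in authorized_keys id_dsa.pub known_hosts; do
        if [ -f "$dir/$f" ]; then
            chmod 644 "$dir/$f"
        fi
    done
    return 0
}
```

A typical call would be "set_ssh_perms $HOME/.ssh", repeated on every node whose home directory is not shared.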

You are now set up to use DSA authentication. However, when you ssh to a remote host, you will still be asked for your DSA passphrase (as opposed to your normal password). This is where the ssh-agent program comes in. It allows you to type in your DSA passphrase once, and then have all successive invocations of ssh automatically authenticate you against the remote host. To start up the ssh-agent, type:

% eval `ssh-agent`

You will probably want to start the ssh-agent before you start X, so that all your windows will inherit the environment variables set by this command. Note that some sites invoke ssh-agent for each user upon login automatically; be sure to check and see if there is an ssh-agent running for you already.

Once the ssh-agent is running, you can tell it your passphrase by running the ssh-add command:

% ssh-add $HOME/.ssh/id_dsa

At this point, if you ssh to a remote host that has the same $HOME/.ssh directory as your local one, you should not be prompted for a password. If you are, a common problem is that the permissions in your $HOME/.ssh directory are not as they should be.
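
Whether a host really accepts you without a prompt can be tested non-interactively. The sketch below relies on the OpenSSH BatchMode option, which makes ssh fail rather than ask for a password or passphrase; ssh_is_passwordless is a hypothetical helper name and the host name in the usage note is a placeholder.

```shell
#!/bin/sh
# Succeed only if ssh can run a command on the host without prompting.
# ssh_is_passwordless is a hypothetical helper name.
ssh_is_passwordless() {
    # BatchMode=yes makes ssh fail instead of asking for a password
    # or passphrase.
    ssh -o BatchMode=yes "$1" true
}
```

For example, "ssh_is_passwordless node1 || echo node1 still prompts" can be run for each host in the boot schema before trying recon or lamboot.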

Note that this text has covered the ssh commands in very little detail. Please consult the ssh documentation for more information.

NOTE: OSCAR users should already have passwordless ssh set up, and should not need to perform any of the above steps.



16. recon/lamboot claims that it cannot find LAM executables on the remote node. What does that mean?

When recon or lamboot cannot find the LAM executables on a remote node, it means that LAM tried to invoke a LAM executable on the remote node, and the shell failed to find it. This usually indicates that the directory where the LAM executables are found is not in the user's path.

That is, in the user's $HOME/.cshrc (not the user's $HOME/.login!), $HOME/.profile, $HOME/.bashrc, or whatever other shell startup script is used, the directory for the LAM executables must be put in the path environment variable.

Sometimes the directory is put in the path properly, but after the startup script has exited for non-interactive shells. That is, users typically put the extra path statement at the end of their .cshrc (or whatever) file -- this may not be the Right Thing to do.

If your .cshrc file has a line similar to the following:

if ($?USER == 0 || $?prompt == 0) exit

then you must set the path before this line.



17. Does LAM use static port numbers?

No. The lamboot command sets up sockets between all nodes in the system. Both the sockets that are used and the port numbers used to connect them are completely dynamic.

Similarly, when MPI_INIT is invoked in user programs, additional sockets may be set up. These sockets, and the port numbers used to connect them, are also completely dynamic.

This may be changed in a future release if enough users ask for static port numbers.



18. Can I lamboot to hosts outside of my firewall?

Since LAM does not use static port numbers, it would be very difficult to map predictable holes through a firewall to allow LAM to boot properly. Additionally, in C2C mode, user MPI programs will establish further dynamic sockets.

Until LAM supports static port numbers, launching LAM jobs through a firewall is highly unlikely to work.



19. lamboot seems to hang -- why? And what do I do?

If lamboot seems to hang for no discernible reason, use the -d switch to either recon or lamboot. This will provide a lot of information on exactly what LAM is trying to do at each step of the way.

The -d switch also sends a lot of debugging output to the system logs (syslog) from the LAM daemons on each node. This output can also be quite helpful in finding problems. The system logs are typically located in directories such as /var/adm or /var/log, but your system's setup may be different.



20. Can I issue multiple lamboot's on a single machine?

While there is nothing to prevent you from executing lamboot multiple times on the same host, it probably does not do what you expect. lamboot will kill any running MPI programs and any pre-existing LAM daemon by the same user on a given node before starting up a new LAM daemon.

That is, in most cases, there can only be one LAM daemon per user on a node at any given time -- and this is usually sufficient for most users. It is a common misconception that you need multiple LAM environments to run multiple user MPI programs simultaneously. This is not true -- you can have a single LAM/MPI environment booted, and run multiple user MPI programs in the same environment (even on the same nodes).

Exceptions to this are when running under a batch queueing system -- the batch scheduler may schedule multiple jobs by the same user to the same node. In this case, there clearly need to be multiple LAM daemons owned by the same user on the same node.

LAM 7.1.1 will automatically do the Right Thing for lamboot commands executed inside of PBS, SGE, and LSF batch jobs. That is, if LAM detects that it is running in a PBS, SGE, or LSF job, it will automatically adapt itself to allow one LAM daemon per batch job (vs. the default behavior of one LAM daemon per user per node), even if those jobs overlap nodes.

Users of any other batch system may manually set this behavior with LAM 7.1.1 in their batch script files by inserting the following lines before the lamboot command (for sh-related shells):

  LAM_MPI_SESSION_SUFFIX="${BATCH_JOBID}"
  export LAM_MPI_SESSION_SUFFIX
  # ...rest of script, to include the lamboot command

and for csh-related shells:

  setenv LAM_MPI_SESSION_SUFFIX "${BATCH_JOBID}"
  # ...rest of script, to include the lamboot command

where users of other batch systems would use an appropriate environment variable that gives the batch job ID instead of $BATCH_JOBID. Consult your batch system's documentation.



21. How do I lamboot multi-processor machines?

lamboot has been extended to understand multiple CPUs on a single host, and is intended to be used in conjunction with the new "C" mpirun syntax for running on SMP machines (see the section on mpirun). Multiple CPUs can be indicated in two ways: list a hostname multiple times, or add a "cpu=N" phrase to the host line (where "N" is the number of CPUs available on that host). For example, the following hostfile:

        blinky
        blinky
        blinky
        blinky
        pinky cpu=2
indicates that there are four CPUs available on the "blinky" host, and that there are two CPUs available on the "pinky" host. Note that this works nicely in a PBS environment, because PBS will list a host multiple times when multiple vnodes on a single node have been allocated by the scheduler.

It is important to note that LAM has no concept of CPU scheduling issues -- that is the operating system's responsibility. Specifying "cpu=N" or listing a hostname multiple times in a boot schema file is simply shorthand for indicating to LAM how many processes you will want to launch on a given machine. In the above example, if the machine blinky really only has two processors (instead of four, as it is listed), LAM will still launch four user processes (see the "Running LAM/MPI applications" section of the FAQ) on blinky because it was listed this way in the boot schema. The operating system is responsible for scheduling those four processes between blinky's two CPUs.
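
To see how many CPUs a simple boot schema declares for each host, the two notations can be tallied with a short script. This is a sketch, not a LAM tool: schema_cpus is a hypothetical helper name, it assumes that mixing the two notations for one host adds up, and it does not handle quoted host names that contain spaces.

```shell
#!/bin/sh
# Count how many CPUs a boot schema assigns to each host. A host line
# counts as N CPUs when it carries cpu=N and as one CPU otherwise;
# repeated host lines add up. schema_cpus is a hypothetical helper name.
schema_cpus() {
    awk '
        NF == 0 { next }             # skip blank lines
        {
            n = 1
            for (i = 2; i <= NF; i++)
                if ($i ~ /^cpu=[0-9]+$/) { split($i, kv, "="); n = kv[2] }
            total[$1] += n
        }
        END { for (h in total) print h, total[h] }
    ' "$1" | sort
}
```

For the example hostfile above, this would report four CPUs for blinky and two for pinky, which is the number of processes "mpirun C" would launch on each.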

Note that different usernames can be specified for specific hosts as well. For example:

       blinky cpu=2 user=lamguest
specifies that the username "lamguest" should be used to log in to the machine "blinky". This is different from the previous syntax for specifying usernames for remote nodes; the old syntax (not even described here :-) is still available, but its use is deprecated.
