Table of contents:
- How do I compile my LAM/MPI program?
- How do I change the compilers that mpicc, mpic++/mpiCC, and mpif77 use?
- My Fortran MPI program fails to link! Why?
- My C++ MPI program fails to link! Why?
- Can MPI jobs be checkpointed and restarted?
- Does LAM/MPI support Myrinet?
- Does LAM/MPI support Infiniband?
- Can I run multi-process MPI applications on a single machine?
- How do I measure the performance of my parallel program?
- What directory does my LAM/MPI program run in on the remote nodes?
- How does LAM find binaries that are invoked from mpirun?
- Why doesn't "mpirun -np 4 test" work?
- Can I run multiple LAM/MPI programs simultaneously?
- Can I pass environment variables to my LAM/MPI processes on the remote nodes upon invocation?
- mpirun -c and mpirun -np -- what's the difference?
- What is "pseudo-tty support"? Do I want that?
- Why can't my process read stdin?
- Why can only rank 0 read from stdin?
- What is the lamd RPI module?
- Why would I use the lamd RPI module (vs. other RPI modules)?
- How do I run LAM/MPI user programs on multi-processor machines?
- Can I mix multi-processor machines with uni-processor machines in a single LAM/MPI user program run?
- How do I run an MPMD program? More specifically -- how do I start different binaries on each node?
- How do I mpirun across a heterogeneous cluster?
- My LAM/MPI process doesn't seem to reach MPI_INIT. Why?
- My LAM/MPI process seems to get "stuck" -- it runs for a while and then just hangs. Why?
- TCP performance under Linux 2.2.0-2.2.9 just plain sucks! Why?
1. How do I compile my LAM/MPI program?
The mpicc, mpic++/mpiCC, and
mpif77 "wrapper" compilers are provided to compile C,
C++, and Fortran LAM/MPI programs (respectively).
These so-called "wrapper" compilers insert all the relevant compiler and linker flags -- that is, the flags specifying the directories where the LAM include files and libraries reside, which are required to compile and link LAM/MPI programs. Rather than forcing the user to supply these flags manually, the wrapper compilers simply take all user arguments, pass them through to the underlying compiler, add several flags indicating the location of LAM's include files and libraries, and link in the relevant libraries.
What this all means is that compiling LAM/MPI programs is very simple:
shell$ mpicc myprogram.c -o myprogram
will compile a C program.
shell$ mpiCC myprogram.CC -o myCPPprogram
shell$ mpif77 myprogram.f -o myFprogram
will compile a C++ and Fortran LAM/MPI program, respectively.
Additionally, the wrapper compilers can be used to produce object
files, which can be linked later:
shell$ mpicc -c foo.c
shell$ mpicc -c bar.c
shell$ mpicc -c baz.c
shell$ mpicc foo.o bar.o baz.o -o myprogram
It is not necessary to add -lmpi to any of the wrapper
compiler commands; this is implicit in all the wrapper compilers when
an executable is to be linked.
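For reference, here is a minimal C MPI program (a hypothetical hello.c) that could be compiled with the wrapper commands shown above:
#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank, size;
    MPI_Init(&argc, &argv);               /* start up MPI */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank); /* this process's rank */
    MPI_Comm_size(MPI_COMM_WORLD, &size); /* total number of processes */
    printf("Hello from rank %d of %d\n", rank, size);
    MPI_Finalize();                       /* shut down MPI */
    return 0;
}
Compiling it is then simply "mpicc hello.c -o hello".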
2. How do I change the compilers that mpicc, mpic++/mpiCC, and mpif77 use?
The mpicc, mpic++/mpiCC, and
mpif77 compilers are really "wrapper" compilers to an
underlying compiler. That is, they only add several command line
switches to the underlying compiler for the convenience of the user.
These switches include the relevant LAM directories where the include
and library files reside, the relevant LAM libraries to link in, etc.
As such, the underlying compiler can be selected both at compile
time and via environment variables at run time. The environment
variables LAMHCC, LAMHCP, and
LAMHF77, if defined, override the underlying compiler
that mpicc, mpic++/mpiCC, and
mpif77 (respectively) will invoke.
For example, to override the default C compiler when using a Bourne shell (or sh-derived shell):
shell$ LAMHCC=some_other_cc_compiler
shell$ export LAMHCC
shell$ mpicc myfile.c -o myfile
or, when using a C shell (or csh derivative):
shell% setenv LAMHCC some_other_cc_compiler
shell% mpicc myfile.c -o myfile
A common use for this feature is to change the underlying Fortran
77 compiler to a Fortran 90 compiler for the mpif77
wrapper compiler.
Note that starting with LAM 6.3, it is not necessary to specify the
-lmpi at the end of the compile line. It is still
necessary with previous versions of LAM.
WARNING: It may not be a
Good Idea to change the default compiler set from the one with which
LAM was compiled, particularly the Fortran and C++ compilers (for
Fortran/C++ programs, or user programs that use the Fortran or MPI 2
C++ bindings). This is because different compilers may use different
internal linkage and/or have conflicts with header files and other
system-level muckety-muck.
If you need to change the default compiler, you may wish to first ensure that you can link .o files from the two compilers together into a single executable that works properly.
3. My Fortran MPI program fails to link! Why?
When an MPI Fortran program fails to link, it is usually due to one of two common problems:
- Not using the LAM mpif77 wrapper compiler.
- Using a different underlying Fortran compiler than LAM was compiled with.
The LAM Team strongly recommends using the wrapper compiler
mpif77 to compile and link all Fortran MPI programs.
mpif77 adds in any
relevant compiler and/or linker flags to compile and link MPI
programs. Note that this list of flags may be different depending on
how LAM/MPI was configured, so it is not always safe to figure out
what mpif77 is adding and then add those flags to
your own compile/link command line manually (and not use
mpif77).
Additionally, it is almost always important to use the same underlying Fortran compiler that LAM was compiled with. Although mpif77 allows the user to change the underlying Fortran compiler that is invoked, it is typically not a good idea to do this because different Fortran compilers use different "name-mangling" schemes for their link-time symbols. A common symptom of this is changing the underlying Fortran compiler that mpif77 uses and then seeing link-time error messages similar to the following:
pi.o(.text+0x59): undefined reference to `MPI_INIT'
pi.o(.text+0x8a): undefined reference to `MPI_COMM_RANK'
pi.o(.text+0xb6): undefined reference to `MPI_COMM_SIZE'
pi.o(.text+0x5ba): undefined reference to `MPI_FINALIZE'
This typically indicates that LAM was configured and compiled with one Fortran compiler (that uses one particular name-mangling scheme), and the user's program was compiled with a different underlying Fortran compiler (that uses a different name-mangling scheme).
The solution is to use the same Fortran compiler that LAM was
configured with. If you need to use a different Fortran compiler, you
will need to re-configure and re-install LAM to use that Fortran
compiler. Use the --with-fc switch to
configure.
4. My C++ MPI program fails to link! Why?
The common problems here are almost identical to when Fortran MPI
programs fail to link:
- Be sure to use the mpiCC (or mpic++) wrapper compiler
- Use the same underlying C++ compiler that LAM was configured / compiled with
See the question "My Fortran MPI program fails to link! Why?" for
more details.
5. Can MPI jobs be checkpointed and restarted?
Applies to LAM 7.0 and above
Yes. Generally, for an MPI job to be checkpointable:
- The same checkpoint/restart SSI module must be selected on all MPI
processes in the MPI job.
- All SSI modules selected for use in the MPI job must include
support for checkpoint/restart. At the time of this writing,
crtcp is the only RPI SSI module that includes support
for checkpoint/restart. All collective SSI modules support
checkpoint/restart.
- Currently, only MPI-1 jobs can be checkpointed. The behavior of
jobs performing non-local MPI-2 functions (e.g., dynamic functions to
launch new MPI processes) in the presence of checkpoint and restart is
undefined.
- Checkpoints can only occur after all processes in the job invoke
MPI_INIT and before any process invokes
MPI_FINALIZE.
LAM/MPI currently only supports the Berkeley Lab Checkpoint-Restart (BLCR)
system. Support for BLCR must be available in the LAM/MPI
installation (this can be checked with the laminfo command).
Unfortunately, at the time of LAM/MPI's initial 7.0 release, the
BLCR software was not yet available to the general public. Keep
checking the BLCR web page for updates.
See the LAM/MPI User's Guide for more details about checkpointing
and restarting MPI jobs.
6. Does LAM/MPI support Myrinet?
Applies to LAM 7.0 and above
Yes. The gm RPI SSI module provides low latency, high
bandwidth using the native Myrinet GM message passing library. You
can check to see if your LAM/MPI installation has support for native
GM message passing by running the laminfo command.
Unless some other module was selected as the default, the
gm RPI SSI module should select itself as the RPI to be
used if Myrinet hardware is available.
There is no need to specify to LAM which port to use; in most
cases, the gm module will search and find an available
port to use on every node in the MPI job.
Be sure to see the LAM/MPI User's Guide for more details about the
gm RPI SSI module.
7. Does LAM/MPI support Infiniband?
Yes. The ib RPI SSI module provides low latency and high bandwidth over Infiniband networks using the Mellanox Verbs Interface (VAPI). You can check to see if your LAM/MPI installation
has support for IB message passing by running the laminfo
command.
Be sure to see the LAM/MPI User's Guide for more details about
the ib RPI SSI module.
8. Can I run multi-process MPI applications on a single machine?
Yes. This is actually a common way to test parallel applications. You can run all the processes of your parallel application on a single machine. LAM/MPI allows you to launch multiple processes on a single machine, regardless of how many CPUs are actually present.
A common way of doing this is by using the default boot schema that is installed by
LAM/MPI -- it contains a single node: the localhost. If you run lamboot
with no arguments, the default boot schema will be used, and (assuming it hasn't been
replaced), will launch a LAM universe consisting of just your local machine.
Then use the -np option to mpirun to specify the
desired number of processes to launch. For example:
shell$ mpirun -np 4 my_mpi_application
will start 4 instances of my_mpi_application on the local machine.
For more information see the LAM/MPI User's Guide (including the Quick Start Tutorial), and the lamboot(1), mpirun(1), and bhost(5) man pages.
9. How do I measure the performance of my parallel program?
In short, the only real meaningful metric of the performance of a
parallel application is the wall clock execution time.
"User", "System", and "CPU" times are generally not useful because
they only contain portions of the overall run-time, and have little
meaning in a parallel application that spans multiple nodes
(especially in heterogeneous situations). The use of wall-clock time
encompasses the entirety of the performance of the parallel
application -- all processes, all I/O, all message passing, etc.
Trying to measure single components of this overall time is difficult
(and usually impossible) since each system has many different sources
of overhead (some less obvious than others).
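If you want to measure wall-clock time from within the application itself, the standard MPI_Wtime() function returns elapsed wall-clock time in seconds. A minimal sketch (the do_work() routine and the rank variable are hypothetical placeholders):
double t_start, t_end;

MPI_Barrier(MPI_COMM_WORLD);   /* optional: synchronize before timing */
t_start = MPI_Wtime();
do_work();                     /* hypothetical: the code being measured */
MPI_Barrier(MPI_COMM_WORLD);   /* optional: wait for all ranks to finish */
t_end = MPI_Wtime();
if (rank == 0)
    printf("elapsed wall-clock time: %f seconds\n", t_end - t_start);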
10. What directory does my LAM/MPI program run in on the remote nodes?
The default behavior for mpirun is to change all present
working directories to the directory where mpirun was
launched from. If this directory does not exist on the remote nodes,
the present working directory is set to $HOME.
This behavior can be overridden with the -D or
-wd command line switches to mpirun.
-wd can be used to set an arbitrary working directory.
For example:
shell$ mpirun -wd /home/jshmo/mpi N my_mpi_program
will change the present working directory to
/home/jshmo/mpi (on all nodes), and then attempt to run
the my_mpi_program program. my_mpi_program
must be in the $PATH (which may include ".", i.e.,
/home/jshmo/mpi).
A popular shortcut for mpirun is:
shell$ cd /home/jshmo/mpi
shell$ mpirun N `pwd`/my_mpi_program
although this assumes that
/home/jshmo/mpi/my_mpi_program exists on all nodes.
11. How does LAM find binaries that are invoked from mpirun?
When you mpirun a relative file name (e.g., foo), LAM tries to find the application foo in your $PATH on all nodes to execute it. This follows the Unix/shell model of execution. If you mpirun an absolute filename, LAM simply tries to execute that absolute filename on all nodes. That is:
% mpirun C foo
will depend on the user's $PATH on each machine to find foo.
% mpirun C /home/jshmo/mpi/foo
will simply execute /home/jshmo/mpi/foo on all CPUs. The $PATH environment variable is not used in this case.
This model allows users to set the $PATH environment variable properly in their .cshrc, .profile, or other shell startup script to find the right executables for their architecture. That is, when running LAM in a heterogeneous situation, if the user's shell startup script sets the $PATH appropriately on each node, mpirun foo may find different foo executables on each node (which is probably what you want).
For example, if running on a cluster of Sun and HP workstations, if the user's .cshrc sets /home/jshmo/mpi/SUN in the $PATH on Sun machines, and sets /home/jshmo/mpi/HP in the $PATH on HP machines, mpirun foo will find the foo in the SUN directory on the Sun workstations, and find the foo in the HP directory on the HP workstations.
LAM attempts to change to the directory (on the remote nodes) of the same name as the pwd from where mpirun was invoked (unless overridden with the -wd or -D command line options to mpirun -- see the manual page for mpirun(1) for more details). This can affect the $PATH search if "." is in the $PATH.
12. Why doesn't "mpirun -np 4 test" work?
If attempting to run a test program named "test" with a
command similar to "mpirun -np 4 test" fails with an
error message similar to "It seems that [at least] one of processes
that was started with mpirun did not invoke MPI_INIT before
quitting...", then you've run into a well-known problem that is not
really an MPI issue.
More often than not, mpirun will find the unix
utility "test" before it finds your MPI program named
"test". This is typically because the unix utility
test can be found early in your path, such as in
/bin/test or /usr/bin/test. See the FAQ
question "How does LAM find binaries that are invoked from mpirun?"
There are some easy solutions to this problem:
- Rename your program to something other than test
- Use the full pathname in the mpirun command line, such as "mpirun -np 4 /home/jshmo/mpi/test" (assuming that /home/jshmo/mpi/test is a valid executable on all nodes)
13. Can I run multiple LAM/MPI programs simultaneously?
Yes. Once you lamboot, you can run as many processes as you wish. For example, if you wish to run two different applications on a group of nodes:
% mpirun c0-3 program1
% mpirun c4-7 program2
program1 will be run on the first four CPUs, and program2 will be run on the last four CPUs. Neither program will interfere with the other; LAM guarantees that no messages from either application will overlap.
There is no need to issue a second lamboot.
14. Can I pass environment variables to my LAM/MPI processes on the remote nodes upon invocation?
Applies to LAM 6.3 and above
Yes. The -x option to mpirun will explicitly pass environment variables to remote processes, and instantiate them before the user program is invoked (i.e., before main()). Multiple environment variables may be listed with the -x option, separated by commas:
% mpirun C -x DISPLAY,ALPHA_VALUE,BETA_VALUE myprogram
Additionally, all environment variables that have names that begin with LAM_MPI_ will automatically be exported to remote processes. The -nx option to mpirun will prevent this behavior. -x and -nx can be used together:
% mpirun C -nx -x DISPLAY,ALPHA_VALUE,BETA_VALUE myprogram
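Once exported with -x, the variables can be read from the MPI processes with the usual getenv() call. A small sketch (ALPHA_VALUE is just the example variable name from the command line above):
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    char *alpha;
    MPI_Init(&argc, &argv);
    alpha = getenv("ALPHA_VALUE");   /* passed from "mpirun -x ALPHA_VALUE ..." */
    if (alpha != NULL)
        printf("ALPHA_VALUE is: %s\n", alpha);
    MPI_Finalize();
    return 0;
}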
15. mpirun -c and mpirun -np -- what's the difference?
They are very similar -- you can almost think of -c as a
synonym for -np.
The only difference is that you still need to specify a set of LAM
nodes with the -c option:
shell$ mpirun N -c 4 myprogram
will launch a total of 4 copies of myprogram,
potentially using all nodes available in LAM. For example, if there are 4 nodes, then each node would get one process. If there are only 2 nodes, each node would get two processes. If there are 6 nodes, the first four would each get a single process (and the last two would get none). More to the point:
shell$ mpirun n0-1 -c 4 myprogram
will launch a total of 4 processes on the first two nodes in LAM
(i.e., 2 processes per node).
shell$ mpirun -np 4 myprogram
implies N (or C) -- a total of 4
processes will be launched, potentially using all nodes in LAM.
16. What is "pseudo-tty support"? Do I want that?
Pseudo-tty support enables, among other things, line-buffered output from the remote nodes. This is usually a Good Thing -- the stdout and stderr from multiple nodes will not overlap each other on the same line. This is probably what you want -- orderly output from all your nodes, as opposed to jumbled and potentially overlapping output.
Starting with LAM 6.5, pseudo-tty support is enabled by default. It can be turned off with the -npty command line option to mpirun.
17. Why can't my process read stdin?
When I execute the following code fragment:
int x;
printf("enter x =");
scanf("%d", &x );
and enter the value of x as, say, 5, I get the following error:
5: Command not found.
My application does not seem to be reading standard input.
The solution to this is to use the -w option to mpirun
(or don't use the -nw option). This makes mpirun wait
for your MPI application to terminate. If you use -nw,
mpirun terminates after starting the application, and you return to
the shell with your MPI application running in the background and
competing with the shell for input.
In the example above, the shell, rather than the application, got the input 5 and could not find any command named 5.
18. Why can only rank 0 read from stdin?
LAM connects the stdin on all other ranks to
/dev/null. There simply is no better way to route the
standard input to all the different ranks.
If you need to use stdin on all of your ranks, you may
wish to write a shell script that executes an xterm (or
some other graphic command shell window) and then runs your MPI
application. Take the following shell script as an example:
#!/bin/csh -f
echo "MPI app on `hostname`: $DISPLAY"
xterm -e my_mpi_application
exit 0
If you mpirun this shell script (and export the DISPLAY
environment variable properly), an xterm window will pop
up on your display for each MPI rank with your MPI application running
in it.
Note that you will need to set up your environment to allow remote
X requests to your DISPLAY. This is typically achieved
with the xauth and/or xhost commands (not
discussed here).
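Another common approach (sketched here only as an illustration; it is not specific to LAM/MPI, and assumes rank was obtained from MPI_Comm_rank) is to have rank 0 read from stdin and then broadcast the data to the other ranks with MPI_Bcast:
int value = 0;
if (rank == 0) {
    /* only rank 0 has a usable stdin */
    printf("enter a value: ");
    fflush(stdout);
    scanf("%d", &value);
}
/* distribute rank 0's value to every rank in MPI_COMM_WORLD */
MPI_Bcast(&value, 1, MPI_INT, 0, MPI_COMM_WORLD);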
19. What is the lamd RPI module?
In the lamd RPI module, all MPI messages are passed
between ranks via the LAM daemons that are launched at
lamboot. That is, for a message between process A and
process B, the message actually follows the route:
Process A ---> LAM daemon on node where process A resides
                                 |
                                 v
Process B <--- LAM daemon on node where process B resides
Note that the message actually takes 3 hops before it
reaches its destination. Also note that the LAM daemon where process
A resides may be the same as the LAM daemon where process B resides --
if process A and process B reside on the same node, they share a
common LAM daemon. In this case, there is only a total of two hops
for the message to go from process A to process B.
All other RPI modules generally send messages directly from one MPI
process to its target process. For example, a message from process A
to process B traverses the following path:
Process A ---> Process B
That is, the LAM daemons are not involved in the communication at
all. All MPI messages take 1 hop to end up on the receiving side.
This begs the obvious question: why would you choose to use the lamd RPI module, given that it is definitely slower than most other RPI modules? See the next FAQ question.
20. Why would I use the lamd RPI module (vs. other RPI modules)?
Although the lamd RPI module is typically slower than
other RPI modules (because MPI messages generally must take two extra
hops before ending up at their destination), the lamd RPI
has the following advantages over its peer RPI modules:
- Third party applications such as XMPI can monitor message passing,
and create reports on patterns and behavior of your MPI program.
- The LAM daemon can exhibit true asynchronous message passing
behavior. That is, the LAM daemon is effectively a separate thread of
execution, and can therefore make progress on message passing even
while the MPI application is not in an MPI function call. Since
LAM/MPI is currently a single-threaded MPI implementation, most other
RPI modules will only make progress on message passing while in MPI
function calls.
Therefore, MPI applications that can use latency-hiding techniques
can actually achieve good performance from the lamd RPI
module, even though the latency is higher than other RPI modules.
This strategy has been discussed on the LAM mailing list.
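For illustration, a rough sketch of the kind of latency hiding referred to above (sendbuf, count, dest, tag, and the compute() routine are hypothetical placeholders): start a non-blocking send, do useful work, and only then wait for completion. With the lamd RPI, the message can make progress in the daemon while compute() runs:
MPI_Request req;
MPI_Status status;

MPI_Isend(sendbuf, count, MPI_DOUBLE, dest, tag, MPI_COMM_WORLD, &req);
compute();                 /* hypothetical: useful work overlapped with the send */
MPI_Wait(&req, &status);   /* complete the send before reusing sendbuf */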
21. How do I run LAM/MPI user programs on multi-processor machines?
There are two options:
- New "C" syntax has been added to mpirun (note that this section
also applies to the "
lamexec" command). When running on
SMP machines, it is frequently desirable to group as many adjoining
ranks as possible on a single node in order to maximize shared memory
message passing. When used in conjunction with the extended
bootschema syntax (that allows the specification of number of CPUs
available on each host), the mpirun "C" syntax will run one executable
on each available CPU, and will group adjoining
MPI_COMM_WORLD ranks on the same nodes. For example,
when running on two SMPs, the first having four CPUs, and the second
having two CPUs, the following command:
shell$ mpirun C my_mpi_program
will run four copies of my_mpi_program on the four-way
SMP (MPI_COMM_WORLD ranks 0 through 3), and will run two copies of
my_mpi_program on the two-way SMP (MPI_COMM_WORLD ranks 4
and 5).
Just like the "N" syntax in mpirun, the "C" syntax can also be
used to indicate specific CPUs. For example:
shell$ mpirun c4,5 my_mpi_program
runs my_mpi_program on the fourth and fifth CPUs (i.e.,
the two-way SMP from the previous examples). "C" and "cX" syntax can
also be combined:
shell% mpirun c0 C master-slave
could be used to launch a "master" process (i.e., rank 0 in MPI_COMM_WORLD) on CPU zero, and a slave process on every CPU (including CPU zero). This may be desirable, for example, in situations where the master rank does very little computation.
The behavior of "-np" has been altered to match the "C" semantics. "-np" now schedules across CPUs, not nodes. Using "-np 6" in the previous example would be the same as "C"; using "-np 4" would run four copies of my_mpi_program on the four-way SMP.
Also note that "N", "nX", "C", and "cX" syntax can all be used simultaneously, although it is not clear that this is really useful.
- An application schema file can be used to specify exactly what is
launched on each node. See the question "How do I run an MPMD program?"
22. Can I mix multi-processor machines with uni-processor machines in a single LAM/MPI user program run?
Yes. LAM makes no restriction on what machines you can run on. LAM also allows you to specify which binaries (and how many) to run on each node. There are three ways to launch different numbers of jobs on nodes in a LAM cluster:
- lamboot has been extended to understand multiple CPUs on a single
host, and is intended to be used in conjunction with the new "C"
mpirun syntax for running on SMP machines (see the section on mpirun).
Multiple CPUs can be indicated in two ways: list a hostname multiple times, or add a "cpu=N" phrase to the host line (where "N" is the number of CPUs available on that host). For example, the following hostfile:
blinky
blinky
blinky
blinky
pinky cpu=2
indicates that there are four CPUs available on the "blinky" host, and
that there are two CPUs available on the "pinky" host. Note that this
works nicely in a PBS environment, because PBS will list a host
multiple times when multiple vnodes on a single node have been
allocated by the scheduler.
After this boot schema has been successfully booted, you can run on all CPUs with:
mpirun C foo
This will run four copies of foo on blinky and two copies of foo on pinky.
- Order the nodes in your boot schema such that your SMPs are grouped together. For example:
uniprocessor1
uniprocessor2
smp2way1
smp2way2
smp2way3
smp2way4
smp4way1
smp4way2
In the above boot schema, the first two machines are uniprocessors, the next four are 2-way SMPs, and the last 2 are 4-way SMPs. Launching an SPMD MPI process with the "correct" number of processes on each node can be accomplished with the following mpirun command:
% mpirun n0-1 n2-5 n2-5 n6-7 n6-7 n6-7 n6-7 myprogram
While not elegant, it will definitely work. The following table lists the nodes on which each rank will be launched:
| Rank | Node name | Node ID |
| 0 | uniprocessor1 | n0 |
| 1 | uniprocessor2 | n1 |
| 2 | smp2way1 | n2 |
| 3 | smp2way2 | n3 |
| 4 | smp2way3 | n4 |
| 5 | smp2way4 | n5 |
| 6 | smp2way1 | n2 |
| 7 | smp2way2 | n3 |
| 8 | smp2way3 | n4 |
| 9 | smp2way4 | n5 |
| 10 | smp4way1 | n6 |
| 11 | smp4way2 | n7 |
| 12 | smp4way1 | n6 |
| 13 | smp4way2 | n7 |
| 14 | smp4way1 | n6 |
| 15 | smp4way2 | n7 |
| 16 | smp4way1 | n6 |
| 17 | smp4way2 | n7 |
- Use an application schema file. Especially with N-way SMPs (where N>2), or for a large number of non-uniform SMPs, it can be an easier method of launching than many command line parameters to mpirun. See the question "How do I run an MPMD program?"
23. How do I run an MPMD program? More specifically -- how do I start different binaries on each node?
The easiest method is with the mpiexec command (v7.0 and
above). Other cases can use an application schema file. There are
two common scenarios where launching different executables on
different nodes are necessary: MPMD jobs and heterogeneous jobs.
-
mpiexec examples:
For MPMD jobs, multiple executables can be listed on the same
mpiexec command line:
shell$ mpiexec c0 manager : C worker
will launch manager on CPU 0, and launch
worker everywhere else. Heterogeneous environments can
benefit from this behavior as well; since different executables need
to be created for each architecture, it may be desirable to place them
in the same directory and name them differently. For example:
shell$ mpiexec -arch linux my_mpi_program.linux : \
-arch solaris my_mpi_program.solaris
This will launch my_mpi_program.linux on every Linux
node found in the current universe, and
my_mpi_program.solaris on every Solaris node in the LAM
universe. Note that the default scheduling in this case is by
node, not by CPU.
See the "Hetrogeneity" section of this FAQ for more details, as
well as that mpiexec(1) man page and the LAM/MPI User's
Guide.
- Alternatively, an application schema file (see appschema(5); frequently abbreviated "app schema") can be used to specify the exact binaries and run-time options that are started on each node in the LAM system.
An app schema can be used to start different binaries on each node,
specify different run-time options to different nodes, and/or start
different numbers of binaries on each node.
An app schema is an ASCII file that lists, one per line, a node ID
(or group of node IDs), the binary to be run, and all run time
options. For example:
c0 manager
C worker
This application schema starts a manager process on
c0, and also starts a worker process on all
CPUs that LAM was booted on. Note that this puts two processes on c0 -- manager and the first of the worker processes. To avoid this overlap, the following app schema can be used:
c0 manager
c1-8 worker
Note that all LAM options must come before the binary file
name (this is new starting with LAM 6.2b). User-specific command line
arguments can come after the binary name.
See the appschema(5) manual page for more information.
Note that LAM has no concept of scheduling on CPUs -- this is the responsibility of the operating system. The "C" notation is simply a convenient representation of how many jobs should be launched on each node. LAM will launch that many jobs and let the operating system handle all scheduling/CPU issues. So the prior example would not necessarily have manager and the first worker competing over the first CPU (assuming that c0 is located on an SMP); it simply means that LAM would schedule (M+1) programs on a machine with M processors.
24. How do I mpirun across a heterogeneous cluster?
This question is discussed in detail in the Heterogeneous section of this FAQ.
25. My LAM/MPI process doesn't seem to reach MPI_INIT. Why?
This can be for many reasons. Among the most common are:
26. My LAM/MPI process seems to get "stuck" -- it runs for a while and then just hangs. Why?
This typically indicates either an error in communication patterns, or
code that assumes a large degree of message buffering on LAM's part
(which can result in deadlock).
The first case (an error in communication patterns) is usually
fairly easy to find: use the daemon mode of communication (i.e.,
specify the -lamd option to mpirun), and use
the mpitask command to check the state of the running
program. If the program does become deadlocked due to incorrect
communication patterns, mpitask will show the messages
that are queued up within LAM, as well as the MPI function that is
blocked in each process.
The second case is typically due to sending a large number of messages (or a small number of large messages) without matching receives. More to the point, in poorly ordered message passing sequences, LAM's message queues (or the queues of the underlying operating system or native message passing system) can fill up with pending messages that have not been received yet. Consider the following code snippet:
/* post all the sends first... */
for (i = 0; i < size; i++)
    if (i != my_rank)
        MPI_Send(buf[i], MSG_SIZE, MPI_BYTE, i, tag, MPI_COMM_WORLD);

/* ...and only then post the matching receives */
for (i = 0; i < size; i++)
    if (i != my_rank)
        MPI_Recv(buf[i], MSG_SIZE, MPI_BYTE, i, tag, MPI_COMM_WORLD, &status);
Notice that all the sends must complete before any of the receives are posted. For small values of MSG_SIZE, LAM will execute this program correctly. For larger values of MSG_SIZE, the program will "hang" because LAM's internal message buffers/queues are exhausted, and the processes deadlock while waiting for them to drain. (This is actually an issue for all MPI implementations -- it is not unique to LAM. LAM actually has more buffering capability than most other MPI implementations.)
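One common way to restructure such code so that it does not rely on internal buffering (a sketch only, not specific to LAM/MPI; MAX_PROCS and recvbuf are hypothetical additions, since a buffer cannot be sent and received into at the same time) is to post non-blocking receives before the sends:
MPI_Request reqs[MAX_PROCS];   /* hypothetical upper bound on "size" */
MPI_Status  stats[MAX_PROCS];
int nreqs = 0;

/* post all receives first (non-blocking, so this returns immediately) */
for (i = 0; i < size; i++)
    if (i != my_rank)
        MPI_Irecv(recvbuf[i], MSG_SIZE, MPI_BYTE, i, tag,
                  MPI_COMM_WORLD, &reqs[nreqs++]);

/* now every send has a matching receive already waiting for it */
for (i = 0; i < size; i++)
    if (i != my_rank)
        MPI_Send(buf[i], MSG_SIZE, MPI_BYTE, i, tag, MPI_COMM_WORLD);

/* wait for all receives to complete */
MPI_Waitall(nreqs, reqs, stats);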
27. TCP performance under Linux 2.2.0-2.2.9 just plain sucks! Why?
This is a problem in the 2.2.0-2.2.9 series of Linux kernels. There is a specific message size at which LAM's performance drops off dramatically. There has been considerable discussion on the Linux
kernel newsgroups about whose fault this is -- the kernel, or the
application.
The problem appears to have been fixed starting with Linux 2.2.10.
If you have a kernel before this version, you should probably upgrade.
A much more comprehensive discussion of the problem
is available here.