Table of contents:
- Are MPI programs truly portable?
- How does LAM launch user programs?
- How do I debug LAM/MPI programs?
- How do I launch a debugger for each rank in my process?
- How can I get a separate X window for each rank?
- What environment variables does LAM define?
- Can I run MPI programs with memory-checking tools such as bcheck, valgrind, or purify?
- Is LAM purify clean?
- Why does my memory-checking debugger report memory leaks in LAM?
- Why does my memory-checking debugger report "read from uninitialized" in LAM?
- What does the error message "One of the processes started by mpirun has exited with a nonzero exit code" mean?
- What does the error message "MPI_[function]: process in local group is dead (rank [N], MPI_COMM_WORLD)" mean?
- My application deadlocks in LAM/MPI; it doesn't deadlock in other MPI implementations. Why?
1. Are MPI programs truly portable?
Well, yes and no.
All conformant MPI programs will compile with any conformant MPI
implementation. That is, if you write a correct MPI program, it
should compile just about anywhere (most [if not all] major MPI
implementations provide the correct MPI API), so porting source code -- in
terms of MPI calls -- is not much of an issue.
Indeed, we have "ported" several large scale MPI programs to multiple
architectures with different MPI implementations without much trouble.
The catch is that not every MPI implementation is created equal. The
MPI standard specifies some points very precisely and leaves others
deliberately loose -- most of which was on purpose. That is, the
standard leaves some leeway for the implementor to choose exactly what
(and how) specific actions are to be performed. So even though your
program will most likely compile under all existing MPI
implementations, it may behave slightly differently under different
implementations.
This is not a major concern, but it is something that the MPI
programmer needs to be aware of. Indeed, most MPI-1 implementations
are reasonably similar. However, several MPI-2 functions, for
example, take MPI_Info arguments, which are specifically
designed to be implementation-dependent. This means that even though
the call to, for example, MPI_COMM_SPAWN is completely
portable, the building of the MPI_Info argument for that
call is not.
For that reason, LAM/MPI defines the preprocessor macro
LAM_MPI to be 1. MPI programmers can use
this for LAM-specific code, if necessary. For example:
#if LAM_MPI
/* Do LAM-specific things here */
#endif
MPI_Comm_spawn(.....);
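As a concrete sketch of this portability issue, LAM-specific MPI_Info keys can be guarded the same way. The info key below is invented for illustration only; consult the LAM documentation for the keys that LAM actually recognizes:
MPI_Info info;
MPI_Comm intercomm;

MPI_Info_create(&info);
#if LAM_MPI
/* Hypothetical LAM-specific key -- for illustration only */
MPI_Info_set(info, "lam_example_key", "value");
#endif
MPI_Comm_spawn("child_program", MPI_ARGV_NULL, 2, info, 0,
               MPI_COMM_SELF, &intercomm, MPI_ERRCODES_NULL);
MPI_Info_free(&info);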
2. How does LAM launch user programs?
This question is relevant to debugging: understanding the basics of how LAM launches user programs will help you take advantage of some of LAM's features (such as launching shell scripts) when debugging.
When you use mpirun to invoke a program on a remote node (either with an app schema, or by a simple command line invocation), the command (and any arguments) to be executed is sent to the remote LAM daemon.
The remote LAM daemon does the following (in no particular order):
- Forks a new Unix process
- Sets any environment variables exported from mpirun
- Sets several environment variables (mainly for internal use) that contain all the relevant MPI state information, including all run-mode command line switches to mpirun (this is how LAM initializes your process without adding extra command line arguments)
- Redirects stdin and stdout
- exec's the user's command (with all command line arguments)
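In outline, this is the classic Unix fork/exec pattern. The following minimal C sketch illustrates the idea only -- it is not LAM's actual source, and the environment variable value is a placeholder:
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/wait.h>

static void launch(char *argv[])
{
    pid_t pid = fork();                   /* fork a new Unix process */
    if (pid == 0) {
        setenv("LAMRANK", "0", 1);        /* set internal state (placeholder value) */
        freopen("/dev/null", "r", stdin); /* redirect stdin */
        execvp(argv[0], argv);            /* exec the user's command */
        perror("execvp");                 /* only reached if exec fails */
        exit(1);
    }
}

int main(int argc, char *argv[])
{
    if (argc > 1)
        launch(argv + 1);
    wait(NULL);                           /* reap the child */
    return 0;
}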
The process runs until it (or one of its children) executes MPI_INIT. MPI_INIT will perform some communication with mpirun to obtain the location and identification of all other ranks in MPI_COMM_WORLD. Once each rank knows about all other ranks, if the program was launched with C2C mode, each rank will perform a "dance" to obtain a direct socket to each other rank.
Once all of this has been accomplished, MPI_INIT returns and the user program progresses as normal.
Hence, if you use mpirun to launch non-MPI programs, mpirun will hang while waiting for the non-MPI program to send back its location information.
3. How do I debug LAM/MPI programs?
mpirun can be used to launch non-MPI programs (as long as
the programs that you run eventually launch LAM/MPI programs).
Reading between the lines, this means that you can use
mpirun to launch debuggers on remote nodes. This is
especially helpful to find race conditions, memory problems, etc.,
that were previously very difficult to find because LAM was not
"debugger friendly".
Many users still prefer printf-style debugging
(i.e., inserting printf (C) or WRITE
(Fortran) statements throughout their code), but this is haphazard,
litters your code with spurious output, and can be a
serious detriment to performance -- output to the
screen is extremely slow in comparison to the FLOPS that a computer is
capable of.
Speaking from experience, the LAM Team has found the use of
debuggers to be extremely helpful in debugging MPI
programs. We highly recommend it over
printf debugging (which is why we built in the ability to
have mpirun execute non-LAM/MPI programs).
4. How do I launch a debugger for each rank in my process?
Applies to LAM 6.3 and above
Since all ranks except rank 0 have their stdin tied to
/dev/null, it is necessary to launch text-based debuggers
(such as gdb) in separate X windows. If you are using a
GUI-based debugger, you can simply mpirun that debugger
directly on each node.
For GUI debuggers, you will probably need to export the
DISPLAY environment variable.
NOTE: If you are using the
rsh boot SSI module with the ssh remote agent, you
cannot use SSH's default X forwarding. This is because SSH's X forwarding
only exists while ssh is running, but ssh
will have completed and exited normally before a successful lamboot completes. Hence,
you must generate your own DISPLAY that is suitable for remote nodes to write to your
display.
For text debuggers, you will need a short shell script to launch an
xterm (not all systems have xterm -- other terminal
programs, such as konsole or gnome-terminal, can be used
instead). For example:
% mpirun N -x DISPLAY run_gdb.csh my_program_name
Where run_gdb.csh is a shell script, and
my_program_name is the name of your LAM/MPI executable.
An example run_gdb.csh is shown below:
#!/bin/csh -f
echo "Running GDB on node `hostname`"
xterm -e gdb $*
exit 0
Also note that the DISPLAY environment variable is exported to the
remote nodes with mpirun. This is necessary so that the remote nodes
know where to send the X display of the xterm. Be sure that the
DISPLAY contents are suitable for sending to your display (e.g., setting it
to "your_hostname:0" will be suitable on many systems) and that the host you are running
on has remote access for X enabled. You may need to see the man pages for
xauth(1) and/or xhost(1) for more information on
remote X display authentication.
If you are not running in an X environment, or wish to debug only
one process, you can use a script such as:
#!/bin/csh -f
# Run gdb on rank 0 only; all other ranks run the program normally
if ("$LAMRANK" == "0") then
    gdb $*
else
    $*
endif
exit 0
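Assuming the script above is saved as run_gdb_rank0.csh (the name is arbitrary), it needs no X setup, since rank 0's stdin and stdout are connected to mpirun:
% mpirun C run_gdb_rank0.csh my_program_name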
5. How can I get a separate X window for each rank?
Applies to LAM 6.3 and above
Sometimes it is desirable to launch each rank's process in a different
window. This separates the output (each rank's output appears in its
own window), allows the use of stdin for each rank, etc.
It can be especially handy when debugging LAM/MPI programs.
You will need a short shell script to accomplish this. For
example, the following example script is named run_xterm.csh:
#!/bin/csh -f
echo "Running xterm on `hostname`"
xterm -e $*
exit 0
This will run an xterm with the specified arguments as the command
running in that window (note that not all systems have
xterm -- other terminal programs, such as
konsole or gnome-terminal, can be used instead). The
following mpirun command can
be used to launch this script:
% mpirun C -x DISPLAY run_xterm.csh my_mpi_program
Note how my_mpi_program is given as an argument to
run_xterm.csh, which is then invoked in the
xterm line in the script.
Also note that the DISPLAY environment variable is
exported to the remote nodes with mpirun. This is
necessary so that the remote nodes know where to send the X display of
the xterm. Be sure that the host you are running on has
remote access for X enabled. You may need to see the man pages for
xauth(1) and/or xhost(1).
NOTE: If you are using the
rsh boot SSI module with the ssh remote agent, you
cannot use SSH's default X forwarding. This is because SSH's X forwarding
only exists while ssh is running, but ssh
will have completed and exited normally before a successful lamboot completes. Hence,
you must generate your own DISPLAY that is suitable for remote nodes to write to your
display.
6. What environment variables does LAM define?
Applies to LAM 6.3 and above
Before executing user programs, LAM defines several environment
variables to be inherited by the user process. While the majority of
the variables are only meaningful inside of LAM, the
LAMRANK environment variable may be useful to the user.
The LAMRANK variable will contain a number from 0 to
(n-1), and indicates what rank the process will be in
MPI_COMM_WORLD. This variable can be used to make
decisions at execution time, especially if a shell script is launched
via mpirun. Consider the following shell script:
#!/bin/csh -f
# $* will contain the name of the executable to run, as well as
# all the arguments that were passed in from mpirun.
# This will run the user program (with all arguments from mpirun)
# and direct the output to the files "mpi_output.0" through
# "mpi_output.(n-1)"
$* > mpi_output.$LAMRANK
exit 0
Also note that to launch LAM/MPI executables from within a shell
script that was itself launched by mpirun, you just
execute them directly. Do not use mpirun from within the script!
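A program can also read LAMRANK itself -- even before MPI_Init -- since the LAM daemon sets it before exec'ing the process. A minimal C sketch:
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    /* LAMRANK is set by the LAM daemon before the process is exec'ed,
       so it is readable even before MPI_Init */
    char *rank_str = getenv("LAMRANK");
    if (rank_str != NULL)
        printf("LAMRANK says this process will be rank %s\n", rank_str);

    MPI_Init(&argc, &argv);
    /* ... normal MPI code ... */
    MPI_Finalize();
    return 0;
}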
7. Can I run MPI programs with memory-checking tools such as bcheck, valgrind, or purify?
Yes. Since LAM allows you to mpirun non-MPI programs, you can either
mpirun bcheck or valgrind directly, or
write a short shell script that makes "smart" execution decisions to limit the output.
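For instance, if valgrind is installed in the same location on every node, every rank can be run under it directly (my_mpi_program is a placeholder for your executable's name):
% mpirun C valgrind my_mpi_program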
For example, the following script will only invoke bcheck (the Solaris
native memory-checking debugger) on rank 0, and ensure that the output report files are
in a specific directory:
#!/bin/csh -f
# Only have rank 0 execute bcheck. The LAMRANK environment
# variable contains a number from 0 to (n-1).
if ($LAMRANK == "0") then
# Make a directory based upon the host name
set host=`hostname`
if (! -d $host) mkdir $host
cd $host
bcheck -all $*
else
# If we are not rank 0, just run the executable (and all of
# its arguments)
$*
endif
exit 0
Purify is slightly different -- the purify command must be used to
compile the actual MPI application. Unfortunately, it seems that at least some versions of
Purify don't understand the LAM wrapper compilers (mpicc,
mpiCC, and mpif77). Hence, the typical solution is to
have the LAM wrapper compilers invoke purify (instead of the other way
around). Specifically, the following won't work:
shell$ purify mpicc my_application.c -o my_application
Instead, tell the LAM wrapper compilers to use purify as the underlying
compiler. For example, to set the underlying compiler that the mpicc
wrapper compiler uses, set the environment variable LAMMPICC. For
Bourne-like shells:
shell$ LAMMPICC="purify cc"
shell$ export LAMMPICC
shell$ mpicc my_application.c -o my_application
For csh-like shells:
shell% setenv LAMMPICC "purify cc"
shell% mpicc my_application.c -o my_application
The LAMMPICXX and LAMMPIF77 environment variables
can be used to override the underlying compilers for the mpiCC /
mpic++ and mpif77 wrapper compilers, respectively.
Note that the older (deprecated) environment variable names
LAMHCC, LAMHCP, and LAMHF77 also
still work for version 7.0 and above; these are the only names that work
prior to version 7.0.
WARNING: Do not arbitrarily change the
back-end compiler in the wrapper compilers; Badness can occur (read: seg faults and
other strange behavior in MPI applications) if you arbitrarily mix vendor compilers. For
example, this kind of behavior can occur if LAM was configured and compiled with one
compiler and you change the back end of the wrapper compilers to use a different set of
compilers.
8. Is LAM purify clean?
When compiled with the --with-purify option to configure, LAM 6.3 is purify clean (--with-purify is not the default for configure because it causes a slight performance hit inside of LAM). LAM will function correctly with or without --with-purify.
9. Why does my memory-checking debugger report memory leaks in LAM?
As far as we know, we have plugged all memory leaks in the LAM code. However, there are a few leaks from various operating system calls that we can't do anything about (for example, getpwuid() on Solaris 2.6 leaks a few bytes).
If you find any other memory leaks, please let the LAM Team know so that they can be
fixed in future releases.
10. Why does my memory-checking debugger report "read from uninitialized" in LAM?
LAM has a standard message structure that it uses for most internal
communications. This structure has several fields that are not used
for all types of communications. Where fields are not used, they are
left uninitialized as an optimization. When the message is sent, the
entire message structure is sent, including the uninitialized values.
This is not a problem for LAM, because the receiver will ignore these
fields, but it does generate "read from uninitialized" warnings on
the sending side when using memory-checking debuggers.
The --with-purify option to the LAM configure script enables code within LAM that zeros out all message structures before they are used. This must be selected at compile time; the zeroing code is conditionally compiled into LAM, so it cannot be enabled at run time.
Using LAM with the --with-purify option may cause a slight performance
hit, particularly with the shared memory RPIs. Most users won't
notice the extra overhead, though, since LAM's internal message
headers are a small, constant size (i.e., the zeroing overhead is the
same for a 1 byte message as for a 1 MB message).
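The following hypothetical C sketch shows the pattern (the structure and names are invented for illustration; they are not LAM's actual message header):
#include <string.h>
#include <unistd.h>

struct msg_header {          /* stand-in for LAM's message structure */
    int src;
    int tag;
    int unused_field;        /* not used on this code path */
};

void send_header(int fd, int src, int tag)
{
    struct msg_header h;

#ifdef ZERO_HEADERS          /* analogous to what --with-purify enables */
    memset(&h, 0, sizeof(h));
#endif
    h.src = src;
    h.tag = tag;
    /* The entire struct is written, including unused_field.  Without
       the memset, memory checkers flag this as a read from
       uninitialized memory; the receiver ignores the field, so it is
       harmless in practice. */
    write(fd, &h, sizeof(h));
}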
11. What does the error message "One of the processes started by mpirun has exited with a nonzero exit code" mean?
This means that at least one MPI process exited after invoking MPI_INIT
but before invoking MPI_FINALIZE.
It typically indicates an error in the MPI application. LAM will abort the entire MPI
application upon this error. The last line of the error message indicates the PID, node, and
exit status of the failed process (note that there may be multiple failed processes -- LAM
will only report the first one).
If this is happening to your application, it is recommended that you run your
application
through a memory checking debugger (such as Valgrind, Bcheck, or Purify) and look for
buffer overflows, erroneous memory usage, or other kinds of subtle memory problems. Be
sure to read the FAQ "Can I run MPI programs with memory-checking tools such as
bcheck, valgrind, or purify?".
12. What does the error message "MPI_[function]: process in local group is dead (rank [N], MPI_COMM_WORLD)" mean?
This means that some MPI function tried to communicate with a peer MPI process and
discovered that the peer process is dead.
Common causes of this problem include attempting to communicate with processes
that have failed (which, in some cases, won't generate the "One of the processes started
by [mpirun] has exited..." error message) or that have already invoked
MPI_FINALIZE.
Communication should not be initiated that could involve processes that have already
invoked MPI_FINALIZE. This may include using
MPI_ANY_SOURCE or collectives on
communicators that include processes that have already finalized.
13. My application deadlocks in LAM/MPI; it doesn't deadlock in other MPI implementations. Why?
A common MPI application portability mistake is to assume that sends are
buffered. This is described in detail in the MPI-1 standard.
Consider the following code.
if (rank == 0) {
MPI_Send(..., 1, tag, MPI_COMM_WORLD);
MPI_Recv(..., 1, tag, MPI_COMM_WORLD, &status);
} else if (rank == 1) {
MPI_Send(..., 0, tag, MPI_COMM_WORLD);
MPI_Recv(..., 0, tag, MPI_COMM_WORLD, &status);
}
When messages are not buffered, rank 0's MPI_SEND does not complete
until rank 1's MPI_RECV is posted. Similarly, rank 1's
MPI_SEND does not complete until rank 0's MPI_RECV is
posted. The result is a deadlock. The only case where this does not
deadlock is when the MPI implementation decides to buffer the
MPI_SEND, thereby allowing the MPI_RECV to be posted.
However, such code is not portable, since this way of avoiding
deadlock is network, implementation, and potentially message size
dependent.
There are several ways to fix this problem. Here are two:
- Reverse the order of one of the send/receive pairs:
if (rank == 0) {
MPI_Send(..., 1, tag, MPI_COMM_WORLD);
MPI_Recv(..., 1, tag, MPI_COMM_WORLD, &status);
} else if (rank == 1) {
MPI_Recv(..., 0, tag, MPI_COMM_WORLD, &status);
MPI_Send(..., 0, tag, MPI_COMM_WORLD);
}
- Make at least one of the MPI_SENDs non-blocking
(MPI_ISEND):
if (rank == 0) {
MPI_Isend(..., 1, tag, MPI_COMM_WORLD, &req);
MPI_Recv(..., 1, tag, MPI_COMM_WORLD, &status);
MPI_Wait(&req, &status);
} else if (rank == 1) {
MPI_Recv(..., 0, tag, MPI_COMM_WORLD, &status);
MPI_Send(..., 0, tag, MPI_COMM_WORLD);
}
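A third portable approach -- not in the list above, but standard MPI-1 -- is MPI_Sendrecv, which performs the send and the receive together and lets the implementation avoid the deadlock (buffer arguments elided as in the examples above):
if (rank == 0) {
    MPI_Sendrecv(..., 1, tag, ..., 1, tag, MPI_COMM_WORLD, &status);
} else if (rank == 1) {
    MPI_Sendrecv(..., 0, tag, ..., 0, tag, MPI_COMM_WORLD, &status);
}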