Table of contents:
- What exactly is considered a "heterogeneous" cluster?
- Does LAM/MPI work on heterogeneous clusters?
- Do different versions of LAM/MPI constitute heterogeneous clusters?
- How do I install LAM on a heterogeneous cluster?
- How do I lamboot across a heterogeneous cluster?
- How do I execute the right binary on each node for each architecture
in a heterogeneous system?
- Can I mix 32 and 64 bit executables in a single parallel MPI job?
1. What exactly is considered a "heterogeneous" cluster?
It is probably easier to define what a "homogeneous" cluster is -- a
"heterogeneous" cluster is anything that is not a "homogeneous" cluster.
A homogeneous cluster is one where all the nodes have the same:
- Architecture
- Operating system (including the same OS version)
- Key component libraries, such as
libc or
glibc on Linux and freeware BSD operating systems (as
above, including the same library versions)
The first requirement -- same architecture -- has a bit of leeway.
For example, two Pentium III machines with different amounts of RAM
or a different CPU speed would still be considered homogeneous. In
general, homogeneity is determined by whether the software compiled on
one machine can run natively on another. In the case of the
same CPU but different amounts of RAM or a different CPU speed, this
is most likely true. It is not necessarily true between a
Pentium II and a Pentium III, for example, since code compiled for a
Pentium III may use instructions that a Pentium II does not support.
For example, the following cluster is considered homogeneous:
- 32 Pentium III machines, each running a stock Red Hat 7.1
installation updated with all the most recent patches from Red Hat.
Note that it wasn't necessary to list the Linux kernel version
and/or glibc version because they're all the same by
virtue of being the same Linux distribution and version.
The following are some example clusters that are not
homogeneous -- they are heterogeneous:
- 16 Pentium III nodes running Red Hat 7.1, 16 Pentium III nodes
running Red Hat 7.0. Yes, even a minor difference in operating system
version is "different enough" to make the cluster heterogeneous.
- 16 Pentium III nodes running Red Hat 7.1, 16 Pentium III nodes
running Mandrake 8.0. This one is questionable, since Mandrake
professes to be compatible with Red Hat. So to be safe, call it
heterogeneous.
- 16 Pentium III nodes running Red Hat 7.1, 16 Pentium III nodes
running SuSE 7.2. This is most likely heterogeneous since the Linux
distributions are different; it is possible that the Linux kernel
versions are different, different versions of the GNU compilers are
installed, and/or different versions of
glibc are used, etc.
- 16 Pentium III nodes running Red Hat 7.1, 16 Pentium III nodes
running OpenBSD 2.9. These are clearly two different operating systems.
- 16 Pentium II nodes and 16 Pentium III nodes all running Red Hat
7.1. You could play some tricks and treat this as a homogeneous
cluster, but it is probably safer (and more efficient) to treat this
as a heterogeneous cluster.
- 16 SunBlade 1000 nodes running Solaris 8, 16 SunBlade nodes
running Solaris 9. The operating system difference makes this
heterogeneous.
- 16 SunBlade 1000 nodes running Solaris 8, 16 Pentium III nodes
running Red Hat 7.1. The architecture difference makes this
heterogeneous.
2. Does LAM/MPI work on heterogeneous clusters?
Yes -- that's one of the reasons that LAM/MPI exists.
LAM/MPI will work across just about any flavor of POSIX-like operating system (with a few restrictions). That is, you can have two completely different machines (e.g., a Sun machine and an Intel-based machine), and LAM will run on both of them. More importantly, you can run a single parallel job that spans both of them.
LAM will transparently do any data conversion necessary.
An important restriction is that LAM does not currently support systems whose datatypes have different sizes. For example, if an integer is 64 bits on one machine and 32 bits on another, LAM's behavior is undefined. LAM also requires that floating point formats be the same: endianness can differ, but the same general format must be used by all participating machines. For example, older Alpha machines do not adhere to the IEEE floating point standard by default; such machines can be used in parallel jobs with other similar machines, but using them in a heterogeneous job requires enabling IEEE floating point so that all nodes in the parallel job understand the same floating point format.
Indeed, what is the Right Thing for an MPI to do in these kinds of situations, anyway? There really is no good answer -- having MPI truncate 64 bit integers when they are sent to 32 bit integers is not desirable, nor is having MPI translate from one floating point format to another (for similar loss-of-precision reasons).
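For example, a trivial MPI program like the sketch below (standard MPI calls only, nothing LAM-specific) can be compiled separately for each architecture and then run as a single job spanning, say, a big-endian Sun node and a little-endian Intel node; LAM converts the representation of the MPI_INT on the wire:

#include <stdio.h>
#include <mpi.h>

/* Run with at least two processes: rank 0 sends an int to rank 1.
   If the two ranks run on machines with different endianness, LAM
   converts the data representation transparently. */
int main(int argc, char *argv[])
{
    int rank, value;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0) {
        value = 42;
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &status);
        printf("rank 1 received %d\n", value);
    }
    MPI_Finalize();
    return 0;
}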
3. Do different versions of LAM/MPI constitute heterogeneous clusters?
Strictly speaking, yes.
BUT different versions of LAM will not work together. In order to successfully lamboot and mpirun, you must use the same version of LAM/MPI on all nodes, regardless of their operating system, architecture, etc.
So a better answer is really: Yes, but don't ever, ever do this.
4. How do I install LAM on a heterogeneous cluster?
In general, LAM must be compiled and installed separately for each
different kind of node in a heterogeneous cluster.
Other questions in this FAQ discuss how to install LAM across a
[homogeneous] cluster -- there are two general schemes:
- Install LAM on one node, and make the directory tree that LAM was
installed to available to all nodes via a networked filesystem (such
as NFS)
- Physically install LAM on each node in the cluster
Both of these methods are possible for heterogeneous clusters as
well. Physically installing LAM on each node in the cluster is the
safest, least complicated way to do this. However, it is potentially
the most labor intensive, and most difficult to maintain over time.
In most cases, there will be multiple nodes of each kind in a
heterogeneous cluster. As such, it may be useful to consider a
heterogeneous cluster to be a group of homogeneous clusters. So
although local policies and requirements may vary, the LAM Team
recommends that LAM be installed on a networked filesystem in each
homogeneous cluster.
NOTE: There are some scalability issues
with using networked filesystems on large clusters. As such, it may
not be sufficient or desirable to use the common filesystem model at
your site, depending on the size of your cluster and your choice of
networked filesystem. YMMV.
For example, consider a cluster of 16 Pentium II nodes running Red
Hat 7.0 and a second group of 16 Pentium III nodes running Red Hat 7.1.
Both the architecture difference and operating system difference make
these sub-clusters heterogeneous.
In the common filesystem model, LAM will need to be installed
twice for the heterogeneous cluster described above -- once for the
PII/RH7.0 machines, and once for the PIII/RH7.1 machines. Each
machine in the cluster will then need to mount the appropriate LAM
installation, and/or user paths will need to be set appropriately on
each node in the cluster to point to the appropriate LAM installation.
The same holds true for more obviously-heterogeneous clusters,
such as a group of UltraSparc machines running Solaris and a group of
Pentium III machines running some flavor of Linux.
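For instance, a hedged sketch of the two builds for the PII/RH7.0 and PIII/RH7.1 example above might look like the following, run once on a node of each kind (the --prefix values are only illustrative; LAM uses a standard GNU configure / make install process):

On a PII/RH7.0 node:
shell$ ./configure --prefix=/home/lam/linux-redhat7.0
shell$ make
shell$ make install

On a PIII/RH7.1 node:
shell$ ./configure --prefix=/home/lam/linux-redhat7.1
shell$ make
shell$ make install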
5. How do I lamboot across a heterogeneous cluster?
In addition to the normal requirements for lamboot, the
following requirements must also be satisfied:
- All nodes being lambooted must be using the same
version of LAM (this is actually always a requirement -- it is just a
clarification that "heterogeneous" does not mean "different versions
of LAM/MPI").
- Each user's $PATH must be set up properly to find the
Right version of LAM/MPI on each node. That is, if multiple
installations of LAM are available on each node, the user's
$PATH must be set to find the appropriate installation
for that node. For example, suppose LAM is installed on a networked
filesystem for three different kinds of nodes in:
/home/lam/sparc-sun-solaris2.8
/home/lam/linux-redhat7.1
/home/lam/linux-suse7.2
If /home/lam is NFS mounted on all nodes in the
cluster, the user's $PATH must be set to use one of those
three trees as appropriate for the kind of node that they are logged
in to. This is typically set in the user's dot files (e.g.,
$HOME/.profile, $HOME/.cshrc, etc.), or in a
system-wide default dot file (these vary between different operating
systems).
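As a hedged illustration only (the paths and the tests used to tell the node types apart are site-specific assumptions), a Bourne-shell $HOME/.profile fragment might select the right tree like this:

# Pick the LAM installation that matches this node; adjust the tests
# and paths for your site.
case "`uname -s`" in
    SunOS)
        LAMHOME=/home/lam/sparc-sun-solaris2.8
        ;;
    Linux)
        if [ -f /etc/redhat-release ]; then
            LAMHOME=/home/lam/linux-redhat7.1
        else
            LAMHOME=/home/lam/linux-suse7.2
        fi
        ;;
esac
PATH=$LAMHOME/bin:$PATH
export PATH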
6. How do I execute the right binary on each node for each architecture in a heterogeneous system?
There are three cases:
- If the right binaries are in the current working directory, and
the current working directory is available on all nodes,
mpiexec can execute them directly. For example:
shell$ mpiexec -arch linux my_mpi_program.linux : \
-arch solaris my_mpi_program.solaris
LAM will look for Linux architecture nodes in the current universe
and launch the executable my_mpi_program.linux.
Similarly, LAM will launch the executable
my_mpi_program.solaris on all Solaris nodes in the
universe. The string after the -arch switch specifies a
text string to match from the output of the GNU
config.guess script (i.e., the output from
laminfo in the architecture line).
The -arch switch to mpiexec can be
used in other cases (e.g., absolute path names); this is just one
example. See the manual page for mpiexec(1) for more
details.
- If the
$PATH variable is set correctly for each node
that LAM uses (i.e., separate directories exist containing MPI
binaries for each architecture, and the correct directory for each
architecture is inserted into the $PATH on each node),
mpirun C foo will automatically find the foo
for the right architecture.
- However, most users do not set their
$PATH variable
in this fashion. If mpiexec is not suitable, you will
more than likely need to use an application schema ("app schema") file
for this case. In the app schema, it is usually easiest to specify
the absolute pathname of the program for each node. For example,
using the following boot schema file:
sun1
sun2
hp1
hp2
redhat1
redhat2
suse1
suse2
we can use the following app schema file to launch the "right" copy
of foo for each architecture:
n0-1 /home/jshmo/mpi/sun-sparc-solaris2.6/foo
n2-3 /home/jshmo/mpi/hppa2.0w-hp-hpux11.00/foo
n4-5 /home/jshmo/mpi/linux-redhat7.1/foo
n6-7 /home/jshmo/mpi/linux-suse7.2/foo
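Assuming the boot schema above is saved in a file named bhost and the app schema in a file named appfile (both file names are hypothetical), the job would then be started along these lines:

shell$ lamboot bhost
shell$ mpirun appfile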
Remember, it may be necessary to have different versions of the
MPI binary for each OS version as well as each machine architecture.
For example, you may need separate versions for Solaris 2.5
and 2.6. This is also true when running between different Linux
distributions -- as shown in the example above, where Red Hat and SuSE
are considered different operating systems and therefore have their
own copies of foo.
7. Can I mix 32 and 64 bit executables in a single parallel MPI job?
By definition, a mixture of 32 and 64 bit machines is a heterogeneous cluster.
LAM/MPI allows two possibilities for mixing 32 and 64 bit machines in a single
parallel job:
- Most 64 bit operating systems have the capacity to generate 32 bit executables. By
doing so, one can make the cluster "homogeneous" (at least in terms of bit size). Once all
the executables (including relevant libraries) are 32 bit, one can run MPI jobs as if the
cluster were homogeneous. Note that the LAM/MPI libraries and executables should also be
built as 32 bit libraries/executables (see the sketch at the end of this answer).
This solution works well and avoids many complicated
situations that arise from mixing 32 and 64 bit executables and are outside the scope
of MPI (see the discussion below).
- The differences in datatype sizes between 32 and 64 bit machines are likely to create
problems. Consider the scenario where a 64 bit process sends a message containing
MPI_LONG data to a 32 bit process. What is the size of the datatype? On the 64 bit
machine, each MPI_LONG is likely to be 64 bits, but on the 32 bit machine, it is likely
to be 32 bits. So what should the 32 bit process do when it receives the data?
There is, unfortunately, no good answer to this. Obvious choices include raising an
error or truncating the data, neither of which is attractive. Debugging such applications
is non-trivial, and therefore this is not the preferred solution.
NOTE: LAM/MPI has not been tested in
this kind of configuration! It may work (if the user application stays away
from messages with mismatched data sizes), but it may not... Consider yourself
warned.
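As a sketch of the first approach (assuming a GCC-based toolchain on a 64 bit Linux node, and a LAM installation that was itself configured and built as 32 bit), an application might be compiled with the compiler's 32 bit flag passed through LAM's mpicc wrapper; the file names are hypothetical:

shell$ mpicc -m32 -o my_mpi_program my_mpi_program.c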