Nick Nevin
Ohio Supercomputer Center Technical Report OSC-TR-1996-4
Columbus, Ohio
This report provides some measurements of the performance of the
LAM 6.0 and MPICH 1.0.12 implementations of the Message Passing
Interface (MPI).
A PostScript version of this report and other related papers are
available from the LAM download
site.
Test Conditions
The tests were run on an FDDI network cluster of eight DEC 3000/300
workstations running OSF/1 V3.2.
All benchmarks and both libraries were compiled with the standard
DEC C compiler with -O optimization.
The MPICH MPI library
was configured for the ch_p4 device. The default configuration was
used apart from setting -O level compiler optimization and setting
-nodevdebug in order to exclude debugging overhead.
The LAM
MPI library was
configured in the standard way with -O level compiler
optimization.
All LAM tests used the -c2c, -nger and -O switches to mpirun. The
first selects client-to-client mode in which the LAM library bypasses
the daemon and clients communicate directly. The second turns off the
Guaranteed Envelope Resources feature of LAM. The third informs the
LAM/MPI library that the cluster is homogeneous and hence turns off
data conversion.
No special run-time switches were used when running MPICH tests. The
MPICH library detects upon initialization that the cluster is
homogeneous.
The LAM and MPICH libraries differ in how they set up communication
channels between MPI processes. In client-to-client mode, LAM sets up a
fully connected network at initialization time, whereas MPICH makes
connections on a demand-driven basis. To ensure that connection setup
time was not included in the tests, all the benchmark programs perform
some communications before the timing phase in order to force the
establishment of all the necessary connections.
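As an illustration, the warm-up can be as simple as a single round of
zero-byte exchanges between every pair of processes. The sketch below is
an assumed shape for such a warm-up, not the code actually used in the
benchmark programs.

    #include <mpi.h>

    /* Assumed warm-up routine: force every pairwise connection to be
     * established before the timing phase. Each rank exchanges one
     * zero-byte message with every other rank; MPI_Sendrecv opens the
     * connection in both directions. */
    static void warm_up(void)
    {
        int rank, size, peer;
        MPI_Status status;

        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        for (peer = 0; peer < size; peer++) {
            if (peer == rank)
                continue;
            MPI_Sendrecv(NULL, 0, MPI_BYTE, peer, 0,
                         NULL, 0, MPI_BYTE, peer, 0,
                         MPI_COMM_WORLD, &status);
        }
    }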
The Benchmarks
A suite of six benchmark programs was used. In these programs all the
MPI communications measured use datatype MPI_BYTE, which ensures that
neither message data conversion nor packing is done.
The ping and ping-pong tests measure blocking
point-to-point communication performance. Both tests are run in
an MPI_COMM_WORLD of size two, with each process on a separate node.
The barrier, broadcast, gather and
alltoall tests measure the performance of the corresponding
MPI collective communication functions. These tests are run in an
MPI_COMM_WORLD of size eight, with one process per node.
Timings were done with MPI_Wtime, which in both libraries is
implemented on top of the UNIX gettimeofday system call. Since the
granularity of gettimeofday is not very fine, timings are obtained by
surrounding a loop of communications with calls to MPI_Wtime and
dividing the difference of the times thus obtained by the number of
iterations performed. We call this final measure of elapsed time an
observation.
For each benchmark and each data size considered, we run an
experiment in which 20 observations are measured as described
above. The final data point is then the mean of these 20
observations.
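For concreteness, one observation might be taken as sketched below, here
using MPI_Barrier as the measured operation (as in the barrier
benchmark). The iteration count and program structure are assumptions
rather than the exact code used for this report.

    #include <mpi.h>
    #include <stdio.h>

    #define NITER 100               /* iterations per observation (assumed) */

    int main(int argc, char **argv)
    {
        int i, rank;
        double t0, t1, observation;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        MPI_Barrier(MPI_COMM_WORLD);        /* warm-up, not timed */

        t0 = MPI_Wtime();
        for (i = 0; i < NITER; i++)
            MPI_Barrier(MPI_COMM_WORLD);
        t1 = MPI_Wtime();

        observation = (t1 - t0) / NITER;    /* seconds per barrier */
        if (rank == 0)
            printf("observation: %f seconds\n", observation);

        MPI_Finalize();
        return 0;
    }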
In this report we present in graphical form the mean times over the 20
observations for each experiment. All times are given in seconds.
The raw data includes the mean,
standard deviation, minimum and maximum of the 20 observations.
Ping
In this test one process is run on each of two nodes from the cluster.
Process rank 0 loops calling MPI_Send with destination rank 1. Process
rank 1 loops calling MPI_Recv with source rank 0.
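A minimal sketch of the timed loop is given below, assuming a message
buffer of nbytes bytes and an iteration count niter; the names and the
function itself are illustrative, not taken from the report.

    #include <mpi.h>
    #include <stdlib.h>

    /* Assumed shape of one ping observation: rank 0 sends niter messages
     * of nbytes bytes to rank 1, which receives them. Returns the mean
     * time per iteration on the calling rank. */
    double ping(int nbytes, int niter)
    {
        char *buf = malloc(nbytes > 0 ? nbytes : 1);
        int i, rank;
        double t0, t1;
        MPI_Status status;

        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        t0 = MPI_Wtime();
        for (i = 0; i < niter; i++) {
            if (rank == 0)
                MPI_Send(buf, nbytes, MPI_BYTE, 1, 0, MPI_COMM_WORLD);
            else
                MPI_Recv(buf, nbytes, MPI_BYTE, 0, 0, MPI_COMM_WORLD, &status);
        }
        t1 = MPI_Wtime();

        free(buf);
        return (t1 - t0) / niter;
    }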
LAM 6.0 client-to-client mode by default uses a short message protocol
on messages up to 8192 bytes in length. It switches over to a long
message protocol for longer messages. By default MPICH changes
protocol at 16384 bytes. The effect of the LAM protocol can be seen
quite clearly here and in the ping-pong benchmark. The
maximum length of a short message can be changed in both
implementations at compile time by setting the appropriate
constant.
[Graph: ping times for LAM and MPICH vs. message size]
Ping-Pong
This test is similar to the ping test except that here the
two processes both send and receive. Process rank 0 loops calling
MPI_Send with destination rank 1 followed by MPI_Recv from source rank
1. Process rank 1 loops calling MPI_Recv with source rank 0 followed
by MPI_Send with destination rank 0.
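The timed loop differs from the ping sketch only in its body; one round
trip per iteration might look as follows, again with an illustrative
buffer buf of nbytes bytes.

    #include <mpi.h>

    /* Assumed body of one ping-pong iteration: a full round trip between
     * rank 0 and rank 1 using the same buffer in both directions. */
    void pingpong_iteration(char *buf, int nbytes)
    {
        int rank;
        MPI_Status status;

        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        if (rank == 0) {
            MPI_Send(buf, nbytes, MPI_BYTE, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, nbytes, MPI_BYTE, 1, 0, MPI_COMM_WORLD, &status);
        } else {
            MPI_Recv(buf, nbytes, MPI_BYTE, 0, 0, MPI_COMM_WORLD, &status);
            MPI_Send(buf, nbytes, MPI_BYTE, 0, 0, MPI_COMM_WORLD);
        }
    }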
[Graph: ping-pong times for LAM and MPICH vs. message size]
Barrier
In this test one process is run on each of the eight nodes in the
cluster. Each process loops calling MPI_Barrier. The time reported is
for process rank 0.
         mean (s)   st. dev. (s)
LAM      0.005185   0.000957
MPICH    0.007268   0.000189
Broadcast
This benchmark is designed along the lines of the methodology
described in [1]. One MPI process is run on each of the eight nodes
in the cluster. For each non-root process in the broadcast, we time at
the root a loop of broadcasts followed by the receipt of a zero-length
message from that non-root process; the non-root process performs the
same loop of broadcasts, followed immediately by the send of a
zero-length message to the root.
The maximum of these times over all the non-root processes then gives an
estimate of the maximum time taken by any process participating
in the broadcast. This maximum time is what is shown in the graph,
plotted against the byte count, which refers to the size of the data
sent by the root to each process. The timings for the individual leaf
processes can be found in the complete data listing.
Note that the MPICH implementation does no communication for a data
size of zero. LAM does not check for this special case, and the root
sends zero-length messages.
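One way to code the measurement for a single non-root process is
sketched below; the function name, the acknowledgement tag and the loop
count are illustrative assumptions. The estimate described above would
then be the maximum of the root's result over all non-root ranks.

    #include <mpi.h>
    #include <stdlib.h>

    /* Assumed sketch of the broadcast measurement for one non-root rank
     * "acker": the root times a loop of broadcasts terminated by a
     * zero-length acknowledgement from acker. Every rank takes part in
     * the broadcasts. Returns the root's mean time per broadcast and
     * zero on the other ranks. */
    double bcast_probe(int acker, int nbytes, int niter)
    {
        char *buf = malloc(nbytes > 0 ? nbytes : 1);
        int i, rank;
        double t0 = 0.0, t1 = 0.0;
        MPI_Status status;

        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0)
            t0 = MPI_Wtime();

        for (i = 0; i < niter; i++)
            MPI_Bcast(buf, nbytes, MPI_BYTE, 0, MPI_COMM_WORLD);

        if (rank == acker)              /* signal completion to the root */
            MPI_Send(NULL, 0, MPI_BYTE, 0, 1, MPI_COMM_WORLD);
        if (rank == 0) {
            MPI_Recv(NULL, 0, MPI_BYTE, acker, 1, MPI_COMM_WORLD, &status);
            t1 = MPI_Wtime();
        }

        free(buf);
        return (rank == 0) ? (t1 - t0) / niter : 0.0;
    }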
[Graph: maximum broadcast times for LAM and MPICH vs. message size]
Gather
In this test one process is run on each of the eight nodes in the
cluster. Each process loops calling MPI_Gather with root 0. The time
reported is for process rank 0 and the byte count refers to the size
of the data sent by each process to the root.
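A sketch of one gather observation, with illustrative names, might look
as follows; note that the receive buffer only needs to hold the full set
of contributions at the root.

    #include <mpi.h>
    #include <stdlib.h>

    /* Assumed sketch of one gather observation: in each iteration every
     * rank contributes nbytes bytes (the byte count on the graph) and
     * root 0 receives nprocs * nbytes bytes. */
    double gather_obs(int nbytes, int niter)
    {
        char *sendbuf, *recvbuf = NULL;
        int i, rank, nprocs;
        double t0, t1;

        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        sendbuf = malloc(nbytes > 0 ? nbytes : 1);
        if (rank == 0)
            recvbuf = malloc(nbytes > 0 ? nbytes * nprocs : 1);

        t0 = MPI_Wtime();
        for (i = 0; i < niter; i++)
            MPI_Gather(sendbuf, nbytes, MPI_BYTE,
                       recvbuf, nbytes, MPI_BYTE, 0, MPI_COMM_WORLD);
        t1 = MPI_Wtime();

        free(sendbuf);
        free(recvbuf);
        return (t1 - t0) / niter;
    }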
[Graph: gather times for LAM and MPICH vs. message size]
Alltoall
In this test one process is run on each of the eight nodes in the
cluster. Each process loops calling MPI_Alltoall. The time reported
is for process rank 0 and the byte count refers to the size of the
data sent by each process to each other process.
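The timed loop follows the same pattern as the gather sketch; only the
call itself and the buffer sizing differ, since here both buffers hold
one block per process (names again illustrative).

    #include <mpi.h>

    /* Assumed body of one alltoall iteration: nbytes is the amount sent
     * by each process to each other process, so sendbuf and recvbuf must
     * each hold nprocs * nbytes bytes. */
    void alltoall_iteration(char *sendbuf, char *recvbuf, int nbytes)
    {
        MPI_Alltoall(sendbuf, nbytes, MPI_BYTE,
                     recvbuf, nbytes, MPI_BYTE, MPI_COMM_WORLD);
    }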
[Graph: alltoall times for LAM and MPICH vs. message size]
References
[1] Nupairoj, Natawat and Lionel M. Ni,
"Benchmarking of Multicast Communication Services", Technical Report
MSU-CPS-ACS-103, Michigan State University, April 1, 1995.