Several people on the LAM mailing list have reported odd
performance behavior with LAM running under Linux
2.2.x. To determine whether this is a problem
with Linux or with LAM, we are running an extensive set of
performance tests.
The Test Programs
All of the source code for the test programs is available in a
gzip'ed tar ball.
MPI Test Program
The Test Program is a simple "ping-pong" program that
sends a message back
and forth between two nodes, timing how long the round
trip takes. The test is repeated for message sizes from 1
byte to 8 megabytes. The output is in a format that can
be executed directly by Matlab to produce graphs.
Note: If you wish to just view the graphs,
you can use the graphomatic.
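The size sweep and the Matlab-ready output described above can be sketched as follows. This is not the code from pingpong.c: the doubling step, the function names build_sizes and emit_matlab, and the choice of a loglog plot are all assumptions made for illustration. In an MPI program the times would typically come from MPI_Wtime() calls placed around the send and receive.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

#define MIN_MSG_BYTES ((size_t)1)
#define MAX_MSG_BYTES ((size_t)8 * 1024 * 1024)   /* 8 megabytes */

/* Fill sizes[] with 1, 2, 4, ... up to 8 MB and return the count written.
 * Doubling is an assumption; the real step size is not stated in the text. */
size_t build_sizes(size_t *sizes, size_t capacity)
{
    size_t n = 0;
    assert(sizes != NULL);
    for (size_t s = MIN_MSG_BYTES; s <= MAX_MSG_BYTES && n < capacity; s *= 2)
        sizes[n++] = s;
    return n;
}

/* Write the results as two Matlab vectors followed by a plot command
 * (hypothetical output layout, not the one used by the test program). */
void emit_matlab(FILE *out, const size_t *sizes, const double *secs, size_t n)
{
    fprintf(out, "sizes = [");
    for (size_t i = 0; i < n; i++)
        fprintf(out, " %zu", sizes[i]);
    fprintf(out, " ];\ntimes = [");
    for (size_t i = 0; i < n; i++)
        fprintf(out, " %g", secs[i]);
    fprintf(out, " ];\nloglog(sizes, times);\n");
}
```

A caller would fill an array with build_sizes, time one round trip per size, and pass both arrays to emit_matlab(stdout, ...).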
Reference Platforms
The issue at hand appears to be specific to Linux 2.2.x. Although
it could possibly be the fault of LAM, it only occurs on
systems running Linux 2.2.x. The test programs have been
run on several platforms other than Linux 2.2.x in order to show
this.
Solaris
In order to have data from a non-Linux platform to
compare the Linux data against, the pingpong.c program
was run on Solaris 2.6 (sparc-sun-solaris2.6) running
LAM 6.2b, and on Solaris 2.5.1
(sparc-sun-solaris2.5.1) running LAM 6.3-b2.
Solaris 2.6, LAM 6.2b
-lamd mode
The -lamd mode is the default mode that LAM
uses to run programs. Every LAM process
communicates to the local lamd, which takes
care of passing messages to other
processes. The graph shows a simple curve:
fairly constant time to send a single
message until a certain point (10^4 in this
case) is reached, at which point sending
time grows with message size at an expected
rate.
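The shape of this curve matches a simple two-term cost model: a fixed per-message latency plus the payload size divided by the link bandwidth. The sketch below is only an illustration of that model; the function names and any parameter values passed to them are assumptions, not measurements taken from the graphs.

```c
#include <assert.h>

/* Two-parameter cost model: fixed per-message latency plus the time
 * needed to push the payload through a link of the given bandwidth. */
double model_seconds(double bytes, double latency_s, double bytes_per_s)
{
    assert(bytes_per_s > 0.0);
    return latency_s + bytes / bytes_per_s;
}

/* Message size at which the bandwidth term equals the latency term.
 * Below this size the curve looks flat; above it, time grows with size. */
double knee_bytes(double latency_s, double bytes_per_s)
{
    return latency_s * bytes_per_s;
}
```

For example, with a hypothetical latency of 0.2 ms and a 100 Mbit/sec link (12.5e6 bytes/sec), knee_bytes gives about 2500 bytes. A lower bandwidth at the same latency moves the knee to smaller messages and steepens the rising part of the curve.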
-c2c mode
The -c2c mode is the client-to-client mode:
the local lamds are only responsible for
starting up the processes, after which the processes
communicate directly with each other,
bypassing the LAM daemon. This graph shows
again the same behavior as is seen with
-lamd, except the sending times are a little
lower (-c2c mode is generally faster).
Solaris 2.5.1, LAM 6.3-b2
-lamd mode
Basically the same graph as for Solaris 2.6,
except the 2.5.1 machines tested were on 10
Mbits/sec ethernet as opposed to 100 Mbits/sec
for the 2.6 boxes, so that we see slower
sending times.
-c2c mode
Basically the same graph as for Solaris 2.6,
except the 2.5.1 machines tested were on 10
Mbits/sec ethernet as opposed to 100 Mbits/sec
for the 2.6 boxes, so that we see slower
sending times.
Linux 2.0.x
The odd performance problems we've witnessed in LAM
only occur in Linux 2.2.x, not in Linux 2.0.x,
indicating a potential problem in the Linux 2.2.x
TCP/IP implementation. It is also possible that LAM
is doing something odd which manifests itself as a
performance problem only in Linux 2.2.x. This is
what we are trying to determine.
MPI Test Program
The next two graphs show the pingpong test
program running on Linux 2.0.36, using LAM 6.3-b2.
These tests were run on a dual-processor machine
using loopback networking.
-lamd mode
The -lamd mode is the default mode that LAM
uses to run programs. Every LAM process
communicates to the local lamd, which takes
care of passing messages to other
processes. The graph shows a simple curve:
fairly constant time to send a single
message until a certain point (10^4 in this
case) is reached, at which point sending
time grows with message size at an expected
rate.
-c2c mode
The -c2c mode is the client-to-client mode:
the local lamds are only responsible for
starting up the processes, after which the processes
communicate directly with each other,
bypassing the LAM daemon. This graph shows
again the same behavior as is seen with
-lamd, except the sending times are a little
lower (-c2c mode is generally faster).
The next two graphs show the pingpong test
program running on Linux 2.0.36, using LAM
6.3-b2. The test was run between two machines on
the same 10Mbit/sec ethernet segment.
-lamd mode
The -lamd mode is the default mode that LAM
uses to run programs. Every LAM process
communicates to the local lamd, which takes
care of passing messages to other
processes. The graph shows a simple curve:
fairly constant time to send a single
message until a certain point (10^4 in this
case) is reached, at which point sending
time grows with message size at an expected
rate.
-c2c mode
The -c2c mode is the client-to-client mode:
the local lamds are only responsible for
starting up the processes, after which the processes
communicate directly with each other,
bypassing the LAM daemon. This graph shows
again the same behavior as is seen with
-lamd, except the sending times are a little
lower (-c2c mode is generally faster).
TCP Test Program
tests still need running...
Linux 2.2 -- what is going on?
Linux 2.2.9 (and possibly lower)
Two versions of Linux 2.2.x were tested: 2.2.9 and
2.2.10. The graphs we see for 2.2.9, together with
reports we've seen elsewhere, suggest that earlier 2.2.x
versions behave similarly to 2.2.9.
MPI Test Program
Using our pingpong program we see a large
performance drop when using -c2c mode in Linux
2.2.x, x <= 9.
loopback, -lamd mode
This graph shows a run of the pingpong program
under Linux 2.2.9, using LAM 6.3-b2, using
loopback networking on a dual processor
machine. lamd mode is used, and the graph
shows what we would expect.
loopback, -c2c mode
This graph shows a run of the pingpong program
under Linux 2.2.9, using LAM 6.3-b2, using
loopback networking on a dual processor
machine. c2c mode is used, and the graph shows
a huge performance drop between messages of
size 64K and messages of size 128K.
10 Mbit/sec, -lamd mode
This graph shows a run of the pingpong program
under Linux 2.2.9, using LAM 6.3-b2, using two
machines on the same 10 Mbit/sec ethernet
segment. lamd mode is used, and the graph
shows what we would expect.
10 Mbit/sec, -c2c mode
This graph shows a run of the pingpong program
under Linux 2.2.9, using LAM 6.3-b2, using
two machines on the same 10 Mbit/sec ethernet
segment. c2c mode is used, and the graph shows
a performance drop between messages of
size 64K and messages of size 128K. It is not
as noticeable as the one in the graph above
that uses loopback networking, mainly because
there is more network overhead to mask the drop.
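One way to locate a drop like the one between 64K and 128K in the raw numbers, rather than by eye, is to compare the throughput (bytes per second) at each message size with the throughput at the previous size. The helper below is a sketch of that idea only; the function name, the ratio threshold, and the data layout are assumptions and are not part of the test programs.

```c
#include <assert.h>
#include <stddef.h>

/* Return the index of the first message size whose throughput is less
 * than `ratio` times the throughput at the previous size, or -1 if no
 * such size exists. sizes[] is in bytes, secs[] in seconds. */
long first_throughput_drop(const size_t *sizes, const double *secs,
                           size_t n, double ratio)
{
    assert(ratio > 0.0 && ratio < 1.0);
    for (size_t i = 1; i < n; i++) {
        double prev = (double)sizes[i - 1] / secs[i - 1];
        double cur = (double)sizes[i] / secs[i];
        if (cur < ratio * prev)
            return (long)i;
    }
    return -1;
}
```

With a ratio of 0.5, the function reports the first size at which throughput falls to less than half of what it was one step earlier.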
TCP Test Program
The TCP Test Program can be compiled to use either
regular read/write function calls, or the readv/writev
calls that LAM uses. Both experience an anomaly at 2K
bytes.
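The difference between the two build variants can be sketched as follows. This is not the source of the TCP Test Program: the macro name USE_IOVEC, the function names, and the header-plus-payload message layout are assumptions chosen to show how a compile-time switch between writev() and plain write() might look.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Write exactly len bytes, retrying on short writes. Returns 0 or -1. */
static int write_all(int fd, const char *buf, size_t len)
{
    while (len > 0) {
        ssize_t k = write(fd, buf, len);
        if (k < 0)
            return -1;
        buf += k;
        len -= (size_t)k;
    }
    return 0;
}

/* Send a header followed by a payload. With USE_IOVEC defined, both
 * buffers are handed to writev() in one call; otherwise each buffer is
 * sent with its own write() loop. Returns 0 on success, -1 on error. */
int send_msg(int fd, const char *hdr, size_t hlen,
             const char *body, size_t blen)
{
    assert(hdr != NULL && body != NULL);
#ifdef USE_IOVEC
    struct iovec iov[2];
    int idx = 0;
    iov[0].iov_base = (void *)hdr;
    iov[0].iov_len = hlen;
    iov[1].iov_base = (void *)body;
    iov[1].iov_len = blen;
    while (idx < 2) {
        ssize_t k = writev(fd, &iov[idx], 2 - idx);
        if (k < 0)
            return -1;
        /* Skip buffers that were written completely. */
        while (idx < 2 && (size_t)k >= iov[idx].iov_len) {
            k -= (ssize_t)iov[idx].iov_len;
            idx++;
        }
        /* Trim a buffer that was only partly written. */
        if (idx < 2) {
            iov[idx].iov_base = (char *)iov[idx].iov_base + k;
            iov[idx].iov_len -= (size_t)k;
        }
    }
    return 0;
#else
    if (write_all(fd, hdr, hlen) < 0)
        return -1;
    return write_all(fd, body, blen);
#endif
}
```

Compiling with -DUSE_IOVEC selects the writev() path. The bytes that reach the peer are the same in both variants; what differs is how many system calls are made and how the kernel sees the data arrive.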
Using readv/writev
Using readv/writev there are two spikes evident
at message sizes of 2K and 16K bytes. The rest
of the graph follows a predictable curve, but it
takes as long to send a 2K or 16K byte message as it
does to send a 32K byte message.
Using read/write
Using regular read and write, there are still
spikes at messages of size 2K and 16K bytes, but
they are not nearly as large as the spikes when using
readv/writev.
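A spike of this kind means that a smaller message took at least as long as the next larger one. Such points can be picked out of the timing data with a short scan. The helper below is a sketch under that definition of "spike"; its name and interface are assumptions, not part of the test programs.

```c
#include <assert.h>
#include <stddef.h>

/* Store in out[] the indices i for which the time at size i is at least
 * the time measured for the next larger size. Returns how many indices
 * were stored (at most capacity). */
size_t find_spikes(const double *secs, size_t n, size_t *out, size_t capacity)
{
    size_t found = 0;
    assert(secs != NULL && out != NULL);
    for (size_t i = 0; i + 1 < n && found < capacity; i++) {
        if (secs[i] >= secs[i + 1])
            out[found++] = i;
    }
    return found;
}
```

On a smooth curve where time grows with message size, the function finds nothing.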
Linux 2.2.10
The 2.2.10 release of Linux apparently included
several fixes and changes to the networking code,
including a fix to the TCP_NODELAY flag. This seems to
have fixed some of the bad behavior we witnessed under
2.2.9 and earlier.
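For readers unfamiliar with the option, TCP_NODELAY disables Nagle's algorithm on a TCP socket, so that small writes are sent immediately instead of being held back and coalesced. An application typically sets it as shown below. The function name set_nodelay is hypothetical, and this page does not state whether or where the test programs set the option.

```c
#include <assert.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/* Disable Nagle's algorithm on a TCP socket.
 * Returns 0 on success, -1 on error (errno is set by setsockopt). */
int set_nodelay(int fd)
{
    int one = 1;
    assert(fd >= 0);
    return setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &one, sizeof one);
}
```

The current setting can be read back with getsockopt() using the same level and option name.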
MPI Test Program
Using our pingpong program we see
that the large performance drop seems to have
disappeared in Linux 2.2.10. However, we see that
lamd and c2c modes have nearly identical performance.
loopback, -lamd mode
This graph shows a run of the pingpong
program under Linux 2.2.10, using LAM 6.3-b2,
using loopback networking on a dual processor
machine. lamd mode is used, and the graph
shows what we would expect.
loopback, -c2c mode
This graph shows a run of the pingpong
program under Linux 2.2.10, using LAM 6.3-b2,
using loopback networking on a dual processor
machine. c2c mode is used, and the
graph shows results very similar to the above
lamd graph, except for very small message
sizes where c2c mode outperforms lamd mode by
a small margin.
10 Mbit/sec, -lamd mode
This graph shows a run of the pingpong
program under Linux 2.2.10, using LAM 6.3-b2,
using two machines on the same 10 Mbit/sec
ethernet segment. lamd mode is used, and the
graph shows what we would expect.
10 Mbit/sec, -c2c mode
This graph shows a run of the pingpong
program under Linux 2.2.10, using LAM 6.3-b2,
using two machines on the same 10 Mbit/sec
ethernet segment. c2c mode is used, and the
graph shows results very similar to the above
lamd graph, except for very small message
sizes where c2c mode outperforms lamd mode by
a small margin.
TCP Test Program
The TCP Test Program can be compiled to use either
regular read/write function calls, or the readv/writev
calls that LAM uses. Both experience an anomaly at 2K
bytes.
Using readv/writev
Using readv/writev a huge spike is evident at a
message size of 2K, with a minor spike at
message size of 16K. The rest of the graph follows
a predictable curve, but it takes as long to send
a 2K byte message as it does to send a 256K byte
message.
Using read/write
Using regular read and write, there are still spikes
at messages of size 2K and 16K bytes, but they
are not nearly as large as the spike when
using readv/writev.
Related Links
These are mainly links to posts to mailing lists about issues
that might be related to this.
The 2.2 series of the Linux kernel apparently messed up part of
its TCP/IP implementation. This caused performance problems in
the LAM/MPI software. Linux 2.2.10 appears to have fixed this
problem, since LAM performance in 2.2.10 is comparable to LAM
performance under 2.0.36. We have not yet found a way to
explain the weird spikes we see in the TCP test program, even
in 2.2.10, but since LAM behaves properly under 2.2.10, the
LAM Team will probably not try to fix LAM's behaviour on 2.2.x
versions of Linux (x < 10).
The problems experienced by LAM under Linux 2.2.x have also
been experienced in other pieces of software, including MPICH,
another free MPI implementation.
If you have further info you can enlighten us with, or can
suggest ways in which we can further investigate this phenomenon,
please contact us.
Hosted on www.lam-mpi.org