On Aug 30, 2004, at 11:32 AM, Peter Kjellstroem wrote:
> I'm trying out lam-7.1b18 on my infiniband test cluster. Generally LAM
> runs fine on this machine, but the latency for small messages is not
> comparable to the other MPI implementation on the cluster (ScaMPI).
>
> observation1: on tcp LAM and ScaMPI are very close, and both push my
> Gig-E network to its limits.
Excellent.
> observation2: LAM on ib wins a small victory on bandwidth
Even more excellent. :-)
> observation3: while tcp->ib divides ScaMPI latency by three it only
> reduces LAM latency by 20%
Yes, this is also expected. See below.
> Does anyone know if this is expected behaviour for this beta? Here are
> the numbers (and yes, the bandwidth is veeery low for ib (chipset issue)):
From the release notes in the User's Guide:
\subsection{Infiniband \kind{rpi} Module}
-----
The Infiniband (\rpi{ib}) module implementation in LAM/MPI is based on
the IB send/receive protocol for tiny messages and the RDMA protocol
for long messages. Future optimizations include allowing tiny messages
to use RDMA (for potential latency improvements for tiny messages).
-----
For this release, we decided that small message performance could be
sub-optimal in order to get the product out the door. Future releases
will include many kinds of performance enhancements.
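As a side note for testing: you can select which RPI module mpirun uses
with an SSI run-time parameter (I'm writing this from memory -- check
the mpirun(1) man page / User's Guide for the exact syntax; ./mpibench
below just stands in for your benchmark binary), something like:
-----
# force the Infiniband RPI
mpirun -ssi rpi ib -np 2 ./mpibench

# same binary over TCP, for comparison
mpirun -ssi rpi tcp -np 2 ./mpibench
-----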
Sidenote: We literally just found an issue with the gm and ib RPIs: if
an application calls MPI_CANCEL on a receive request but never TESTs or
WAITs on it, that request can still be matched by an incoming message
(which is actually legal by the MPI standard). The issue is in how we
handle the accounting and internal data structures. We just "fixed" it
(by no longer allowing such requests to be matched -- but you obviously
still need to TEST or WAIT to release the resources) and are putting
the changes through their paces.
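To make the pattern concrete, here's a minimal sketch of the case I'm
describing (a made-up example, not code from any real application):
-----
/* Sketch: cancel a posted receive, then complete it so the library can
   release its resources.  Hypothetical example for illustration only. */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int buf = 0, cancelled = 0;
    MPI_Request req;
    MPI_Status status;

    MPI_Init(&argc, &argv);

    /* Post a non-blocking receive that will never be satisfied... */
    MPI_Irecv(&buf, 1, MPI_INT, MPI_ANY_SOURCE, 99, MPI_COMM_WORLD, &req);

    /* ...and cancel it.  Cancelling by itself is not enough: */
    MPI_Cancel(&req);

    /* You still have to TEST or WAIT on the request to release the
       resources; MPI_Test_cancelled says whether the cancel succeeded. */
    MPI_Wait(&req, &status);
    MPI_Test_cancelled(&status, &cancelled);
    printf("receive cancelled: %d\n", cancelled);

    MPI_Finalize();
    return 0;
}
-----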
> --- lam, ib ---
>
> [cap_at_n1 mpi]$ /usr/local/lam-7.1.b18-intel/bin/mpirun -np 2
> /home/cap/mpi/mpibench.lam71b18ib_intel
> Using Zero pattern.
> starting lat-bw test.
> Latency: 17.5 µsec Bandwidth: 0.0 bytes/s (0 x 10000)
> Latency: 17.6 µsec Bandwidth: 57.0 kbytes/s (1 x 10000)
> Latency: 17.5 µsec Bandwidth: 114.1 kbytes/s (2 x 10000)
> Latency: 17.5 µsec Bandwidth: 228.0 kbytes/s (4 x 10000)
> Latency: 17.6 µsec Bandwidth: 455.7 kbytes/s (8 x 10000)
> Latency: 17.6 µsec Bandwidth: 909.0 kbytes/s (16 x 10000)
> Latency: 17.7 µsec Bandwidth: 1.8 Mbytes/s (32 x 10000)
> Latency: 17.8 µsec Bandwidth: 3.6 Mbytes/s (64 x 10000)
> Latency: 18.4 µsec Bandwidth: 6.9 Mbytes/s (128 x 10000)
> Latency: 19.8 µsec Bandwidth: 13.0 Mbytes/s (256 x 10000)
> Latency: 22.4 µsec Bandwidth: 22.8 Mbytes/s (512 x 10000)
> Latency: 27.5 µsec Bandwidth: 37.2 Mbytes/s (1024 x 10000)
> Latency: 64.3 µsec Bandwidth: 31.8 Mbytes/s (2048 x 10000)
> Latency: 75.1 µsec Bandwidth: 54.5 Mbytes/s (4096 x 10000)
> Latency: 94.6 µsec Bandwidth: 86.6 Mbytes/s (8192 x 10000)
> Latency: 125.5 µsec Bandwidth: 130.5 Mbytes/s (16384 x 6400)
> Latency: 212.7 µsec Bandwidth: 154.0 Mbytes/s (32768 x 3200)
> Latency: 362.1 µsec Bandwidth: 181.0 Mbytes/s (65536 x 1600)
> Latency: 686.4 µsec Bandwidth: 191.0 Mbytes/s (131072 x 800)
> Latency: 1.3 msec Bandwidth: 199.9 Mbytes/s (262144 x 400)
> Latency: 2.6 msec Bandwidth: 203.2 Mbytes/s (524288 x 200)
> Latency: 5.1 msec Bandwidth: 205.3 Mbytes/s (1048576 x 100)
> --- scampi, ib ---
>
> [cap_at_n1 mpi]$ /opt/scali/bin/mpimon mpibench.scampi -- n1 n2
> starting lat-bw test.
> Latency: 6.8 µsec Bandwidth: 0.0 bytes/s (0 x 10000)
> Latency: 6.8 µsec Bandwidth: 146.1 kbytes/s (1 x 10000)
> Latency: 6.8 µsec Bandwidth: 292.3 kbytes/s (2 x 10000)
> Latency: 6.8 µsec Bandwidth: 587.1 kbytes/s (4 x 10000)
> Latency: 7.0 µsec Bandwidth: 1.2 Mbytes/s (8 x 10000)
> Latency: 7.1 µsec Bandwidth: 2.3 Mbytes/s (16 x 10000)
> Latency: 7.3 µsec Bandwidth: 4.4 Mbytes/s (32 x 10000)
> Latency: 7.7 µsec Bandwidth: 8.3 Mbytes/s (64 x 10000)
> Latency: 8.5 µsec Bandwidth: 15.0 Mbytes/s (128 x 10000)
> Latency: 10.3 µsec Bandwidth: 24.9 Mbytes/s (256 x 10000)
> Latency: 13.9 µsec Bandwidth: 36.7 Mbytes/s (512 x 10000)
> Latency: 19.6 µsec Bandwidth: 52.3 Mbytes/s (1024 x 10000)
> Latency: 27.8 µsec Bandwidth: 73.7 Mbytes/s (2048 x 10000)
> Latency: 39.3 µsec Bandwidth: 104.2 Mbytes/s (4096 x 10000)
> Latency: 64.0 µsec Bandwidth: 127.9 Mbytes/s (8192 x 10000)
> Latency: 110.1 µsec Bandwidth: 148.8 Mbytes/s (16384 x 6400)
> Latency: 202.4 µsec Bandwidth: 161.9 Mbytes/s (32768 x 3200)
> Latency: 378.4 µsec Bandwidth: 173.2 Mbytes/s (65536 x 1600)
> Latency: 720.8 µsec Bandwidth: 181.8 Mbytes/s (131072 x 800)
> Latency: 1.4 msec Bandwidth: 186.4 Mbytes/s (262144 x 400)
> Latency: 2.8 msec Bandwidth: 188.8 Mbytes/s (524288 x 200)
> Latency: 5.5 msec Bandwidth: 189.9 Mbytes/s (1048576 x 100)
I'm impressed that our bandwidth at the large-message end is ~15 MB/s
better than ScaMPI's. Woo hoo!
Remember that bandwidth numbers aren't hugely relevant for small
messages -- all metrics are important, but my own $0.02 is that
dividing such tiny byte counts by the transfer time doesn't produce a
very meaningful number. For small messages, latency is what matters --
and we definitely made a conscious choice to use the IB send/receive
model for small messages in this release. Although RDMA for small
messages is faster, it raises a number of implementation issues (e.g.,
scalability) that we intend to look into for future releases.
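(To make that concrete: for a 1-byte message the reported "bandwidth"
is essentially just the reciprocal of the latency -- 1 byte / 17.6 µsec
is about 57 kbytes/s, which is exactly the number in the LAM table
above. For tiny messages the bandwidth column is really the latency
column in disguise.)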
All that being said, sure, 6 µs vs. 17 µs for 0-byte latency is a big
difference -- but depending on your application, that may or may not
matter. A wide variety of applications really won't care about the
difference. That won't stop me from saying all kinds of Good Things
when we get our latency down ;-), but I just want to emphasize that the
higher small-message latency we accepted for this release is actually
not that bad for most applications.
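If you want to gauge how much the small-message latency actually
matters for your codes, a simple ping-pong like the sketch below (a
generic example, not the mpibench program Peter used) reports the same
half-round-trip number as the tables above:
-----
/* Minimal 1-byte ping-pong latency sketch (generic example).
   Run with 2 processes, e.g.: mpirun -np 2 ./pingpong */
#include <stdio.h>
#include <mpi.h>

#define REPS 10000

int main(int argc, char **argv)
{
    int rank, i;
    char byte = 0;
    double t0, t1;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Barrier(MPI_COMM_WORLD);
    t0 = MPI_Wtime();
    for (i = 0; i < REPS; ++i) {
        if (rank == 0) {
            MPI_Send(&byte, 1, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(&byte, 1, MPI_CHAR, 1, 0, MPI_COMM_WORLD, &status);
        } else if (rank == 1) {
            MPI_Recv(&byte, 1, MPI_CHAR, 0, 0, MPI_COMM_WORLD, &status);
            MPI_Send(&byte, 1, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    t1 = MPI_Wtime();

    /* Latency = half the average round-trip time. */
    if (rank == 0)
        printf("latency: %.1f usec\n", (t1 - t0) * 1.0e6 / (2.0 * REPS));

    MPI_Finalize();
    return 0;
}
-----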
Hope that helps.
--
{+} Jeff Squyres
{+} jsquyres_at_[hidden]
{+} http://www.lam-mpi.org/