LAM/MPI General User's Mailing List Archives

From: Bogdan Costescu (Bogdan.Costescu_at_[hidden])
Date: 2005-08-17 10:38:26


On Wed, 17 Aug 2005, Pierre Valiron wrote:

> As far as I can understand, the advantage of LAM for small buffers is
> its large concurrency which helps to reduce the latency.

I'm normally very interested in reducing latency, but I'll play
devil's advocate for once :-)

I'll hazard a guess that if you increase the amount of polling
(generally speaking) you'll get an even lower latency, but at what
cost? How much are you willing to "pay" for a lower latency? What if
the MPI application is not well balanced and, on a node with 4 ranks,
one rank finishes significantly earlier than the others and enters
the all-to-all routine? It will then start polling, additionally
slowing down the other ranks that were already late.
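
To make the trade-off concrete, here is a minimal sketch in C (not
LAM's actual progress engine) of the two ways a rank can wait for a
message on a TCP socket:

    /* Sketch only, not LAM's progress engine: two ways for a rank
     * to wait for an incoming message on TCP socket 'fd'. */
    #include <sys/types.h>
    #include <sys/socket.h>
    #include <sys/select.h>
    #include <errno.h>

    /* Busy-polling: lowest latency, but burns a full CPU while
     * waiting, stealing cycles from ranks still computing. */
    static void wait_polling(int fd)
    {
        char c;
        ssize_t n;
        do {
            n = recv(fd, &c, 1, MSG_PEEK | MSG_DONTWAIT);
        } while (n < 0 && (errno == EAGAIN || errno == EWOULDBLOCK));
    }

    /* Blocking: the kernel puts the process to sleep until data
     * arrives; somewhat higher latency, but no CPU wasted. */
    static void wait_blocking(int fd)
    {
        fd_set rfds;
        FD_ZERO(&rfds);
        FD_SET(fd, &rfds);
        select(fd + 1, &rfds, NULL, NULL, NULL);
    }

The polling loop sees the message a bit earlier, but on a node where
other ranks are still computing those cycles come straight out of
their time slices.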

> However one should also consider the number of procs, as the total
> number of messages queued on a given proc should not overflow the
system buffers. This is an additional constraint on the level of
> concurrency permitted on a given system.

This kind of balance is present in the kernel as well. Using large
buffers, keeping buffer segmentation to a minimum, pinning the
interrupt line to a CPU (on an SMP machine) and sending with low CPU
usage (using TCP Segmentation Offload, for example on Broadcom 57xx
and Intel E1000 cards) will give the best single-stream TCP
performance. But the same conditions might not produce the best
results for multiple TCP streams, which is basically what all-to-all
currently amounts to in LAM/MPI. In other words, on top of a kernel
tuned for single-stream TCP you can do whatever you want in the MPI
library and still not get good multiple-stream performance. So
ideally the tuning should be done on the system as a whole, not only
on its individual parts...
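
As an illustration of the user-space half of such tuning, a hedged
sketch of requesting larger per-socket buffers; the 256 KB figure is
only an example, and the kernel clamps the request to the
net.core.rmem_max / wmem_max limits:

    /* Sketch only: ask the kernel for larger per-socket buffers.
     * 256 KB is an arbitrary example value, silently clamped by
     * the kernel to net.core.rmem_max / wmem_max. */
    #include <sys/socket.h>

    static int enlarge_buffers(int sock)
    {
        int sz = 256 * 1024;   /* example value, tune per system */
        if (setsockopt(sock, SOL_SOCKET, SO_SNDBUF, &sz, sizeof(sz)) < 0)
            return -1;
        return setsockopt(sock, SOL_SOCKET, SO_RCVBUF, &sz, sizeof(sz));
    }

The offload side (TSO on or off) lives in the driver and is typically
toggled with ethtool, so it is outside what an MPI library can
influence at run time.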

> Beyond 64 K two concurrent sends and recvs are sufficient to saturate
> the full-duplex bandwidth.

...iff the other sides are also sending and receiving at full speed.
Consider either the case mentioned above, where one rank polls or
blocks, or the case where transmission is possible but slower. By
limiting yourself to only these 2 concurrent transfers, you
effectively increase the time needed for all the data to reach the
other ranks, as the maximum transfer speed of the system is never
reached. Such a situation can be prevented, or its impact lessened,
by using a barrier to make sure that all ranks start transmitting at
about the same time; but an unconditional barrier at the beginning of
an all-to-all routine might reduce performance in cases where the
contention is not so high.
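
To illustrate what such a limit looks like in code, here is a sketch
of a windowed pairwise exchange; WINDOW and the optional barrier are
illustration parameters of mine, not LAM/MPI's actual all-to-all
algorithm:

    /* Sketch of a windowed all-to-all: at most WINDOW send/recv
     * pairs are in flight at a time.  WINDOW and the barrier flag
     * are illustration parameters, not LAM/MPI's algorithm. */
    #include <mpi.h>
    #include <string.h>

    #define WINDOW 2

    void windowed_alltoall(char *sbuf, char *rbuf, int bytes,
                           MPI_Comm comm, int do_barrier)
    {
        int rank, size;
        MPI_Comm_rank(comm, &rank);
        MPI_Comm_size(comm, &size);

        if (do_barrier)   /* pays off only when ranks arrive skewed */
            MPI_Barrier(comm);

        /* local copy of my own block */
        memcpy(rbuf + rank * bytes, sbuf + rank * bytes, bytes);

        for (int base = 1; base < size; base += WINDOW) {
            MPI_Request req[2 * WINDOW];
            int n = 0;
            for (int i = base; i < base + WINDOW && i < size; i++) {
                int to   = (rank + i) % size;        /* send target  */
                int from = (rank - i + size) % size; /* recv source  */
                MPI_Irecv(rbuf + from * bytes, bytes, MPI_BYTE, from,
                          0, comm, &req[n++]);
                MPI_Isend(sbuf + to * bytes, bytes, MPI_BYTE, to,
                          0, comm, &req[n++]);
            }
            MPI_Waitall(n, req, MPI_STATUSES_IGNORE);
        }
    }

With WINDOW equal to size - 1 you get full concurrency; with 2 you
get the behaviour quoted above, and a single slow peer stalls the
whole window.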

> This full concurrency is even more dangerous in the case of
> multiprocessor systems.

I agree, but from a rather different point of view. Sharing a
network interface between several processes is bad for performance in
general with multiple TCP streams, as the processor caches will often
be thrashed and common kernel variables (like locks) will generate
contention, especially when receiving.
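
For the SMP case, one mitigation is to pin each local rank to its
own CPU so that a rank's socket processing stays warm in one
processor's caches. A sketch using Linux's sched_setaffinity(); the
local_rank numbering is my assumption:

    /* Sketch: pin the calling process to one CPU.  'local_rank'
     * is assumed to number the ranks on this node from 0 upward. */
    #define _GNU_SOURCE
    #include <sched.h>
    #include <unistd.h>

    static int pin_to_cpu(int local_rank)
    {
        long ncpus = sysconf(_SC_NPROCESSORS_ONLN);
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET((int)(local_rank % ncpus), &set);
        return sched_setaffinity(0, sizeof(set), &set);
    }

The kernel-side counterpart, mentioned earlier, is pinning the NIC's
interrupt itself via /proc/irq/<N>/smp_affinity.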

> It is thus unrealistic to seek for a unique optimal algorithm.

That's probably true for most algorithms, but I've been happy with the
LAM/MPI developers' choice until now :-)

-- 
Bogdan Costescu
IWR - Interdisziplinaeres Zentrum fuer Wissenschaftliches Rechnen
Universitaet Heidelberg, INF 368, D-69120 Heidelberg, GERMANY
Telephone: +49 6221 54 8869, Telefax: +49 6221 54 8868
E-mail: Bogdan.Costescu_at_[hidden]