LAM/MPI General User's Mailing List Archives

From: Carsten Kutzner (ckutzne_at_[hidden])
Date: 2005-11-03 08:15:31


Hello Bogdan,

thank you for your prompt reply!

On Wed, 2 Nov 2005, Bogdan Costescu wrote:

> On Wed, 2 Nov 2005, Carsten Kutzner wrote:
>
> > In our case the congestion happened within the switch.
>
> Can you explain how you came to this conclusion (with as many details
> as possible) ?

I looked a bit deeper into this, and it turned out that the switch is not
the only place where packets are dropped, so my conclusion above has to be
corrected.

As a test, I run an MPI_Alltoall with a message size of 32768 bytes,
100 times in a row (barrier-separated), and measure the time each
individual call takes; on 8 CPUs, for example, a call takes about 0.006
seconds. If an all-to-all takes significantly longer than the others
(typically 0.25 sec), that is a strong indication that a packet has been
dropped somewhere. The HP 2848 switch has per-port counters that show how
many packets have been dropped at each port, and on the nodes dropped
packets should show up in the output of ifconfig. By comparing the
respective values before and after the 100 all-to-alls, I can see where
packets have been dropped.
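
In essence, the timing loop looks like this (a simplified sketch, not the
actual benchmark code; whether the 32768 bytes count per destination is an
assumption here):

#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

#define NREPS   100
#define MSGSIZE 32768               /* bytes per destination (assumed) */

int main(int argc, char **argv)
{
    int rank, nprocs, i;
    char *sendbuf, *recvbuf;
    double t0, t1;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    sendbuf = malloc((size_t)nprocs * MSGSIZE);
    recvbuf = malloc((size_t)nprocs * MSGSIZE);

    for (i = 0; i < NREPS; i++) {
        MPI_Barrier(MPI_COMM_WORLD);    /* separate the calls */
        t0 = MPI_Wtime();
        MPI_Alltoall(sendbuf, MSGSIZE, MPI_BYTE,
                     recvbuf, MSGSIZE, MPI_BYTE, MPI_COMM_WORLD);
        t1 = MPI_Wtime();
        if (rank == 0)                  /* an outlier here (~0.25 s)
                                           points to a dropped packet */
            printf("alltoall %3d: %.4f s\n", i, t1 - t0);
    }

    free(sendbuf);
    free(recvbuf);
    MPI_Finalize();
    return 0;
}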

Drops with flow control:

CPUs  drops at switch  drops at nodes  time of one alltoall (min-max)
   4  no               no              0.0018-0.0055 sec
   8  no               no              0.0054-0.0093 sec
  16  no               yes             0.0130-0.2564 sec
  32  yes              yes             0.0273-0.4160 sec

Drops without flow control:

CPUs  drops at switch  drops at nodes  time of one alltoall (min-max)
   4  no               no              0.0018-0.0022 sec
   8  no               no              0.0053-0.2891 sec
  16  no               no              0.2597-0.5445 sec
  32  no               no              0.2987-0.7845 sec

This result actually makes me wonder which drops are counted here: without
flow control the timings clearly indicate drops for 8+ CPUs, yet they do
not show up in any of the counters.

> > The performance of the original LAM MPI_Alltoall however remains a
> > bit better for small message sizes. This is similar to what Pierre
> > found for his modified routines.
>
> Then you can try to use the original for small messages and the new
> one for large messages, with some threshold value to switch from one
> to the other.
>
> Due to the common use of the same switch by you and Pierre, maybe it's
> possible to find some "optimized" conditions for this particular piece
> of hardware...

I think I already have a clear picture of where this threshold value lies
for different numbers of CPUs involved in the all-to-all. Unfortunately it
depends not only on the number of CPUs, but also on how many CPUs there
are on each node ... which makes it tricky if you do not want to optimize
the all-to-all for one very specific cluster. Wouldn't an all-to-all that
is free of congestion under all circumstances be nicer, even if it is
slightly slower for small messages?
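
Just to sketch the idea (this is only an illustration, not our actual
routines; the threshold value and function names are placeholders): one
can order the communication so that in each step every rank has exactly
one partner, and fall back to the built-in MPI_Alltoall below some
message-size threshold:

#include <mpi.h>

#define ALLTOALL_THRESHOLD 4096     /* bytes per destination; placeholder */

/* Pairwise-ordered all-to-all: in step s, rank r sends to (r+s) mod p
   and receives from (r-s) mod p, so every rank has exactly one
   communication partner per step.  This avoids the many-to-one traffic
   bursts that can overrun the switch buffers. */
static int ordered_alltoall(char *sbuf, char *rbuf, int bytes,
                            MPI_Comm comm)
{
    int rank, p, s;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &p);
    for (s = 0; s < p; s++) {
        int dst = (rank + s) % p;
        int src = (rank - s + p) % p;
        MPI_Sendrecv(sbuf + (size_t)dst * bytes, bytes, MPI_BYTE, dst, 0,
                     rbuf + (size_t)src * bytes, bytes, MPI_BYTE, src, 0,
                     comm, MPI_STATUS_IGNORE);
    }
    return MPI_SUCCESS;
}

/* Dispatch: built-in algorithm for small messages, ordered variant for
   large ones. */
int alltoall_dispatch(char *sbuf, char *rbuf, int bytes, MPI_Comm comm)
{
    if (bytes < ALLTOALL_THRESHOLD)
        return MPI_Alltoall(sbuf, bytes, MPI_BYTE,
                            rbuf, bytes, MPI_BYTE, comm);
    return ordered_alltoall(sbuf, rbuf, bytes, comm);
}

A real implementation would also have to handle arbitrary datatypes, but
the ordering of the exchanges is the essential point.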

Best regards,
   Carsten

---------------------------------------------------
Dr. Carsten Kutzner
Max Planck Institute for Biophysical Chemistry
Theoretical and Computational Biophysics Department
Am Fassberg 11
37077 Goettingen, Germany
Tel. +49-551-2012313, Fax: +49-551-2012302
eMail ckutzne_at_[hidden]
http://www.gwdg.de/~ckutzne