LAM/MPI General User's Mailing List Archives

From: Pierre Valiron (Pierre.Valiron_at_[hidden])
Date: 2005-08-17 08:50:41


Jeff Squyres wrote:

>>3) developing a preliminary optimized alltoall code
>>
>>In order to limit the concurrency per interface, the buffers should be
>>exchanged in an orderly fashion, with a single buffer being read and
>>written at a time through a given interface.
>>
>>The construction of such an orderly concurrency schedule is not
>>obvious. Ralf Wildenhues (U. Bonn, Germany) suggested that I use his
>>own reimplementation of a timetable algorithm for overlapping the
>>all-to-all exchanges, similar to the one in fftw but much shorter
>>(see fftw-2.1.5/mpi/sched.c for the original algorithm, and
>>J. A. M. Schreuder, "Constructing Timetables for Sport Competitions",
>>Mathematical Programming Study 13, pp. 58-67 (1980), for some theory).
>>
>>
>>In my implementation, each processor initiates the exchange by calling
>>MPI_Isend and MPI_Irecv, then waits for completion by calling
>>MPI_Waitall on both Isend and Irecv.
>>
>>
>
>Note that you might get slightly better performance if you use blocking
>sends/receives -- i.e., the lower-ranked process does an MPI_Send
>followed by an MPI_Recv, and the higher-ranking process does an
>MPI_Recv followed by an MPI_Send. This will definitely be true for
>short messages (because short messages are sent eagerly), but the
>optimization may lose out as messages get larger, and the dual
>bandwidth effect of simultaneous sends/receives may be a bigger win.
>
>
>
Thanks to Ralf for his comment concerning the scheduling with an odd
number of procs. I checked that everything still works fine in this
case, with very little performance degradation.
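
For reference, the kind of pairing such a timetable produces can be
sketched as follows. This is only an illustration of the classic
symmetric construction, using the same convention as the loop below
(liste(k) holds the 1-based partner for round k); Ralf's
reimplementation and fftw's sched.c may differ in detail, and this
simple form is not the most compact schedule when nprocs is even.

c Illustration only: a symmetric pairing in which every pair of ranks
c meets in exactly one round and each rank is paired with itself
c (copy to self) in exactly one round.
      subroutine make_schedule(me, nprocs, liste)
      implicit none
      integer me, nprocs, liste(nprocs)
      integer k, partner
      do k = 1, nprocs
c        round k-1: rank me is paired with mod(k-1-me, nprocs)
         partner = mod(k - 1 - me + nprocs, nprocs)
         liste(k) = partner + 1
      enddo
      end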

I also tried the above suggestion from Jeff. It actually *degrades* the
latency from 3 ms to about 5 ms for 22 procs. I can restore the 3 ms
latency by replacing the recvs with MPI_Irecvs.
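
For a single pairwise exchange, the ordering Jeff describes looks
roughly like the sketch below; peer, sbuf, rbuf and n are generic names
here, not taken from my code. Replacing the MPI_Recv with an MPI_Irecv
followed by a wait is what restores the 3 ms latency.

c Sketch of the blocking ordering suggested by Jeff: the lower rank
c sends first, the higher rank receives first.
      subroutine pairwise_exchange(me, peer, sbuf, rbuf, n, comm, err)
      implicit none
      include 'mpif.h'
      integer me, peer, n, comm, err
      integer stat(MPI_STATUS_SIZE)
      double precision sbuf(n), rbuf(n)
      if (me .lt. peer) then
         call MPI_Send(sbuf, n, MPI_DOUBLE_PRECISION, peer, 200,
     &        comm, err)
         call MPI_Recv(rbuf, n, MPI_DOUBLE_PRECISION, peer, 200,
     &        comm, stat, err)
      else
         call MPI_Recv(rbuf, n, MPI_DOUBLE_PRECISION, peer, 200,
     &        comm, stat, err)
         call MPI_Send(sbuf, n, MPI_DOUBLE_PRECISION, peer, 200,
     &        comm, err)
      endif
      end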

I then tried a more involved algorithm which increases the
concurrency. Instead of completing one exchange at a time, the code
posts one extra transfer and waits for the previous one instead of the
current one:

c guard in case the copy to self falls in the first round
         reqold = MPI_REQUEST_NULL
         do k=1,nprocs
c skip copy to self
            if (liste(k)-1.eq.me) cycle
c buffer (me, k-1) to be exchanged with (k-1, me)
            if (me.lt.liste(k)-1) then
               call MPI_Irecv(rbuf(1,liste(k)), n, MPI_DOUBLE_PRECISION,
     &              liste(k)-1, 200, comm, req, err)
               call MPI_Send(sbuf(1,liste(k)), n, MPI_DOUBLE_PRECISION,
     &              liste(k)-1, 200, comm, err)
            else
               call MPI_Send(sbuf(1,liste(k)), n, MPI_DOUBLE_PRECISION,
     &              liste(k)-1, 200, comm, err)
               call MPI_Irecv(rbuf(1,liste(k)), n, MPI_DOUBLE_PRECISION,
     &              liste(k)-1, 200, comm, req, err)
            endif
c wait for the transfer posted in the previous round, not the current
c one, so that one extra exchange is always in flight
            if (k.gt.1) call MPI_Wait(reqold, stat, err)
            reqold=req
         enddo
c drain the last outstanding request
         call MPI_Wait(req, stat, err)

This trick reduces the latency from 3 ms to 2 ms for 22 procs.

The bandwidth also improves for larger buffers, reaching up to 180 MB/s
for 32 KB to 64 KB buffers.

It is then easy to keep these good figures for larger buffers by
performing the alltoall in smaller chunks, always keeping the chunk
size within the "small message" limit, as sketched below.
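
A rough sketch of this chunking idea follows. The chunk size and the
helper exchange_all are hypothetical names: exchange_all stands for the
pipelined loop shown above, applied to a sub-range of every buffer
(with leading dimension n).

c Illustration of performing the alltoall in chunks that stay within
c the "small message" limit; exchange_all is a hypothetical helper
c wrapping the pipelined loop above for a sub-range of each buffer.
      integer off, len, chunk
c 8192 double precision words = 64 KB per chunk
      parameter (chunk = 8192)
      do off = 1, n, chunk
         len = min(chunk, n - off + 1)
         call exchange_all(sbuf(off,1), rbuf(off,1), len, n,
     &        liste, nprocs, me, comm)
      enddo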

The resulting typical performance is the following:

 NPROCS 22 ALGO 4
buf_size (bytes), iterations, iter_time (s), rate (MB/s), rate/node (MB/s)
        8 100 0.002044 3.448 0.157
       16 100 0.002001 7.045 0.320
       32 100 0.002001 14.092 0.641
       64 100 0.002067 27.288 1.240
      128 100 0.002073 54.404 2.473
      256 100 0.002199 102.569 4.662
      512 100 0.002159 208.926 9.497
     1024 10 0.002451 368.130 16.733
     2048 10 0.002616 689.739 31.352
     4096 10 0.002976 1212.803 55.127
     8192 10 0.003690 1956.185 88.917
    16384 10 0.004888 2953.944 134.270
    32768 10 0.007320 3944.660 179.303
    65536 10 0.016330 3536.435 160.747
   131072 10 0.030389 3800.712 172.760
   262144 10 0.058120 3974.562 180.662
   524288 10 0.114327 4041.051 183.684
  1048576 2 0.236942 3899.690 177.259
  2097152 2 0.469493 3936.162 178.916
  4194304 2 0.916846 4031.211 183.237
  8388608 2 1.808815 4086.653 185.757

These figures seem excellent for buffers beyond 4 KB, and remain similar
for an odd number of processors.

However, for smaller buffers, plain LAM remains better by 0.8 - 1 ms,
with a typical latency of 1.2 ms up to 1 KB and 2 ms for 4 KB.

>I'd try the suggestions that I outlined above, and if the original LAM
>algorithm still wins, then let's put an "if" statement around it that
>switches off on the message size. We can certainly switch algorithms
>based on data sizes -- I think it's a well-known fact that one
>algorithm definitely does not fit all data sizes.
>
>
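
Concretely, such a switch could be as simple as the sketch below; the
threshold small_limit and the two routine names are purely
illustrative, not existing LAM internals.

c Hypothetical size-based switch between two alltoall algorithms;
c small_limit and both routine names are illustrative only.
      if (n*8 .le. small_limit) then
c        small messages: fully concurrent exchange (original LAM style)
         call alltoall_full_concurrency(sbuf, rbuf, n, comm, err)
      else
c        larger messages: scheduled, chunked exchange as above
         call alltoall_scheduled(sbuf, rbuf, n, comm, err)
      endif
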
I agree one should consider the message size. However I suspect other
parameters should be considered as well.

As far as I can tell, the advantage of LAM for small buffers is its
high concurrency, which helps reduce the latency.

However one should also consider the number of procs, as the total
number of messages queued on a given proc should not overflow the
system buffers. This is an additional constraint on the level of
concurrency permitted on a given system.

So my suggestions for developing an optimal algorithm are the following :

1) Use "small messages". For larger buffers (on uniprocessor nodes),
perform the alltoall by looping over smaller chunks.

* This option might be beneficial for the other collectives as well... *

2) Adjust the level of concurrency with the system limits in mind.

Beyond 64 KB, two concurrent sends and recvs are sufficient to saturate
the full-duplex bandwidth.

For smaller buffers a higher concurrency would probably increase the
performance. However, the full concurrency implemented in the original
LAM algorithm is dangerous beyond a dozen nodes and may require
delicate system and network tuning to avoid freezes and instabilities.
This full concurrency is even more dangerous on multiprocessor systems.

It might thus be desirable to devise a more general algorithm with an
adjustable concurrency level (see the sketch after suggestion 3 below).

3) Investigate also the multiprocessor case.

The effective concurrency on the network interfaces is higher on
multiprocessors, as the network interface is shared by several
processors. Indeed, I found experimentally that the above algorithm
with a concurrency level of two and 64 KB chunks is no longer optimal
on 22 quad-processor nodes (88 procs).

Higher (close to optimal) performance is restored either by reducing
the chunk size to 16 KB, or by keeping large buffers *and* reducing the
concurrency level to one.
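
Coming back to suggestion 2, a minimal sketch of an adjustable
concurrency level could look like the following: post up to nconc
rounds of non-blocking exchanges before draining them. The names nconc,
MAXCONC, reqs and stats are illustrative only, and this batched scheme
is simpler than, not identical to, the overlapped pipeline shown
earlier.

c Sketch only: batch up to nconc rounds of Isend/Irecv before waiting,
c so the concurrency level can be tuned between 1 and nprocs-1.
      integer MAXCONC
      parameter (MAXCONC = 8)
      integer reqs(2*MAXCONC), stats(MPI_STATUS_SIZE,2*MAXCONC)
      integer nconc, nreq, k
c e.g. a concurrency level of two
      nconc = 2
      nreq = 0
      do k = 1, nprocs
         if (liste(k)-1 .eq. me) cycle
         call MPI_Irecv(rbuf(1,liste(k)), n, MPI_DOUBLE_PRECISION,
     &        liste(k)-1, 200, comm, reqs(nreq+1), err)
         call MPI_Isend(sbuf(1,liste(k)), n, MPI_DOUBLE_PRECISION,
     &        liste(k)-1, 200, comm, reqs(nreq+2), err)
         nreq = nreq + 2
c        drain once nconc exchanges are in flight
         if (nreq .ge. 2*nconc) then
            call MPI_Waitall(nreq, reqs, stats, err)
            nreq = 0
         endif
      enddo
      if (nreq .gt. 0) call MPI_Waitall(nreq, reqs, stats, err)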

It is thus unrealistic to seek a single optimal algorithm. I also
suspect that some tuning on a given architecture cannot be avoided in
order to determine the optimal chunk size, concurrency level, etc., for
various production topologies. Ideally, the LAM library should also be
aware of the system and IP bottlenecks to avoid freezing the
application on large clusters.

I hope this preliminary investigation will also prove beneficial for
the Open MPI project.

All the best.
Pierre Valiron.

-- 
Support the SAUVONS LA RECHERCHE movement:
http://recherche-en-danger.apinc.org/
       _/_/_/_/    _/       _/       Dr. Pierre VALIRON
      _/     _/   _/      _/   Laboratoire d'Astrophysique
     _/     _/   _/     _/    Observatoire de Grenoble / UJF
    _/_/_/_/    _/    _/    BP 53  F-38041 Grenoble Cedex 9 (France)
   _/          _/   _/    http://www-laog.obs.ujf-grenoble.fr/~valiron/
  _/          _/  _/     Mail: Pierre.Valiron_at_[hidden]
 _/          _/ _/      Phone: +33 4 7651 4787  Fax: +33 4 7644 8821
_/          _/_/