Jeff Squyres wrote:
>>What is the status of the other collective operations in LAM/MPI? Is
>>this "explosive" behaviour unique to alltoall?
>>
>>
>
>For the most part, yes. Most other algorithms use a logarithmic
>approach. But alltoall is quite difficult to optimize (there is other
>work in this area, but it has not been ported to LAM/MPI).
>
>
Fine.
Thus MPI_Alltoall is the most challenging collective primitive to
tackle...
>
>
>>I could easily imagine how to write another naive alltoall on top of
>>Isend and Irecv which would limit the number of pending requests to a
>>few per node with no serious performance penalty. Has this kind of
>>"safe" algorithm already been written by MPI gurus? Is it planned for
>>Open MPI? If not, I am willing to write a demonstration code.
>>
>>
>
>I should mention that both LAM and Open MPI use a component
>architecture for their collective algorithm implementations. As such,
>it's quite easy to drop in a new algorithm.
>
>We have not had the time to write out a better alltoall [yet]; if you
>wanted to do a little work in this area, that would be fantastic.
>
>Probably the easiest thing to do would be to simply replace the current
>alltoall algorithm -- it's in
>share/ssi/coll/lam_basic/src/ssi_coll_basic_lam_basic_alltoall.c
>(ignore the *_lamd() function). The code is relatively simple -- it
>just creates a bunch of persistent requests, issues MPI_Startall(), and
>then MPI_Waitall(). The rationale for using persistent requests was to
>create all the setup first and *then* initiate all the communication
>(i.e., don't mix the setup with potential communication delays). This
>could be a moot point, however, especially for a better algorithm.
>
>Any work that you do here will also be pretty much directly applicable
>to Open MPI (its first generation collective component framework is
>almost identical to LAM's).
>
>If you configure LAM/MPI with --enable-shared --disable-static, then
>you can easily re-build/re-install the lam_basic coll component
>directly from the share/ssi/coll/lam_basic directory (i.e., a simple
>"make all install" in there) rather than having to re-link the entire
>libmpi MPI library itself. This tends to save a lot of time during
>debugging (you should probably "make uninstall" the static installation
>first, however, to prevent confusion between the static and dynamic
>installs of LAM/MPI).
>
>Let us know what you find.
>
>
Dear Jeff,
I feel unable to tweak the LAM source; the C language is far too foreign
to me.
However I have performed some experiments with Fortran code after
enabling hardware flow control on the gigabit interfaces.
1) running LAM's MPI_Alltoall for increasing buffer sizes.
The attached file out.eps is a log-log plot of the elapsed time (in
seconds) as a function of the buffer size in bytes for 4, 8, 12, 16, 20
and 22 processors (1 proc per node). For very small buffers the time
depends only weakly on the buffer size and increases slowly with the
number of procs. In this regime the time is presumably dominated by
network latency, as expected, apart from some small anomalies at 22
procs. Conversely, in the limit of very large buffer sizes the time
grows linearly with the buffer size, and the aggregate transfer rate per
interface approaches the hardware limit for full-duplex gigabit (about
180 MB/s). In these two limits the LAM built-in algorithm therefore
performs well.
However, when enough processors are involved (more than 10), a
catastrophic intermediate regime shows up, presumably due to network
contention. I can understand why this contention is less severe for
small buffers, which may fit within the IP and TCP stack buffers, but I
still do not understand why it has less effect for larger buffers. One
can even see that the elapsed time is *reduced* when the buffer size is
increased beyond the intermediate regime. In the limit of very large
buffers everything becomes dominated by the physical transfer time, and
collision/contention delays matter less.
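For reference, the timings reported below were measured with a loop of
the following form (a stripped-down sketch, not the exact code; variable
names are illustrative). The rate/node figure counts the bytes both sent
and received by each process, with 1 MB = 2^20 bytes:

  program time_alltoall
    implicit none
    include 'mpif.h'
    integer, parameter :: bufsize = 65536    ! bytes sent to each peer
    integer, parameter :: niter   = 10       ! repetitions to average over
    character, allocatable :: sbuf(:), rbuf(:)
    integer :: ierr, rank, nprocs, it
    double precision :: t0, t1, iter_time, rate_node

    call MPI_Init(ierr)
    call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
    call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)
    allocate(sbuf(bufsize*nprocs), rbuf(bufsize*nprocs))
    sbuf = 'x'

    call MPI_Barrier(MPI_COMM_WORLD, ierr)   ! synchronize before timing
    t0 = MPI_Wtime()
    do it = 1, niter
       call MPI_Alltoall(sbuf, bufsize, MPI_BYTE, rbuf, bufsize, &
                         MPI_BYTE, MPI_COMM_WORLD, ierr)
    end do
    t1 = MPI_Wtime()
    iter_time = (t1 - t0) / niter

    ! each process sends and receives bufsize bytes to/from every other one
    rate_node = 2.d0 * bufsize * (nprocs-1) / iter_time / 2.d0**20
    if (rank == 0) write(*,'(i10,i6,f12.6,f10.3)') &
         bufsize, niter, iter_time, rate_node
    deallocate(sbuf, rbuf)
    call MPI_Finalize(ierr)
  end program time_alltoall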
2) writing a crude alltoall code which sends one buffer at a time from
each processor.
On each processor:
- post Irecvs from all processors;
- send to a single processor at a time, looping over MPI_Isend and
MPI_Waitany on all the Irecv requests.
A sketch of this loop is given below.
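Schematically (a sketch with illustrative names, not the exact code):

  subroutine crude_alltoall(sbuf, rbuf, bufsize, rank, nprocs)
    implicit none
    include 'mpif.h'
    integer, intent(in)    :: bufsize, rank, nprocs
    character, intent(in)  :: sbuf(bufsize*nprocs)
    character, intent(out) :: rbuf(bufsize*nprocs)
    integer, parameter :: tag = 99
    integer :: rreq(nprocs), sreq(nprocs)
    integer :: status(MPI_STATUS_SIZE), sstat(MPI_STATUS_SIZE, nprocs)
    integer :: p, i, idx, ierr

    ! post a receive from every processor first
    do p = 0, nprocs-1
       call MPI_Irecv(rbuf(1+p*bufsize), bufsize, MPI_BYTE, p, tag, &
                      MPI_COMM_WORLD, rreq(p+1), ierr)
    end do
    ! send to a single processor at a time: after each Isend, wait
    ! until one of the pending receives completes before going on
    do i = 1, nprocs
       p = mod(rank + i, nprocs)    ! stagger the targets by rank
       call MPI_Isend(sbuf(1+p*bufsize), bufsize, MPI_BYTE, p, tag, &
                      MPI_COMM_WORLD, sreq(i), ierr)
       call MPI_Waitany(nprocs, rreq, idx, status, ierr)
    end do
    call MPI_Waitall(nprocs, sreq, sstat, ierr)
  end subroutine crude_alltoall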
This alltoall code is slower than the original LAM one in the limits of
very small or very large buffers, but it behaves smoothly for all
intermediate buffer sizes.
This experiment supports the hypothesis that network contention is the
main bottleneck.
3) developing a preliminary optimized alltoall code.
In order to limit the concurrency per interface, the buffers should be
exchanged in an orderly fashion, with a single buffer being read and a
single buffer being written at a time through a given interface.
So far I have only considered the case of a single active processor per
node. For nodes carrying n processors each, the intra-node
communications (which lie within a fraction of an n-wide band of the
alltoall matrix) are small compared to the inter-node communications
[in the approximate ratio n/(2*nprocs)]. Thus, in the n-processor case,
an algorithm ordered as in the uniprocessor case would result in n
simultaneous Recvs and Sends per interface, which might not be optimal.
Additional thinking seems required in the n-processor case, especially
if different numbers of processors are involved on the different
nodes...
The construction of such an orderly concurrency schedule is not obvious.
Ralf Wildenhues (U. Bonn, Germany) suggested that I use his
reimplementation of a timetable algorithm for overlapping alltoall,
which is like the one in FFTW but much shorter:
* @see fftw-2.1.5/mpi/sched.c for details and the original algorithm
* @see J. A. M. Schreuder, "Constructing Timetables for Sport Competitions",
* Mathematical Programming Study 13, pp. 58-67 (1980),
* for some theory.
There are some subtleties in the case of an odd number of processors
which I have not fully understood yet. However, Ralf's algorithm is
straightforward to implement for an even number of processors. The
processors exchange buffers step by step; the algorithm provides the
target processor number as a function of the rank and the step, with
step 0 being a self-exchange. The output for 6 processors is the
following:
$ ../alltoall-sched/a.out 6
Computing schedule for npes = 6:
schedule OK (takes 6 steps to complete).
pe 0 schedule: 0 5 2 4 1 3
pe 1 schedule: 1 4 5 3 0 2
pe 2 schedule: 2 3 0 5 4 1
pe 3 schedule: 3 2 4 1 5 0
pe 4 schedule: 4 1 3 0 2 5
pe 5 schedule: 5 0 1 2 3 4
In the first non-trivial step, buffers are exchanged between the (0,5),
(1,4) and (2,3) processor pairs, then between the (0,2), (1,5) and (3,4)
pairs, and similarly for the remaining steps.
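For an even number of processors, a classical round-robin ("circle
method") construction yields such a schedule in a few lines. This is my
own sketch, not Ralf's code; each step pairs all the processors in the
same way as in the printout above, although the ordering of the steps
may differ:

  integer function sched_partner(rank, step, npes)
    ! partner of 'rank' at step 'step' (0 <= step <= npes-1), npes even;
    ! step 0 is self-exchange, and over steps 1..npes-1 every pair of
    ! processors meets exactly once
    implicit none
    integer, intent(in) :: rank, step, npes
    integer :: m
    m = npes - 1
    if (step == 0) then
       sched_partner = rank
    else if (rank == m) then
       ! the "fixed" processor meets the rank r with 2*r = step (mod m);
       ! npes/2 is the multiplicative inverse of 2 modulo m (m is odd)
       sched_partner = mod(step * (npes/2), m)
    else
       sched_partner = mod(step - rank + m, m)  ! rank + partner = step (mod m)
       if (sched_partner == rank) sched_partner = m
    end if
  end function sched_partner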
In my implementation, each processor initiates the exchange of the
current step by calling MPI_Isend and MPI_Irecv, then waits for
completion by calling MPI_Waitall on both the Isend and Irecv requests.
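In outline (again a sketch with illustrative names, using the
sched_partner function above):

  subroutine sched_alltoall(sbuf, rbuf, bufsize, rank, npes)
    implicit none
    include 'mpif.h'
    integer, intent(in)    :: bufsize, rank, npes
    character, intent(in)  :: sbuf(bufsize*npes)
    character, intent(out) :: rbuf(bufsize*npes)
    integer, parameter :: tag = 99
    integer :: step, peer, ierr, req(2), statuses(MPI_STATUS_SIZE, 2)
    integer, external :: sched_partner

    do step = 0, npes-1
       peer = sched_partner(rank, step, npes)
       if (peer == rank) then
          ! step 0: local copy, no communication
          rbuf(1+rank*bufsize:(rank+1)*bufsize) = &
               sbuf(1+rank*bufsize:(rank+1)*bufsize)
       else
          ! a single pairwise exchange per step on each interface
          call MPI_Irecv(rbuf(1+peer*bufsize), bufsize, MPI_BYTE, peer, &
                         tag, MPI_COMM_WORLD, req(1), ierr)
          call MPI_Isend(sbuf(1+peer*bufsize), bufsize, MPI_BYTE, peer, &
                         tag, MPI_COMM_WORLD, req(2), ierr)
          call MPI_Waitall(2, req, statuses, ierr)
       end if
    end do
  end subroutine sched_alltoall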
The code performs well for all buffer sizes, as illustrated for 22 procs:
> mpirun N a.out
NPROCS 22
buf_size (bytes), iterations, iter_time (s), rate (MB/s), rate/node (MB/s)
8 100 0.003130 2.252 0.102
16 100 0.003266 4.317 0.196
32 100 0.003145 8.965 0.408
64 100 0.003239 17.411 0.791
128 100 0.003195 35.300 1.605
256 100 0.003310 68.150 3.098
512 100 0.003398 132.783 6.036
1024 10 0.003798 237.583 10.799
2048 10 0.004040 446.741 20.306
4096 10 0.004741 761.368 34.608
8192 10 0.005888 1226.028 55.729
16384 10 0.008589 1680.872 76.403
32768 10 0.014805 1950.382 88.654
65536 10 0.042751 1350.861 61.403
131072 10 0.058570 1971.998 89.636
262144 10 0.105964 2179.982 99.090
524288 10 0.200738 2301.506 104.614
1048576 2 0.391205 2361.930 107.360
2097152 2 0.766662 2410.448 109.566
4194304 2 1.514088 2441.074 110.958
8388608 2 3.014646 2452.029 111.456
16777216 2 6.010918 2459.524 111.797
However, the original LAM code remains faster in the limits of very
small or very large buffers, as illustrated in the figure new.eps.
Also, the new algorithm remains stable when all processors are used (4
procs per node):
> mpirun C a.out
NPROCS 88
buf_size (bytes), iterations, iter_time (s), rate (MB/s), rate/node (MB/s)
8 100 0.019322 6.046 0.069
16 100 0.019621 11.908 0.135
32 100 0.019335 24.168 0.275
64 100 0.019306 48.408 0.550
128 100 0.019652 95.111 1.081
256 100 0.019927 187.598 2.132
512 100 0.020507 364.583 4.143
1024 10 0.021309 701.738 7.974
2048 10 0.025843 1157.234 13.150
4096 10 0.026929 2221.136 25.240
8192 10 0.035265 3392.135 38.547
16384 10 0.059175 4043.095 45.944
32768 10 0.111778 4280.811 48.646
65536 10 0.296268 3230.178 36.707
131072 10 0.688215 2781.106 31.603
262144 10 1.429268 2678.294 30.435
524288 10 2.677655 2859.218 32.491
1048576 2 5.070475 3019.835 34.316
2097152 2 9.960572 3074.522 34.938
4194304 2 20.421967 2999.123 34.081
8388608 2 40.559716 3020.140 34.320
In the multi-processor case the aggregate bandwidth actually proves
quite good for large buffers, up to about 120 MB/s of aggregated
bandwidth per interface.
These results seem promising.
I would appreciate any additional suggestions from the list to improve
this experimental algorithm and make it comparable to, or faster than,
the original LAM one in all situations.
Cheers.
--
Support the SAUVONS LA RECHERCHE movement:
http://recherche-en-danger.apinc.org/
_/_/_/_/ _/ _/ Dr. Pierre VALIRON
_/ _/ _/ _/ Laboratoire d'Astrophysique
_/ _/ _/ _/ Observatoire de Grenoble / UJF
_/_/_/_/ _/ _/ BP 53 F-38041 Grenoble Cedex 9 (France)
_/ _/ _/ http://www-laog.obs.ujf-grenoble.fr/~valiron/
_/ _/ _/ Mail: Pierre.Valiron_at_[hidden]
_/ _/ _/ Phone: +33 4 7651 4787 Fax: +33 4 7644 8821
_/ _/_/
- application/postscript attachment: out.eps
- application/postscript attachment: new.eps