On Aug 16, 2005, at 1:19 PM, Pierre Valiron wrote:
> I feel unable to tweak the LAM source; the C language is far too
> foreign for me.
Gotcha.
> However I have performed some experiments with Fortran code after
> enabling hardware flow control on the gigabit interfaces.
>
> 1) running LAM MPI_Alltoall for increased buffer sizes.
>
> I attach in the file out.eps a log-log plot of the elapsed time (in
> seconds) as a function of the buffer size in bytes for 4, 8, 12, 16,
> 20, 22 processors (1 proc per node). For very small buffers, the time
> is weakly dependent on the buffer size, and slowly increases with the
> number of procs. In this regime the time is presumably dominated by
> network latencies as it should, with the exception of some small
> anomalies for 22 procs. Alternatively in the limit of very large
> buffer sizes the time grows linearly with the buffer size and the
> aggregate transfer rate per interface approaches the hardware limit
> for full-duplex gigabit (about 180 MB/s). Thus for these two limits
> the LAM built-in algorithm performs well.
>
> However, when enough processors are involved (more than 10), a
> catastrophic intermediate regime shows up, presumably due to network
> contention. I can understand why this contention is less severe for
> small buffers, which may fit in the IP and TCP stacks; however, I
> still don't understand why this contention has less effect for larger
> buffers. One can even see that the elapsed time is *reduced* when the
> buffer size is increased beyond the intermediate regime. In the limit
> of very large buffers, everything is dominated by the physical
> transfer time, and collision/contention overhead matters less.
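Just so we're sure we're measuring the same thing: I'm assuming a
timing loop roughly like the following (an untested C sketch on my
part -- NITER and buf_size are placeholders, and your actual code is
Fortran):

  #include <mpi.h>
  #include <stdio.h>
  #include <stdlib.h>

  /* Untested sketch of an MPI_Alltoall timing loop; NITER and
     buf_size are placeholders. */
  #define NITER 100

  int main(int argc, char **argv)
  {
      int rank, nprocs, buf_size = 1024;   /* bytes per destination */
      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

      char *sendbuf = malloc((size_t)buf_size * nprocs);
      char *recvbuf = malloc((size_t)buf_size * nprocs);

      MPI_Barrier(MPI_COMM_WORLD);
      double t0 = MPI_Wtime();
      for (int i = 0; i < NITER; ++i)
          MPI_Alltoall(sendbuf, buf_size, MPI_BYTE,
                       recvbuf, buf_size, MPI_BYTE, MPI_COMM_WORLD);
      double t = (MPI_Wtime() - t0) / NITER;

      if (rank == 0)
          printf("%d bytes: %f s per alltoall\n", buf_size, t);

      free(sendbuf);
      free(recvbuf);
      MPI_Finalize();
      return 0;
  }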
>
> 2) writing a crude alltoall code to send a buffer at a time from each
> processor
>
> on each processor
> - post an MPI_Irecv for every processor
> - send to a single processor at a time, looping over MPI_Isend and
> MPI_Waitany (over all the Irecv requests)
>
> This Alltoall code is slower than the original LAM one in the limit of
> very small or very large buffers, however it behaves smoothly for all
> intermediate buffer sizes.
>
> This experiment supports the hypothesis of network contention as the
> main bottleneck source.
Good stuff. I have heard similar stories from others who have worked
on optimizing collectives -- message size is something that LAM
currently does not use as a factor for determining which algorithm to
use (and therefore the 1st generation Open MPI collectives don't either
-- they were pretty much directly derived from LAM).
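Just to check that I'm reading your step (2) correctly, each process
does something like the following, right? (This is an untested C
sketch of my reading of it -- the function name and the
(rank + p) % nprocs send order are mine, not yours.)

  #include <mpi.h>
  #include <stdlib.h>

  /* Untested sketch of the "crude" alltoall as I understand it: post
     all the receives up front, then send to one peer at a time,
     draining one outstanding receive between sends. */
  void crude_alltoall(char *sendbuf, char *recvbuf, int count,
                      MPI_Comm comm)
  {
      int rank, nprocs;
      MPI_Comm_rank(comm, &rank);
      MPI_Comm_size(comm, &nprocs);

      MPI_Request *rreq = malloc(nprocs * sizeof(MPI_Request));
      MPI_Request sreq;

      /* Post a receive from every process (including self). */
      for (int p = 0; p < nprocs; ++p)
          MPI_Irecv(recvbuf + (size_t)p * count, count, MPI_BYTE,
                    p, 0, comm, &rreq[p]);

      /* Send to one process at a time; after each send completes,
         wait for one of the outstanding receives to finish. */
      for (int p = 0; p < nprocs; ++p) {
          int dst = (rank + p) % nprocs, done;
          MPI_Isend(sendbuf + (size_t)dst * count, count, MPI_BYTE,
                    dst, 0, comm, &sreq);
          MPI_Wait(&sreq, MPI_STATUS_IGNORE);
          MPI_Waitany(nprocs, rreq, &done, MPI_STATUS_IGNORE);
      }
      free(rreq);
  }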
> 3) developing a preliminary optimized alltoall code
>
> In order to limit the concurrency per interface, the buffers should be
> exchanged in an orderly fashion, with a single buffer being read and
> written at a time through a given interface.
>
> So far I have only considered the case of a single active processor
> per node. With n active processors per node and a large total number
> of processors, the intra-node communications (which lie within a
> fraction of an n-band of the alltoall matrix) are small with respect
> to the inter-node communications [the approximate ratio is
> n/(2*nprocs)]. Thus in the n-processor case an algorithm ordered as
> in the uniprocessor case would result in n simultaneous Recvs and
> Sends per interface, which might not be optimal. Additional thinking
> seems required for the n-processor case, especially if different
> numbers of processors are active on the various nodes...
>
> The construction of such an orderly concurrency schedule is not
> obvious. Ralf Wildenhues (U. Bonn, Germany) suggested that I use his
> own reimplementation of a timetable algorithm for an overlapping
> all-to-all, which is like the one in fftw but much shorter:
> @see fftw-2.1.5/mpi/sched.c for details and the original algorithm
> @see J. A. M. Schreuder, "Constructing Timetables for Sport
> Competitions", Mathematical Programming Study 13, pp. 58-67 (1980),
> for some theory.
>
> There are some subtleties in the case of an odd number of processors
> that I have not fully understood yet. However, Ralf's algorithm is
> straightforward to implement for an even number of processors. Each
> processor exchanges buffers step by step, and the algorithm provides
> the target processor number as a function of the rank and the step.
> Step 0 is self-exchange. An example for 6 processors is the
> following:
>
> $ ../alltoall-sched/a.out 6
> Computing schedule for npes = 6:
> schedule OK (takes 6 steps to complete).
> pe 0 schedule: 0 5 2 4 1 3
> pe 1 schedule: 1 4 5 3 0 2
> pe 2 schedule: 2 3 0 5 4 1
> pe 3 schedule: 3 2 4 1 5 0
> pe 4 schedule: 4 1 3 0 2 5
> pe 5 schedule: 5 0 1 2 3 4
> In the first non-trivial step, buffers are exchanged between (0,5),
> (1,4) and (2,3) processor pairs, then between (0,2), (1,5) and (3,4)
> pairs, and similarly for the remaining steps.
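For an even npes, that looks like the classic "circle method"
round-robin tournament schedule. Here's a minimal C sketch of that
construction (my own, untested; sched_partner is just my name for it).
It reproduces your 6-process table above, though I can't say whether
it is exactly what Ralf's code computes:

  /* Untested sketch of a circle-method schedule for an even number of
     processes: step 0 is self-exchange; at step s (1..npes-1),
     process npes-1 pairs with process s-1, and every other process i
     pairs with (2*(s-1) - i) mod (npes-1).  Each step is a perfect
     matching, so every process sends to and receives from exactly one
     partner. */
  int sched_partner(int rank, int step, int npes)  /* npes must be even */
  {
      if (step == 0)
          return rank;                      /* self-exchange */
      int r = step - 1, m = npes - 1;
      if (rank == m)
          return r;
      if (rank == r)
          return m;
      return ((2 * r - rank) % m + m) % m;  /* keep result in 0..m-1 */
  }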
>
> In my implementation, each processor initiates the exchange by calling
> MPI_Isend and MPI_Irecv, then waits for completion by calling
> MPI_Waitall on both the Isend and the Irecv requests.
Note that you might get slightly better performance if you use blocking
sends/receives -- i.e., the lower-ranked process does an MPI_Send
followed by an MPI_Recv, and the higher-ranking process does an
MPI_Recv followed by an MPI_Send. This will definitely be true for
short messages (because short messages are sent eagerly), but the
optimization may lose out as messages get larger, and the dual
bandwidth effect of simultaneous sends/receives may be a bigger win.
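In code, each step of the schedule would look something like this
(untested sketch; "partner" is whatever the schedule gives you for
this step, and the caller passes in the per-partner slices of the
send and receive buffers):

  #include <mpi.h>
  #include <string.h>

  /* Untested sketch of one schedule step using blocking calls, ordered
     by rank so that within a pair one side sends while the other
     receives: the lower rank sends first, the higher rank receives
     first.  Assumes sendbuf and recvbuf do not overlap. */
  void exchange_step(char *sendbuf, char *recvbuf, int count,
                     int rank, int partner, MPI_Comm comm)
  {
      if (partner == rank) {            /* self-exchange step */
          memcpy(recvbuf, sendbuf, (size_t)count);
          return;
      }
      if (rank < partner) {
          MPI_Send(sendbuf, count, MPI_BYTE, partner, 0, comm);
          MPI_Recv(recvbuf, count, MPI_BYTE, partner, 0, comm,
                   MPI_STATUS_IGNORE);
      } else {
          MPI_Recv(recvbuf, count, MPI_BYTE, partner, 0, comm,
                   MPI_STATUS_IGNORE);
          MPI_Send(sendbuf, count, MPI_BYTE, partner, 0, comm);
      }
  }

(MPI_Sendrecv would also work here without having to think about the
ordering.)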
> The code performs well for all buffer sizes, as illustrated for 22
> procs:
>
> > mpirun N a.out
> NPROCS 22
> buf_size, sent/node, iter_time (s), rate, rate/node (MB/s)
> 8 100 0.003130 2.252 0.102
> 16 100 0.003266 4.317 0.196
> 32 100 0.003145 8.965 0.408
> 64 100 0.003239 17.411 0.791
> 128 100 0.003195 35.300 1.605
> 256 100 0.003310 68.150 3.098
> 512 100 0.003398 132.783 6.036
> 1024 10 0.003798 237.583 10.799
> 2048 10 0.004040 446.741 20.306
> 4096 10 0.004741 761.368 34.608
> 8192 10 0.005888 1226.028 55.729
> 16384 10 0.008589 1680.872 76.403
> 32768 10 0.014805 1950.382 88.654
> 65536 10 0.042751 1350.861 61.403
> 131072 10 0.058570 1971.998 89.636
> 262144 10 0.105964 2179.982 99.090
> 524288 10 0.200738 2301.506 104.614
> 1048576 2 0.391205 2361.930 107.360
> 2097152 2 0.766662 2410.448 109.566
> 4194304 2 1.514088 2441.074 110.958
> 8388608 2 3.014646 2452.029 111.456
> 16777216 2 6.010918 2459.524 111.797
>
> However, the original LAM code remains faster in the limit of very
> small or very large buffers, as illustrated in the figure new.eps.
>
> Also, the new algorithm remains stable when all processors are used (4
> procs per node):
>
> > mpirun C a.out
> NPROCS 88
> buf_size, sent/node, iter_time (s), rate, rate/node (MB/s)
> 8 100 0.019322 6.046 0.069
> 16 100 0.019621 11.908 0.135
> 32 100 0.019335 24.168 0.275
> 64 100 0.019306 48.408 0.550
> 128 100 0.019652 95.111 1.081
> 256 100 0.019927 187.598 2.132
> 512 100 0.020507 364.583 4.143
> 1024 10 0.021309 701.738 7.974
> 2048 10 0.025843 1157.234 13.150
> 4096 10 0.026929 2221.136 25.240
> 8192 10 0.035265 3392.135 38.547
> 16384 10 0.059175 4043.095 45.944
> 32768 10 0.111778 4280.811 48.646
> 65536 10 0.296268 3230.178 36.707
> 131072 10 0.688215 2781.106 31.603
> 262144 10 1.429268 2678.294 30.435
> 524288 10 2.677655 2859.218 32.491
> 1048576 2 5.070475 3019.835 34.316
> 2097152 2 9.960572 3074.522 34.938
> 4194304 2 20.421967 2999.123 34.081
> 8388608 2 40.559716 3020.140 34.320
>
> In the multi-processor case the aggregate bandwidth is actually quite
> good for large buffers, up to about 120 MB/s of aggregate bandwidth
> per interface.
>
> These results seem promising.
>
> I would appreciate any additional suggestions from the list to improve
> this experimental algorithm and make it comparable to or faster than
> the original LAM one in all situations.
I'd try the suggestions that I outlined above, and if the original LAM
algorithm still wins, then let's put an "if" statement around it that
switches based on the message size. We can certainly switch algorithms
based on data sizes -- I think it's a well-known fact that one
algorithm definitely does not fit all data sizes.
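Roughly like this (the function names and the crossover thresholds
below are placeholders I made up; the real thresholds would come from
tuning runs like yours):

  #include <mpi.h>
  #include <stddef.h>

  /* Untested sketch of dispatching on message size; the two called
     functions and both thresholds are placeholders. */
  int lam_builtin_alltoall(void *, void *, int, MPI_Datatype, MPI_Comm);
  int scheduled_alltoall(void *, void *, int, MPI_Datatype, MPI_Comm);

  #define SMALL_MSG_MAX       512           /* bytes -- made up */
  #define LARGE_MSG_MIN (4 * 1024 * 1024)   /* bytes -- made up */

  int alltoall_dispatch(void *sbuf, void *rbuf, int count,
                        MPI_Datatype dtype, MPI_Comm comm)
  {
      int size;
      MPI_Type_size(dtype, &size);
      size_t bytes = (size_t)count * size;

      /* The built-in algorithm wins at the extremes; the scheduled
         one wins in the contended intermediate regime. */
      if (bytes <= SMALL_MSG_MAX || bytes >= LARGE_MSG_MIN)
          return lam_builtin_alltoall(sbuf, rbuf, count, dtype, comm);
      return scheduled_alltoall(sbuf, rbuf, count, dtype, comm);
  }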
How does that sound?
--
{+} Jeff Squyres
{+} jsquyres_at_[hidden]
{+} http://www.lam-mpi.org/