Jeff Squyres wrote:
>On Aug 11, 2005, at 9:13 AM, Pierre Valiron wrote:
>
>
>
>>I am experimenting with MPI_Alltoall performance on a Gigabit cluster
>>of 30 quad-processor nodes (Sun v40z running Solaris 10). This seems
>>to be a unique configuration in Europe, and we are still lacking
>>updated manuals for all the subtleties of Solaris 10. As far as I can
>>guess, the future Sun MPI might rely on Open MPI, so it makes sense to
>>work with LAM/MPI in the meantime. I don't need to advertise LAM/MPI
>>any more on this list either ;-).
>>
>>I am using LAM/MPI 7.1.1 compiled with the Sun Studio 10 compilers.
>>Simple MPI codes run fine across all the nodes and CPUs. Complex MPI
>>codes also run on individual 4-way nodes with excellent performance.
>>
>>Now I am using a crude Alltoall benchmark to stress the whole cluster
>>and identify the bottlenecks and problems, and I hope I'll get some
>>useful feedback from this list.
>>
>>- Using the default setup for the Ethernet cards and for the HP
>>ProCurve 2848 switch results in poor performance and even freezes the
>>benchmark when the number of nodes is increased.
>>
>>
>
>Yikes. This scares me -- there's no reason that correct tests should
>freeze. Can you verify that all the hardware and TCP stacks are
>working correctly?
>
>
Hard to be sure with brand new hardware. However, it produces no error
messages and has proved rock stable in all other experiments, including
intensive NFS transfers over the network.
Today I wrote a small code that rotates buffers along a ring of
processes: at each step, every process sends a buffer to rank+1 and
receives one from rank-1. This code does not freeze and works nicely on
25 nodes with either 25 processes (1 per node) or 100 processes (4 per
node), using either rpi tcp or rpi usysv. The bandwidth per node
increases nicely with buffer size, up to a limit of about 190 MB/s
(95 MB/s for sends and the same for receives), close to the physical
Gigabit bandwidth.
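The core of the test is essentially the following (a simplified sketch,
not the exact code I ran; the buffer size is a placeholder and the
timing is stripped out):

  #include <mpi.h>
  #include <stdlib.h>

  int main(int argc, char **argv)
  {
      int rank, size, right, left, step;
      int n = 1 << 20;                  /* buffer size in bytes (placeholder) */
      char *sendbuf, *recvbuf;
      MPI_Status status;

      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      MPI_Comm_size(MPI_COMM_WORLD, &size);

      sendbuf = malloc(n);
      recvbuf = malloc(n);
      right = (rank + 1) % size;          /* send to rank+1      */
      left  = (rank - 1 + size) % size;   /* receive from rank-1 */

      for (step = 0; step < size; step++) {
          /* Rotate the buffer one position around the ring; the combined
             send/receive avoids any deadlock. */
          MPI_Sendrecv(sendbuf, n, MPI_BYTE, right, 0,
                       recvbuf, n, MPI_BYTE, left,  0,
                       MPI_COMM_WORLD, &status);
      }

      free(sendbuf);
      free(recvbuf);
      MPI_Finalize();
      return 0;
  }
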
>
>
>>- We then forced the speed to 1000-fdx and enabled flow control on the
>>switch. This resulted in a big performance improvement and reduced the
>>occurrence of freezes with a large number of nodes.
>>
>>
>
>Reduced or eliminated?
>
>
Reduced...
>
>
>>So far, so good. However I come to performance issues.
>>
>>I illustrate with an example using 13 nodes (the slice size is
>>expressed in bytes).
>>
>>
>
>[snipped the performance numbers]
>
>Note that alltoall is quite a heavy test -- it sends data from each
>process to each other process. This can cause a massive amount of data
>flow. LAM/MPI also has a rather naive implementation of the Alltoall
>algorithm: in each process, we just start a send to every other peer
>process in the communicator and then MPI_Waitall() to let them progress
>and complete.
>
>The OS and the switch are therefore going to have a lot to do with the
>performance of this algorithm. Since LAM just throws all the data up
>in the air, it's relying on both the OS and the switch to sort it all
>out and make sure that all the data gets to where it's supposed to go.
>This can cause lots and lots of congestion, as you've noticed (it isn't
>much of a factor for short messages).
>
>
I agree that the freeze might just be related to OS and TCP stack
congestion when too many data packets are thrown into the air. I'll try
to get more information on the Solaris 10 kernel to learn how to tune
the TCP stack.
The alltoall implementation you describe seems a bit hairy... If I
understand it properly, it does not scale at all. The total number of
data packets to be exchanged grows as Nproc^2, sure, but in this naive
implementation there is no per-node bound on the number of pending
packets. With 100 processors or more, it seems very risky to scatter
10000 (large) packets or more across the system in one shot.
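To make sure I understand, the pattern you describe is essentially this
(my own paraphrase, not the actual LAM source):

  #include <mpi.h>
  #include <stdlib.h>

  void naive_alltoall(char *sendbuf, char *recvbuf, int slice, MPI_Comm comm)
  {
      int i, nprocs;
      MPI_Request *reqs;
      MPI_Status  *stats;

      MPI_Comm_size(comm, &nprocs);
      reqs  = malloc(2 * nprocs * sizeof(MPI_Request));
      stats = malloc(2 * nprocs * sizeof(MPI_Status));

      /* Post a nonblocking receive and send for every peer at once... */
      for (i = 0; i < nprocs; i++) {
          MPI_Irecv(recvbuf + i * slice, slice, MPI_BYTE, i, 0,
                    comm, &reqs[2 * i]);
          MPI_Isend(sendbuf + i * slice, slice, MPI_BYTE, i, 0,
                    comm, &reqs[2 * i + 1]);
      }
      /* ...so every process has 2*nprocs requests in flight before this wait. */
      MPI_Waitall(2 * nprocs, reqs, stats);

      free(reqs);
      free(stats);
  }
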
As I understand it, enabling Gigabit hardware flow control may help the
OS limit the packet explosion, which would explain why the MPI machine
is less prone to freezing in that case.
What is the status of the other collective operations in LAM/MPI? Is
this "explosive" behaviour unique to alltoall?
I could easily imagine writing another simple alltoall on top of
MPI_Isend and MPI_Irecv that limits the number of pending requests to a
few per node, with no serious performance penalty (see the sketch
below). Has this kind of "safe" algorithm already been written by MPI
gurus? Is it planned for Open MPI? If not, I am willing to write a
demonstration code.
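A rough sketch of what I have in mind (untested; WINDOW is just an
arbitrary tuning knob):

  #include <mpi.h>

  #define WINDOW 4   /* maximum pending send/recv pairs per process */

  void throttled_alltoall(char *sendbuf, char *recvbuf, int slice, MPI_Comm comm)
  {
      int rank, nprocs, round, w;
      MPI_Request reqs[2 * WINDOW];
      MPI_Status  stats[2 * WINDOW];

      MPI_Comm_rank(comm, &rank);
      MPI_Comm_size(comm, &nprocs);

      for (round = 0; round < nprocs; round += WINDOW) {
          int n = 0;
          for (w = round; w < round + WINDOW && w < nprocs; w++) {
              /* In step w every process sends to rank+w and receives from
                 rank-w, so each send is matched within the same window. */
              int to   = (rank + w) % nprocs;
              int from = (rank - w + nprocs) % nprocs;
              MPI_Irecv(recvbuf + from * slice, slice, MPI_BYTE, from, 0,
                        comm, &reqs[n++]);
              MPI_Isend(sendbuf + to * slice, slice, MPI_BYTE, to, 0,
                        comm, &reqs[n++]);
          }
          /* Never more than 2*WINDOW requests pending per process. */
          MPI_Waitall(n, reqs, stats);
      }
  }
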
>
>
>>From the above numbers one can draw some conclusions.
>>
>>- Except in the largest case, small slice sizes result in small
>>elapsed times, seemingly dominated by Gigabit latencies.
>>- Increasing the slice size improves the throughput with little impact
>>on elapsed time up to some limiting value, as indicated in the table
>>below:
>>nodes, procs, slice limit
>>13, 13, 16384
>>13, 52, 16384
>>23, 23, 4096
>>23, 92, 512 or 1024
>>
>>Beyond this limiting slice value, the elapsed time first increases by
>>1 or 2 orders of magnitude and then generally *decreases* as the slice
>>size is increased further... Except for the biggest case, very good
>>performance is restored for very large slice sizes.
>>
>>
>
>I'm guessing that this has to be related to how the OS and switch are
>handling the data transfers.
>
>
Agreed.
>
>
>>The behaviour seems erratic. I also experimented on another cluster
>>(Linux, 13 dual-Xeon nodes). The trend is similar, but the limiting
>>slice values are different. I see no logic in this behaviour, and I
>>suspect some bottleneck in the kernel IP stack, in LAM/MPI, or both.
>>The freezing issues and the beneficial role of enabling flow control
>>on Gigabit make me suspect primarily IP congestion problems at the
>>system level. However, LAM/MPI also has some performance issues when
>>the message size goes beyond the "small message" 64K limit, and might
>>also be unfair with respect to IP bandwidth saturation.
>>
>>
>
>You're right -- LAM defaults to switching between short and long
>message protocols (i.e., eager vs. rendezvous) at 64k. You can change
>this value by changing the SSI parameter rpi_tcp_short to a larger
>value (e.g., 128k, 256k, etc.). For GigE, I would suspect that 64k is
>too low (even for normal point-to-point sends). Try bumping it up and
>look at normal point-to-point latency/bandwidth values. Then try your
>alltoall test again.
>
>
This is a good suggestion. I'll do it.
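If I read the LAM 7 documentation correctly, that would be something
like this (262144 bytes = 256k; ./alltoall_bench stands for my
benchmark program):

  mpirun -ssi rpi tcp -ssi rpi_tcp_short 262144 C ./alltoall_bench
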
Many thanks for your fast answer.
Pierre.
--
Support the SAUVONS LA RECHERCHE movement:
http://recherche-en-danger.apinc.org/
_/_/_/_/ _/ _/ Dr. Pierre VALIRON
_/ _/ _/ _/ Laboratoire d'Astrophysique
_/ _/ _/ _/ Observatoire de Grenoble / UJF
_/_/_/_/ _/ _/ BP 53 F-38041 Grenoble Cedex 9 (France)
_/ _/ _/ http://www-laog.obs.ujf-grenoble.fr/~valiron/
_/ _/ _/ Mail: Pierre.Valiron_at_[hidden]
_/ _/ _/ Phone: +33 4 7651 4787 Fax: +33 4 7644 8821
_/ _/_/