On Aug 11, 2005, at 9:13 AM, Pierre Valiron wrote:
> I am experimenting with MPI_Alltoall performance on a Gigabit cluster
> of 30 quad-processor nodes (Sun v40z running Solaris 10). This seems
> to be a unique configuration in Europe and we are still lacking updated
> manuals for all the subtleties of Solaris 10. As far as I can guess,
> the future Sun MPI might rely on Open MPI, so it makes sense to work in
> the meantime with LAM/MPI. I don't need to advertise LAM/MPI any more
> on this list either ;-).
>
> I am using LAM/MPI 7.1.1 compiled using the Sun Studio 10 compilers.
> Simple MPI codes run fine across all the nodes and CPUs. Complex MPI
> codes also run on individual 4-way nodes with excellent
> performance.
>
> Now I am using a crude Alltoall benchmark to stress the whole cluster
> and identify the bottlenecks and problems, and I hope I'll get some
> useful feedback from this list.
>
> - Using the default setup for the ethernet cards and for the HP ProCurve
> 2848 switch results in poor performance and even freezes the benchmark
> when the number of nodes is increased.
Yikes. This scares me -- there's no reason that correct tests should
freeze. Can you verify that all the hardware and TCP stacks are
working correctly?
> - We then forced the speed to 1000-fdx and enabled flow control on the
> switch. This resulted in a big performance improvement and reduced
> the occurrence of freezing issues with a large number of nodes.
Reduced or eliminated?
> So far, so good. However, I now come to performance issues.
>
> I illustrate with an example using 13 nodes (the slice size is
> expressed in bytes).
[snipped the performance numbers]
Note that alltoall is quite a heavy test -- it sends data from each
process to each other process. This can cause a massive amount of data
flow. LAM/MPI also has a rather naive implementation of the Alltoall
algorithm: in each process, we just start a send to every other peer
process in the communicator and then MPI_Waitall() to let them progress
and complete.
The OS and the switch are therefore going to have a lot to do with
performance of this algorithm. Since LAM just throws all the data up
in the air, it's relying on both the OS and the switch to sort it all
out and make sure that all the data gets to where it's supposed to go.
This can cause a lot of congestion with large messages, as you've
noticed (it isn't much of a factor for short messages).
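To make that concrete, here's a rough sketch of that kind of naive
alltoall -- just an illustration of the pattern described above, not
LAM's actual source (the function name and the lack of error checking
are mine):

  #include <stdlib.h>
  #include <mpi.h>

  /* Illustrative sketch only: post a nonblocking receive and a
   * nonblocking send for every peer, then wait on the whole set. */
  int naive_alltoall(void *sendbuf, void *recvbuf, int count,
                     MPI_Datatype dtype, MPI_Comm comm)
  {
      int i, size;
      MPI_Aint extent;
      MPI_Request *reqs;

      MPI_Comm_size(comm, &size);
      MPI_Type_extent(dtype, &extent);
      reqs = (MPI_Request *) malloc(2 * size * sizeof(MPI_Request));

      for (i = 0; i < size; ++i) {
          MPI_Irecv((char *) recvbuf + i * count * extent, count, dtype,
                    i, 0, comm, &reqs[2 * i]);
          MPI_Isend((char *) sendbuf + i * count * extent, count, dtype,
                    i, 0, comm, &reqs[2 * i + 1]);
      }

      /* Everything is in flight at once; the OS and the switch have
       * to sort out the resulting burst of traffic. */
      MPI_Waitall(2 * size, reqs, MPI_STATUSES_IGNORE);
      free(reqs);
      return MPI_SUCCESS;
  }

With N processes, each process posts 2*N requests at the same time, so
the network sees on the order of N*(N-1) concurrent flows -- which is
exactly where switch buffering and flow control start to matter.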
> From the above numbers one can draw some conclusions.
>
> - Except in the largest case, small slice sizes result in small
> elapsed times, seemingly dominated by Gigabit Ethernet latency.
> - Increasing the slice size improves the throughput with little impact
> on elapsed time up to some limiting value, as indicated in the table below:
>   nodes   procs   slice limit
>      13      13         16384
>      13      52         16384
>      23      23          4096
>      23      92   512 or 1024
>
> Beyond this limiting slice value, the elapsed time first increases by
> 1 or 2 orders of magnitude and then generally *decreases* as the slice
> size is increased further... Except for the biggest case, very good
> performance is restored for very large slice sizes.
I'm guessing that this has to be related to how the OS and switch are
handling the data transfers.
> The behaviour seems erratic. I also experimented on another cluster
> (Linux, 13 dual-Xeon nodes). The trend is similar, but the slice
> limits are different. I see no logic in this behaviour and I
> suspect some bottleneck in the kernel IP stack, LAM/MPI, or both. The
> freezing issues and the beneficial role of enabling flow control on
> Gigabit Ethernet lead me to suspect primarily IP congestion problems at
> the system level. However, LAM/MPI also has some performance issues when
> the message size goes beyond the "small message" 64K limit, and might
> also be unfair with respect to IP bandwidth saturation.
You're right -- LAM defaults to switching between short and long
message protocols (i.e., eager vs. rendezvous) at 64k. You can change
this by setting the SSI parameter rpi_tcp_short to a larger
value (e.g., 128k, 256k, etc.). For GigE, I would suspect that 64k is
too low (even for normal point-to-point sends). Try bumping it up and
look at normal point-to-point latency/bandwidth values. Then try your
alltoall test again.
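For example -- going from memory on the exact syntax, so double-check
the mpirun(1) and LAM SSI man pages -- something along these lines
should raise the eager/rendezvous crossover to 256k for a single run
(./alltoall_bench is just a placeholder for your benchmark binary):

  mpirun -ssi rpi tcp -ssi rpi_tcp_short 262144 C ./alltoall_bench

If memory serves, the same parameter can also be set through the
environment, which may be more convenient for batch jobs.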
--
{+} Jeff Squyres
{+} jsquyres_at_[hidden]
{+} http://www.lam-mpi.org/