Bogdan Costescu wrote:
>[ Sorry for coming so late in the thread...]
>
>
You are welcome!
>On Tue, 16 Aug 2005, Pierre Valiron wrote:
>
>
>
>>However I have performed some experiments with Fortran code after
>>enabling hardware flow control on the gigabit interfaces.
>>
>>
>
>If enabling hardware flow control improves performance, the switch
>that you are using might be the bottleneck in that it might not be
>able to cope with the simultaneous transfers from all (or most of) its
>ports that result from all-to-all communication. This is typical for a
>switch with a backbone bandwidth lower than the sum of bandwidths of
>all the ports.
>
>
As I already replied in a previous mail, I don't expect to hit a
bandwidth bottleneck on the Proserve 2848 switch. However, I might run
into a limit on the number of packets per second if the packets are
very small.
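To put rough numbers on it: with the 22 nodes used in the runs below, a
fully concurrent all-to-all would ask the switch backplane for about
22 x 1 Gbit/s, i.e. on the order of 22 Gbit/s per direction, which a
wire-speed gigabit switch should sustain. Very small buffers, on the
other hand, translate into a flood of small frames and stress the
per-packet forwarding rate instead.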
>
>
>>I can understand why this contention is less severe for small
>>buffers, which may fit in IP and TCP stacks
>>
>>
>
>On Linux, there are system wide variables that allow setting the
>buffer dimensions for TCP (/proc/sys/net/ipv4/tcp_*mem) - maybe you
>can find something similar for Solaris, easier now that the source is
>available...
>
>
I have already tried that, of course, with little success so far. The
documentation still lags behind the actual state of the system, and I
would need to dig into the kernel source as you suggest, which scares
me a bit!
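For what it's worth, independent of the system-wide settings, one can at
least check from a small test program what buffer sizes the kernel
actually grants per socket. A minimal, generic C sketch (just an
illustration of the per-socket counterpart of those tunables, not the
benchmark code used here):

/* Minimal sketch (illustration only, not the benchmark code): request
   larger per-socket TCP buffers and print what the kernel actually
   grants.  This is the per-socket counterpart of the system-wide
   tcp_*mem settings mentioned above. */
#include <stdio.h>
#include <unistd.h>
#include <sys/socket.h>
#include <netinet/in.h>

int main(void)
{
    int s = socket(AF_INET, SOCK_STREAM, 0);
    if (s < 0) { perror("socket"); return 1; }

    int req = 1 << 20;                       /* ask for 1 MB buffers */
    setsockopt(s, SOL_SOCKET, SO_SNDBUF, &req, sizeof(req));
    setsockopt(s, SOL_SOCKET, SO_RCVBUF, &req, sizeof(req));

    int snd = 0, rcv = 0;
    socklen_t len = sizeof(snd);
    getsockopt(s, SOL_SOCKET, SO_SNDBUF, &snd, &len);
    len = sizeof(rcv);
    getsockopt(s, SOL_SOCKET, SO_RCVBUF, &rcv, &len);

    /* The kernel may clamp (or, on Linux, double) the requested size. */
    printf("granted SO_SNDBUF=%d SO_RCVBUF=%d\n", snd, rcv);
    close(s);
    return 0;
}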
>
>
>>In order to limit the concurrency per interface, the buffers should be
>>exchanged in an orderly fashion, with a single buffer being read and
>>written at a time through a given interface.
>>
>>
>
>This indeed reduces the network contention, but probably increases
>latency due to the increased number of context switches and the
>waiting that is done in userspace. When you just push all your data to
>the kernel, transmissions can be optimized in kernel, with higher
>timing precision and with fewer context switches.
>
>I would like to ask for another data point: can you try using your
>all-to-all algorithm, but disable the hardware flow control? Based on
>the theory at least, due to the ordered pairwise communications, the
>switch should be less likely to saturate now and the hardware flow
>control should not make that much difference (if the switch is indeed
>the bottleneck).
>
>
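(For clarity, the ordered pairwise exchange under discussion is
essentially the pattern below. This is only a minimal MPI sketch,
assuming a power-of-two number of processes so that the simple XOR
pairing works; it is not the actual benchmark code, and the chunking
is left out.)

/* Sketch of an ordered pairwise all-to-all: in each round every process
   exchanges with exactly one partner, so each gigabit interface carries
   a single send and a single receive at a time.  Illustration only. */
#include <mpi.h>
#include <stdlib.h>
#include <string.h>

#define BLOCK 32768                       /* bytes exchanged with each peer */

int main(int argc, char **argv)
{
    int rank, nprocs;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);  /* assumed to be a power of two */

    char *sendbuf = malloc((size_t)nprocs * BLOCK);
    char *recvbuf = malloc((size_t)nprocs * BLOCK);
    memset(sendbuf, rank, (size_t)nprocs * BLOCK);

    /* the block destined for myself is just a quick local copy */
    memcpy(recvbuf + (size_t)rank * BLOCK,
           sendbuf + (size_t)rank * BLOCK, BLOCK);

    for (int round = 1; round < nprocs; round++) {
        int partner = rank ^ round;          /* unique pairing in every round */
        MPI_Sendrecv(sendbuf + (size_t)partner * BLOCK, BLOCK, MPI_BYTE,
                     partner, 0,
                     recvbuf + (size_t)partner * BLOCK, BLOCK, MPI_BYTE,
                     partner, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }

    free(sendbuf);
    free(recvbuf);
    MPI_Finalize();
    return 0;
}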
I have tried your suggestion.
With hardware flow control enabled, I get:
valiron_at_n11 ~ > mpirun N a.out
NPROCS 22 ALGO 4
ChunkSize 8000
 buf_size  sent/node  iter_time (s)  rate (MB/s)  rate/node (MB/s)
        8        100       0.002062        3.418             0.155
       16        100       0.002047        6.888             0.313
       32        100       0.002027       13.913             0.632
       64        100       0.002196       25.677             1.167
      128        100       0.002076       54.345             2.470
      256        100       0.002175      103.732             4.715
      512        100       0.002195      205.529             9.342
     1024         10       0.002399      376.060            17.094
     2048         10       0.002603      693.296            31.513
     4096         10       0.002973     1213.941            55.179
     8192         10       0.004451     1621.770            73.717
    16384         10       0.004918     2935.586           133.436
    32768         10       0.007646     3776.631           171.665
    65536         10       0.015845     3644.735           165.670
   131072         10       0.029629     3898.152           177.189
   262144         10       0.056227     4108.352           186.743
   524288         10       0.113784     4060.338           184.561
  1048576          2       0.225317     4100.898           186.404
  2097152          2       0.460307     4014.708           182.487
  4194304          2       0.917907     4026.552           183.025
  8388608          2       1.838465     4020.745           182.761
And with flow control disabled, I get contention again beyond 16 KB buffers:
valiron_at_n11 ~ > mpirun N a.out
NPROCS 22 ALGO 4
ChunkSize 8000
 buf_size  sent/node  iter_time (s)  rate (MB/s)  rate/node (MB/s)
        8        100       0.002171        3.248             0.148
       16        100       0.002084        6.766             0.308
       32        100       0.002077       13.578             0.617
       64        100       0.002120       26.606             1.209
      128        100       0.002062       54.704             2.487
      256        100       0.002473       91.232             4.147
      512        100       0.002304      195.787             8.899
     1024         10       0.002330      387.337            17.606
     2048         10       0.002766      652.507            29.659
     4096         10       0.002996     1204.639            54.756
     8192         10       0.003633     1986.786            90.308
    16384         10       0.004932     2927.171           133.053
    32768         10       0.976329       29.575             1.344
[very slow now]
So up to a buffer size of 16 KB I get exactly the same figures, and then
a slowdown of two orders of magnitude.
It is hard to tell whether the contention comes from the kernel or from
the switch... However, since I do not see much contention for small
packets, and since the switch is presumably limited more by packet
handling than by bandwidth, I suspect the contention is related to the
Solaris 10 kernel. I am aware this remains a rather hand-waving
conclusion ;-)
Regards.
Pierre.
PS. I also exercised Ralf's new scheduling algorithm for an odd number
of procs.
As I anticipated, the self-to-self copy is so quick that I can't see any
difference over the gigabit for 3 to 21 nodes. I also tried using 3
processors within a quad-processor node; however, the timing
fluctuations are so large that it is hard to see any difference either
with my best code (concurrency level = 2).
If I revert to a concurrency level of 1, then I can see a small
improvement when using 3 processors within a quad-processor node.
Consequently, Ralf's new scheduling algorithm is theoretically better
but makes little difference in practice on my platform. I guess this
could be different if I had a faster interconnect.
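For the record, a standard way to build such a pairwise schedule for an
odd number of processes is the round-robin (circle) method, in which
every round one rank is paired with itself and does the quick
self-to-self copy. The sketch below only prints this generic schedule;
it is not necessarily Ralf's exact algorithm.

/* Round-robin (circle method) pairing for an odd number of processes:
   in round r, rank i is paired with (r - i) mod n; the rank with
   2*i == r (mod n) is paired with itself (local self-copy).
   Generic construction, shown for illustration only. */
#include <stdio.h>

int main(void)
{
    const int n = 21;                    /* odd process count, e.g. 21 nodes */

    for (int round = 0; round < n; round++) {
        printf("round %2d:", round);
        for (int i = 0; i < n; i++) {
            int partner = (round - i + n) % n; /* i + partner == round (mod n) */
            if (partner == i)
                printf("  [%d self-copy]", i);
            else if (i < partner)
                printf("  %d<->%d", i, partner); /* each pair printed once */
        }
        printf("\n");
    }
    return 0;
}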
--
Support the SAUVONS LA RECHERCHE movement:
http://recherche-en-danger.apinc.org/
_/_/_/_/ _/ _/ Dr. Pierre VALIRON
_/ _/ _/ _/ Laboratoire d'Astrophysique
_/ _/ _/ _/ Observatoire de Grenoble / UJF
_/_/_/_/ _/ _/ BP 53 F-38041 Grenoble Cedex 9 (France)
_/ _/ _/ http://www-laog.obs.ujf-grenoble.fr/~valiron/
_/ _/ _/ Mail: Pierre.Valiron_at_[hidden]
_/ _/ _/ Phone: +33 4 7651 4787 Fax: +33 4 7644 8821
_/ _/_/