> Hello Tim and All,
>
> There are some network parameters which MAY help MPI jobs run faster over
> TCP/IP.
>
> First, for pre-2.4.x kernels (which don't apply to RH EL3_U3), the
> default maximum packet size could be increased from 32K to 256K. 2.4.x
> and later kernels can increase the maximum packet size automatically
> in increments of 32K.
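If you do want to play with the kernel's TCP buffer limits on a 2.4 or
later kernel, the usual knobs are the net.core and net.ipv4 sysctls (this
may not be exactly the parameter meant above, and the values below are
only an example, not a recommendation for your setup):

  # check the current limits
  sysctl net.core.rmem_max net.core.wmem_max
  cat /proc/sys/net/ipv4/tcp_rmem /proc/sys/net/ipv4/tcp_wmem

  # raise them (the tcp_* entries are min/default/max in bytes)
  sysctl -w net.core.rmem_max=262144
  sysctl -w net.core.wmem_max=262144
  sysctl -w net.ipv4.tcp_rmem="4096 87380 262144"
  sysctl -w net.ipv4.tcp_wmem="4096 65536 262144"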
>
> Second, there was some talk when GigE was first introduced of increasing
> the maximum ethernet frame size from 1500 bytes to 9000 bytes (aka Jumbo
> Frames). In the early days of slow GigE switches, this did help reduce
> fragmentation, and improved throughput a good 90% (because clients were
> talking longer before the switch switched). With modern "wire-speed"
> switches, there is no switching delay, so jumbo and normal frame sizes
> transfer at about the same rate. Jumbo frames DO reduce the computational
> overhead of TCP/IP communication, cutting the amount of processor time
> spent for the actual transmission in half.
We have been benchmarking some switches at the University of Vienna and
found that switches that support jumbo frames (or actually just
user-defined MTUs) can perform quite badly with 9k, as the internal
memory management of some devices seems to be tuned for the default 1500
bytes. At an MTU of 9k we would get no more than 560 Mbit/s
unidirectional and 970 Mbit/s bidirectional (at 1600 bytes the switch
shows about 1560 Mbit/s bidirectional and 930 Mbit/s unidirectional at
very large packets); benchmarks were done with NetPipe-3.0a.
So if you want to take advantage of jumbo frames, you really have to
re-evaluate the switch performance.
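If you want to try jumbo frames anyway, the MTU is set per interface on
Linux and has to match on all nodes (and the switch has to support it);
the interface name and hostname below are just placeholders, and the
NetPipe invocation may differ slightly between versions:

  # set a 9000 byte MTU on every node (eth0 is just an example)
  ifconfig eth0 mtu 9000
  # or, with iproute2
  ip link set dev eth0 mtu 9000

  # then redo the benchmark, e.g. with NetPipe's TCP module
  NPtcp                   # on the receiving node
  NPtcp -h <receiver>     # on the transmitting node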
>
> As an example, let's say a program which uses a total CPU time of 7200
> seconds transfers 100GB of data (this is a lot for such a short running
> program). The actual transfer of data takes 850 seconds, and uses 250
> seconds on the CPUs (assuming 3.0 GHz) for system time. This means the
> actual computations took 6950 seconds of CPU time. Using jumbo frames,
> the system time is halved to 125 seconds, so now the job takes 6950 + 125
> = 7075 seconds, about 2% faster than with normal frames.
>
> Also, if the program used non-blocking calls, then that 850 seconds to
> transfer the data isn't really noticed, but if the calls are blocking,
> then that 850 seconds adds to the elapsed time (7200 + 850 = 8050). If
> the switching equipment is old, then that 850 seconds becomes 1700
> seconds (7200 + 1700 = 8900), which is a good 10% slower.
>
>
> If you are using a modern >2.4.x kernel and a modern LAM >7.0, then there
> really isn't any tuning that can be done to improve performance (outside
> of LAM), apart from jumbo frames, which the above example shows to be
> quite a marginal improvement on modern "wire-speed" switches. If your
> switches are old, then it might help, if the switches and network drivers
> support jumbo frames.
Depending on what hardware you are using, there are also a few module
parameters that can be tuned to improve network throughput. You can check
the MTRR settings (cat /proc/mtrr). Some NICs let you set the number of
events that will be handled per interrupt (e.g. max_interrupt_work=##);
depending on whether your tasks are throughput- or latency-sensitive,
this can change things. And depending on the load profile of the app, the
memory management can be tuned on some cards (e.g. the e1000 allows
setting the number of transmit descriptors).
So I would say there are a number of knobs to play with, but there is no
single parameter that will do it.
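Just as a sketch of the kind of thing I mean (the parameter names and
values below are examples only and depend on the driver and its version,
so check the driver documentation before using them):

  # check the MTRR settings
  cat /proc/mtrr

  # e1000: number of transmit/receive descriptors (driver-dependent)
  modprobe e1000 TxDescriptors=1024 RxDescriptors=1024

  # drivers like 3c59x expose the work done per interrupt
  modprobe 3c59x max_interrupt_work=64

  # to make it permanent, put the options in /etc/modules.conf, e.g.
  #   options e1000 TxDescriptors=1024 RxDescriptors=1024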
hofrat