
LAM/MPI General User's Mailing List Archives


From: Anthony J. Ciani (aciani1_at_[hidden])
Date: 2005-01-02 19:11:00


Hello Everyone,

On Fri, 17 Dec 2004, Tim Prince wrote:
> At 03:54 PM 12/17/2004, Anthony J. Ciani wrote:
>
>> Hello Tim and All,
>>
>> There are some network parameters which MAY help MPI jobs run faster over
>> TCP/IP.
>
> Thanks.
> We're testing over the weekend with tcp_rmem and tcp_wmem maxima increased to
> 8MB. I got a small apparent improvement on a single node lam-mpi test, on a
> machine with a "pro/100" equivalent chip set. The test cluster has gigabit,
> as well as Infiniband cards.
> Just spent a while replacing all CPUs on the cluster with faster ones, while
> learning how to straighten CPU pins bent in shipping.

tcp_rmem and tcp_wmem are good places to start, but don't forget to
increase the maximum socket memory (net.core.rmem_max, net.core.wmem_max)
to around twice the maximum TCP socket buffer size.
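
For example, something along these lines (the numbers are only
illustrative, chosen to match the 8 MB maxima discussed above):

echo 16777216 > /proc/sys/net/core/rmem_max
echo 16777216 > /proc/sys/net/core/wmem_max
echo "4096 87380 8388608" > /proc/sys/net/ipv4/tcp_rmem
echo "4096 65536 8388608" > /proc/sys/net/ipv4/tcp_wmem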

Another thing to try is changing the size of the transmit queue
(txqueuelen) and the receive backlog (netdev_max_backlog):
ifconfig eth0 txqueuelen 2000
echo 2000 > /proc/sys/net/core/netdev_max_backlog

You can also turn off SACK and TCP timestamps, which are useless for
clusters and LANs (but good for WANs):

echo 0 > /proc/sys/net/ipv4/tcp_sack
echo 0 > /proc/sys/net/ipv4/tcp_timestamps

and turn on the TCP low-latency option:
echo 1 > /proc/sys/net/ipv4/tcp_low_latency

although LAM already sets TCP_NODELAY, which disables Nagle's algorithm
and so prevents the 'clumping' of small writes into larger frames.
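
If you want these settings to survive a reboot, the same knobs can go
into /etc/sysctl.conf on most distributions (values again illustrative):

net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
net.ipv4.tcp_rmem = 4096 87380 8388608
net.ipv4.tcp_wmem = 4096 65536 8388608
net.core.netdev_max_backlog = 2000
net.ipv4.tcp_sack = 0
net.ipv4.tcp_timestamps = 0
net.ipv4.tcp_low_latency = 1

and load them with 'sysctl -p'. (txqueuelen still has to be set with
ifconfig, e.g. from an rc script.)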

If your driver does not support NAPI, you might try changing the amount
of work done per interrupt (interrupt coalescing). In the tg3 driver the
default coalescing parameters are defined in tg3.h with macros such as:
#define DEFAULT_TXCOL_TICKS 0x0000012c
#define DEFAULT_TXCOAL_TICK_INT 0x00000019
They are probably named differently in other drivers, and you should
probably write to the driver maintainer to ask about changing them. (tg3
does support NAPI for RX.)
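
Many drivers also expose coalescing parameters at run time through
ethtool; if your driver and your ethtool build support it, something like
this avoids recompiling (which parameters are adjustable depends on the
driver, and the 60 microsecond values are just an example):

ethtool -c eth0
ethtool -C eth0 rx-usecs 60 tx-usecs 60

The first command shows the current settings, the second changes them.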

I have experimented with these (the sysctl stuff) and found very minimal
improvements (1-2%). Also, the program we commonly use was written to
avoid various shortcomings in past versions of MPI implementations. After
removing some of those workarounds, there was also a minor improvement.

Just to give some numbers for typical GigE clusters:
Latency (ping): 70 microseconds
Data in transit: 9000 bytes
Packets per sec: 86000 (86000 interrupts per second)

The key value here is the amount of data in transit. It's only 9000 bytes
(6 full frames), which is roughly the bandwidth-delay product: 125 MB/s
times 70 microseconds is about 8750 bytes. Most modern machines can move
each frame through ethernet <-> tcp <-> socket <-> program in about
10 microseconds, which is less than the latency, so the buffers stay
nearly empty. To put it more simply, the buffer only needs to hold the
data in transit, or about 9000 bytes. Of course, it's still a little
better to let the program fill up the buffer and then let the CPU move
on to other things.

The buffers only contain the user data to be transmitted or just received;
the transmit and receive queues contain the actual ethernet frames.
Usually you only want to store, at most, a few milliseconds' worth of
frames to transmit (around 1,000) so that other programs don't see a big
delay on the network. On dedicated nodes a single program can hog the
queue anyway, so the queues could hold half the size of the tcp buffers,
or about 1400 frames for a 4 MB buffer (2 MB / 1500 bytes per frame).
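
On a dedicated node with 4 MB buffers you could apply that sizing with
the same commands as above (again, just an example):

ifconfig eth0 txqueuelen 1400
echo 1400 > /proc/sys/net/core/netdev_max_backlog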

Of course, for a good GigE network, modifying the buffers and queues from
the defaults won't net you much.

------------------------------------------------------------
               Anthony Ciani (aciani1_at_[hidden])
            Computational Condensed Matter Physics
    Department of Physics, University of Illinois, Chicago
               http://ciani.phy.uic.edu/~tony
------------------------------------------------------------