
LAM/MPI General User's Mailing List Archives


From: Greg Blair (gblair_at_[hidden])
Date: 2007-08-11 12:37:01


We have had a 4-machine MPI application running under LAM for about a
year. Each machine acquires and processes real-time data. Information
about the acquired data is exchanged with the other 3 machines. The
system uses matched MPI Ssend/Recv calls over jumbo-frame (MTU=9000)
1 GigE Ethernet. Each machine is connected to a 1 GigE switch
configured for jumbo frames.

About 99.9999% of the time this works.

However, the TCP transfers underneath the MPI software layer sometimes
time out. The kernel retries and eventually the TCP segment gets
through, the Ssend/Recv calls complete, and we have our data. But the
delay is excessive for our application: the real-time acquisition
falls apart and we have to restart the system.

We can tolerate an occasional dropped exchange but cannot tolerate
excessive delays, say anything greater than 20 msec.

We have tried:

1. Send and Ssend calls - made no difference

2. Using standard Ethernet MTU=1500 in place of jumbo-frame MTU=9000 -
jumbo frames are about 10% faster but make no difference to the
time-out issue.

3. Kernels from 2.6.17 through to 2.6.21 - made no difference.

4. Recompiling the kernel with TCP timeout reduced from 250 to 50 msec
- helps but does not solve the problem.

5. Changing 1 gigE switches - same problems but frequency of problem
varies with switch.

6. Interrupting the Ssend/Recv calls with a SIGALRM signal generated
by a "setitimer" interval timer. MPI does not return an error code (as
expected); it hangs when interrupted.

7. Enabling system call interruption with "siginterrupt" and using the
same "setitimer"/SIGALRM mechanism - no change - MPI still hangs.

8. Tried GridMPI. GridMPI attempts to solve the bursty-packet problem
by pacing packets at a fixed inter-packet spacing. See
http://www.gridmpi.org/ and specifically the
http://www.gridmpi.org/publications/cluster05-matsuda.pdf paper.
GridMPI, while it plausibly explained our dilemma, did not cure it; we
went back to LAM/MPI.

9. Switched from LAM 2.1.1 to 2.1.2 to 2.1.3 - no change. (We have
not tried 2.1.4.)

Any thoughts or suggestions?