LAM/MPI General User's Mailing List Archives

From: Bogdan Costescu (Bogdan.Costescu_at_[hidden])
Date: 2007-08-17 12:34:32


[ Sorry for the late reply... ]

On Sat, 11 Aug 2007, Greg Blair wrote:

> We can tolerate an exchange data drop out but cannot tolerate excessive
> timeouts, say greater than 20 msec.

Then I'd say that MPI over TCP/IP was a poor choice for data
exchange between processes. Something like UDP seems a lot more
appropriate, possibly with some control mechanisms like RDP (Reliable
Datagram Protocol) or, even better, RTP (Real-time Transport Protocol),
which is often used for video/audio transmissions with the same
characteristics as your transmission: dropping is bad, delay is worse.
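
To make the UDP suggestion concrete, here is a minimal sketch of a
receiver that gives up on an exchange once the 20 msec budget from the
question has passed, instead of waiting for a retransmission. The
function names, the buffer size and the loopback address are
assumptions for illustration only; nothing here comes from LAM/MPI or
from the original application.

```python
import socket

# 20 msec budget taken from the original question; everything else
# in this sketch (names, buffer size, address) is hypothetical.
DEADLINE_S = 0.020


def make_receiver(host="127.0.0.1", port=0):
    """Create a UDP socket whose receive calls give up after DEADLINE_S."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind((host, port))  # port 0 lets the OS pick a free port
    sock.settimeout(DEADLINE_S)
    return sock


def recv_or_skip(sock, bufsize=2048):
    """Return the next datagram, or None if none arrives in time.

    A None result is treated as a dropped exchange: the caller moves
    on to the next cycle rather than stalling.
    """
    try:
        data, _addr = sock.recvfrom(bufsize)
        return data
    except socket.timeout:
        return None
```

The caller decides what a missed exchange means (reuse the previous
value, interpolate, skip); the transport never blocks longer than the
deadline.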

> 4. Recompiling the kernel with TCP timeout reduced from 250 to 50 msec
> - helps but does not solve the problem.

This just allows the kernel to notice sooner that a packet might be
missing and retry transmission - it only eases the symptoms, but does
not cure the cause. You can check this by looking at the retransmission
count among the TCP statistics (e.g. 'netstat --statistics --tcp').
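
As a rough way to track the same counters from a script, the sketch
below parses the "Tcp:" lines in the format Linux exposes in
/proc/net/snmp and computes the share of retransmitted segments. It
assumes a Linux host; the function names are made up for this example.

```python
def parse_tcp_counters(snmp_text):
    """Parse the two 'Tcp:' lines (names, then values) into a dict."""
    header = values = None
    for line in snmp_text.splitlines():
        if line.startswith("Tcp:"):
            fields = line.split()[1:]
            if header is None:
                header = fields  # first Tcp: line holds counter names
            else:
                values = fields  # second Tcp: line holds the numbers
                break
    if header is None or values is None:
        raise ValueError("no complete Tcp: section found")
    return {name: int(value) for name, value in zip(header, values)}


def retransmit_ratio(counters):
    """Fraction of sent segments that were retransmissions."""
    sent = counters["OutSegs"]
    return counters["RetransSegs"] / sent if sent else 0.0


def read_tcp_counters(path="/proc/net/snmp"):
    """Read the live counters; the default path assumes Linux."""
    with open(path) as f:
        return parse_tcp_counters(f.read())
```

The counters are cumulative since boot, so the useful number is the
difference between two readings taken before and after a test run.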

> 5. Changing 1 gigE switches - same problems but frequency of problem
> varies with switch.

This seems to indicate that the hardware side is responsible for
losing packets. It doesn't necessarily mean that the switch is bad;
it could also be a problem with cabling, network cards and especially
link negotiation between card and switch port.
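
One hedged way to look at the negotiation result on each node is to
read what the Linux driver reports under /sys/class/net. The sketch
below assumes Linux sysfs and a hypothetical interface name passed in
by the caller; the helper names and the expected speed of 1000 Mbit/s
(for the gigE setup described) are illustrative choices.

```python
import os


def read_link_state(iface, sysfs_root="/sys/class/net"):
    """Collect negotiated speed/duplex and error counters for iface.

    Values are returned as strings; None means the file was missing
    or unreadable (some drivers refuse to report speed when the link
    is down).
    """
    base = os.path.join(sysfs_root, iface)

    def read(rel):
        try:
            with open(os.path.join(base, rel)) as f:
                return f.read().strip()
        except OSError:
            return None

    return {
        "speed_mbps": read("speed"),
        "duplex": read("duplex"),
        "rx_errors": read("statistics/rx_errors"),
        "rx_dropped": read("statistics/rx_dropped"),
    }


def looks_misnegotiated(state, expected_speed="1000"):
    """True if the link is not at the expected speed in full duplex."""
    return state["speed_mbps"] != expected_speed or state["duplex"] != "full"
```

A port that came up at 100 Mbit/s or half duplex, or error counters
that keep growing during a run, would point at the card, cable or
switch port rather than at MPI.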

> 9. Switched from LAM 2.1.1 to 2.1.2 to 2.1.3 - no change (We have not
> tried 2.1.4)

LAM is currently at 7.1.x; is this a typo on your side?

-- 
Bogdan Costescu
IWR - Interdisziplinaeres Zentrum fuer Wissenschaftliches Rechnen
Universitaet Heidelberg, INF 368, D-69120 Heidelberg, GERMANY
Telephone: +49 6221 54 8869, Telefax: +49 6221 54 8868
E-mail: Bogdan.Costescu_at_[hidden]