[ Sorry for the late reply... ]
On Sat, 11 Aug 2007, Greg Blair wrote:
> We can tolerate an exchange data drop out but cannot tolerate excessive
> timeouts, say greater than 20 msec.
Then I'd say that MPI over TCP/IP was a poor choice for data
exchange between processes. Something like UDP seems a lot more
appropriate, possibly with a control mechanism on top such as RDP
(Reliable Data Protocol) or, even better, RTP (Real-time Transport
Protocol), which is often used for video/audio transmissions with the
same characteristics as your transmission: dropping is bad, delay is
worse.
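To illustrate the "drop rather than wait" idea, here is a minimal
sketch of a UDP receiver that gives up on a datagram after 20 msec
instead of blocking on retransmission. It is only an illustration, not
a replacement for your MPI code: the function names, the loopback
address and the 20 msec default are my own assumptions.

```python
import socket

def make_receiver(host="127.0.0.1", port=0, timeout_s=0.020):
    """Create a UDP socket whose receive calls give up after timeout_s.

    host, port and the 20 msec default are illustrative assumptions.
    port=0 lets the OS pick a free port.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind((host, port))
    sock.settimeout(timeout_s)
    return sock

def recv_or_skip(sock, max_size=65507):
    """Return the next datagram, or None if none arrived before the timeout.

    A None result means "treat this exchange as dropped and move on";
    there is no retransmission at this level.
    """
    try:
        data, _addr = sock.recvfrom(max_size)
        return data
    except socket.timeout:
        return None
```

The caller decides what a missing exchange means (reuse the previous
value, interpolate, skip the step); the point is only that the wait is
bounded by the socket timeout rather than by TCP retransmission timers.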
> 4. Recompiling the kernel with TCP timeout reduced from 250 to 50 msec
> - helps but does not solve the problem.
This just allows the kernel to notice sooner that a packet might be
missing and retry the transmission; it only eases the symptoms, but
does not cure the cause. You can check this by looking at the
retransmission count among the TCP statistics (e.g.
'netstat --statistics --tcp').
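If you want to watch the retransmission count over time rather than
read it by eye, the same counters are exposed on Linux in
/proc/net/snmp as a "Tcp:" header line followed by a "Tcp:" value
line. A rough sketch of pulling them out (the function names are mine;
sample the file before and after a run and compare):

```python
def parse_tcp_counters(snmp_text):
    """Parse the two 'Tcp:' lines of /proc/net/snmp into a dict.

    The first 'Tcp:' line holds the field names, the second the values.
    """
    tcp_lines = [line.split()[1:]
                 for line in snmp_text.splitlines()
                 if line.startswith("Tcp:")]
    if len(tcp_lines) < 2:
        raise ValueError("expected a Tcp: header line and a Tcp: value line")
    names, values = tcp_lines[0], tcp_lines[1]
    return {name: int(value) for name, value in zip(names, values)}

def retransmit_ratio(counters):
    """Fraction of sent TCP segments that were retransmissions."""
    out_segs = counters.get("OutSegs", 0)
    if out_segs == 0:
        return 0.0
    return counters.get("RetransSegs", 0) / out_segs

def read_tcp_counters(path="/proc/net/snmp"):
    """Read and parse the live counters (Linux only)."""
    with open(path) as f:
        return parse_tcp_counters(f.read())
```

A RetransSegs value that keeps growing during a run confirms that
packets are really being lost, independent of how the timeout is tuned.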
> 5. Changing 1 gigE switches - same problems but frequency of problem
> varies with switch.
This seems to indicate that the hardware side is responsible for
losing packets. It doesn't necessarily mean that the switch is bad;
it could also be a problem with cabling, network cards and especially
link negotiation between card and switch port.
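A quick way to check the negotiation result on each node is to read
the speed and duplex that the driver reports under /sys/class/net on
Linux. The sketch below assumes an interface name such as eth0 and an
expected speed of 1000 Mbit/s; both are placeholders, and the helper
names are mine.

```python
import os

def read_link_state(iface, sysfs_root="/sys/class/net"):
    """Return the negotiated speed and duplex of iface as strings.

    A value of None means the attribute could not be read, for
    example because the link is down or the driver does not report it.
    """
    state = {}
    for attr in ("speed", "duplex"):
        path = os.path.join(sysfs_root, iface, attr)
        try:
            with open(path) as f:
                state[attr] = f.read().strip()
        except OSError:
            state[attr] = None
    return state

def looks_misnegotiated(state, expected_speed="1000"):
    """True if the link is not running at the expected speed in full duplex."""
    return state["speed"] != expected_speed or state["duplex"] != "full"
```

A port that came up at 100 Mbit/s or half duplex on one side is a
classic source of sporadic packet loss, and it is worth comparing with
what the switch reports for the same port.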
> 9. Switched from LAM 2.1.1 to 2.1.2 to 2.1.3 - no change (We have not
> tried 2.1.4)
LAM is currently at 7.1.x; is this a typo on your side?
--
Bogdan Costescu
IWR - Interdisziplinaeres Zentrum fuer Wissenschaftliches Rechnen
Universitaet Heidelberg, INF 368, D-69120 Heidelberg, GERMANY
Telephone: +49 6221 54 8869, Telefax: +49 6221 54 8868
E-mail: Bogdan.Costescu_at_[hidden]