Hi LAM folks!
Following Robin's advice (thanks Robin), I changed my kernel and e1000
driver. The kernel was changed to 2.4.20-20.8, and several others were
tested as well; the kernel was not responsible for the lame performance
of the MPI communication. I also tested several versions of the Intel
e1000 drivers: the recent versions tend to be slow, while the earlier
versions, in my case, tend to be unstable. Version 4.4.12 now shows a
latency of 60 usec and 850 Mbit/sec bandwidth, full duplex (both with
NetPipe and with my own Fortran application; the NetPipe TCP numbers are
exactly the same).
What might be the difference between the 4.4.12 and 4.4.12-k1 versions
of the e1000 driver?
Now that I'm getting the right communication performance, I've tried to
overlap the communication with some computation. The computations are,
like our CFD applications, memory-bound (moving a couple of large arrays
in and out of cache). Is this the reason why overlapping communication
and computation gives only a marginal reduction (up to 20%) of the
communication time (measured as the sum of the time differences over the
3 MPI calls)?
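
To make clear what I mean by overlap, here is a minimal Fortran sketch
of the kind of pattern I'm timing (the buffer size, the ring-neighbor
exchange, and the dummy computation are just placeholders, not my actual
code):

  program overlap_sketch
    implicit none
    include 'mpif.h'
    integer, parameter :: n = 1000000
    double precision   :: sendbuf(n), recvbuf(n), work(n)
    integer :: ierr, rank, nprocs, next, prev
    integer :: req(2), stats(MPI_STATUS_SIZE, 2)

    call MPI_Init(ierr)
    call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
    call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)
    next = mod(rank + 1, nprocs)
    prev = mod(rank - 1 + nprocs, nprocs)

    sendbuf = dble(rank)
    work    = 1.0d0

    ! 1) post the non-blocking receive and send up front
    call MPI_Irecv(recvbuf, n, MPI_DOUBLE_PRECISION, prev, 0, &
                   MPI_COMM_WORLD, req(1), ierr)
    call MPI_Isend(sendbuf, n, MPI_DOUBLE_PRECISION, next, 0, &
                   MPI_COMM_WORLD, req(2), ierr)

    ! 2) memory-bound work that does not touch the message buffers
    work = 2.0d0 * work + 1.0d0

    ! 3) wait for both transfers to complete
    call MPI_Waitall(2, req, stats, ierr)

    call MPI_Finalize(ierr)
  end program overlap_sketch

The time differences I sum are taken around the Irecv/Isend pair and
around the Waitall.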
Are there any general hints for getting some real benefit from this kind
of overlap (this is for a cluster of single-processor systems)?