I am seeing mysterious communication delays of up to 35 ms.
The time is spent inside blocking send/probe (and possibly receive) calls.
The sizes of the outstanding messages range from 70 to
130 (datatype char). The delays occurred on several ranks,
not only on one particular node.
The delays start to occur at about 7 nodes. Runs with fewer than
7 nodes all went well. With more nodes, the frequency of the
delays increased, e.g. there were 6 such stalls with 9 nodes and
9 stalls with 11 nodes.
I am using LAM 7.0.4 on Linux 2.4.18-10 (RedHat 7.3) with TCP
and client-to-client mode. I used XMPI for tracing timelines.
I could reproduce the behavior in several runs, and the points where
it happened were almost the same across all runs.
I tried to reproduce it with netcat, but the maximum communication
latency did not rise in the manner described above.
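
A point-to-point TCP check of this kind could also be scripted, which
makes it easier to count stalls. The following is only a minimal Python
sketch, not the netcat invocation used above: it echoes 70 to 130 byte
messages over one TCP connection and flags round trips above a
threshold. All function names, the 30 ms threshold, and the optional
TCP_NODELAY switch are assumptions made for illustration. Nothing here
reflects how LAM itself configures its sockets.

```python
import socket
import threading
import time


def echo_server(listener):
    """Accept one connection and echo every byte back until the peer closes."""
    conn, _ = listener.accept()
    with conn:
        while True:
            data = conn.recv(4096)
            if not data:
                break
            conn.sendall(data)


def recv_exact(sock, n):
    """Read exactly n bytes (TCP may deliver one message in several pieces)."""
    buf = bytearray()
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("peer closed the connection early")
        buf.extend(chunk)
    return bytes(buf)


def measure_round_trips(host, port, sizes, stall_threshold_s=0.030, nodelay=False):
    """Send one message per entry in sizes and time each echo round trip.

    Returns (latencies, stalls); stalls lists (index, seconds) for every
    round trip at or above stall_threshold_s (hypothetical 30 ms default).
    """
    latencies = []
    with socket.create_connection((host, port)) as sock:
        if nodelay:
            # Optional experiment: disable Nagle's algorithm on this socket.
            sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
        for size in sizes:
            payload = b"x" * size
            start = time.perf_counter()
            sock.sendall(payload)
            recv_exact(sock, size)
            latencies.append(time.perf_counter() - start)
    stalls = [(i, t) for i, t in enumerate(latencies) if t >= stall_threshold_s]
    return latencies, stalls


def run_local_demo(rounds=200, nodelay=False):
    """Run the echo test against a server thread on the loopback interface."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    listener.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    listener.listen(1)
    port = listener.getsockname()[1]
    server = threading.Thread(target=echo_server, args=(listener,), daemon=True)
    server.start()
    sizes = [70 + (i % 61) for i in range(rounds)]  # cycles through 70..130 bytes
    try:
        return measure_round_trips("127.0.0.1", port, sizes, nodelay=nodelay)
    finally:
        server.join(timeout=5)
        listener.close()


if __name__ == "__main__":
    latencies, stalls = run_local_demo()
    print("max latency: %.3f ms" % (max(latencies) * 1e3))
    print("round trips at or above threshold:", len(stalls))
```

To test between two real nodes, echo_server would run on one host and
measure_round_trips on the other. A loopback run only exercises the
local TCP stack and says little about the cluster network.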
Has anyone experienced a similar situation?
Any ideas welcome!
Thanks,
Mathias Kurth.