I'm trying to run HPL on a large cluster and am hitting performance
problems I haven't been able to track down. I'm looking for suggestions
on their probable cause.
Our setup:
I'm currently trying to run it across 196 homogeneous dual-processor
x86 (P4) nodes using LAM 6.5.9 compiled with gcc 3.2.2. HPL and ATLAS
3.4.1 were also compiled with gcc 3.2.2.
The LAM, HPL, gcc & ATLAS libraries and executables are shared to the
nodes over NFS mounts. All nodes are connected and communicate over
Gigabit Ethernet links using a jumbo MTU of 9000.
lamboot initializes the nodes just fine, and tping and the pi sample
program run perfectly.
The problem:
When I run the xhpl binary it uses only a small fraction of the CPU
power on the nodes. If I ssh to a given node, xhpl sits at 30-80% of
the available CPU time, often at the lower end of that range. In fact,
my GFLOPS figure seems to suffer the more nodes I add, which to me
suggests a communication problem of some sort.
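To put a number on the scaling loss, I've been comparing measured
GFLOPS against theoretical peak (nodes x CPUs x clock x flops/cycle).
The figures below are purely illustrative, not our actual specs or
results; the point is just the arithmetic:

```shell
# Hypothetical figures for illustration; substitute the real clock
# speed, flops/cycle, and measured HPL result.
NODES=196
CPUS_PER_NODE=2
CLOCK_GHZ=2.4        # assumed P4 clock speed
FLOPS_PER_CYCLE=2    # assumed for a P4 doing SSE2 double precision
MEASURED_GFLOPS=120  # hypothetical HPL result

# Theoretical peak in GFLOPS:
PEAK=$(awk -v n=$NODES -v c=$CPUS_PER_NODE -v g=$CLOCK_GHZ \
       -v f=$FLOPS_PER_CYCLE 'BEGIN { print n * c * g * f }')
# Parallel efficiency as a percentage:
EFF=$(awk -v m=$MEASURED_GFLOPS -v p=$PEAK \
      'BEGIN { printf "%.1f", 100 * m / p }')
echo "peak: $PEAK GFLOPS, efficiency: $EFF%"
```

Well-tuned HPL runs typically land at a healthy fraction of peak even
on commodity interconnects, so single-digit efficiency like the
hypothetical number above would reinforce the communication theory.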
When I run mpirun with the -lamd option, performance improves somewhat
but still not to expected levels. Interestingly, in this mode the lamd
daemon uses 30-40% CPU and each xhpl binary also uses 30-40% CPU.
The symptoms seem to indicate a network performance problem; however,
when I test point-to-point performance between nodes using netperf, I
see TCP/UDP stream throughput in the 990 Mbit/s range.
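Since bulk streams look fine, the next thing I plan to check is
small-message latency, which the stream tests don't exercise. netperf's
TCP_RR test reports request/response transactions per second for tiny
messages, and the inverse of that rate approximates the round-trip
time. A sketch (the rate is hypothetical and the hostname is a
placeholder):

```shell
# Convert a hypothetical TCP_RR rate to approximate round-trip latency:
TPS=10000   # hypothetical transactions/sec reported by netperf
RTT_US=$(awk -v t=$TPS 'BEGIN { printf "%.0f", 1e6 / t }')
echo "approx round-trip latency: $RTT_US us"

# The invocation itself ("node01" is a placeholder compute node):
echo "netperf -H node01 -t TCP_RR -l 10"
```

High latency with good bandwidth would fit the symptoms, since HPL's
frequent small broadcasts are latency-bound even when streams saturate
the link.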
Are there other possibilities I have yet to look into? Does anyone have
suggestions on a troubleshooting technique to pin this down? Any help
much appreciated.
Regards,
Peter McLachlan