On Wed, Jun 04, 2003 at 11:30:10PM -0400, Peter McLachlan wrote:
>nodes at a time. I'd love to see a meshed implementation that allowed you
>to test bandwidth simultaneously between N nodes. (I know, I know, I
>should stop complaining and write one :P)
the PMB benchmark does this.
http://www.pallas.com/e/products/pmb/
>running is the Linpack cluster benchmark. (HPL == High Performance
>Linpack I believe). It should scale well I would think. The latency
it scales fairly linearly.
>numbers seem to be fine from netPIPE, latency for a 131kb message for
>instance is 0.001512s which seems reasonable.
There are a million tunables in the HPL code; you will probably have to
try many combinations before you find a good set for your particular machine.
The HPL tuning guide will help. The first step is to choose a problem
size that uses (almost) all the memory.
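A rough sketch of that sizing rule: the HPL matrix is N x N doubles, so pick
N so the matrix fills most (but not all) of total cluster memory, rounded to
a multiple of the block size NB. The 80% fraction and NB=128 below are just
illustrative assumptions, not values from the tuning guide.

```python
import math

def hpl_problem_size(mem_per_node_gib, nodes, fraction=0.8, nb=128):
    """Pick an HPL problem size N whose N x N double matrix uses
    roughly `fraction` of total memory, rounded down to a multiple
    of the block size nb. fraction/nb are illustrative defaults."""
    total_bytes = mem_per_node_gib * nodes * 1024**3
    n = int(math.sqrt(fraction * total_bytes / 8))  # 8 bytes per double
    return (n // nb) * nb

# e.g. 16 nodes with 2 GiB each:
print(hpl_problem_size(2, 16))
```

You would then sweep NB, P, Q and the other HPL.dat parameters around a
problem size in that neighbourhood.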
Also, if you are aiming for top500.org rather than just running for fun,
you should be using Intel's MKL BLAS, as it is much faster than ATLAS.
There are other tweaks out there, but I'll leave you to find those :)
Turning off hyperthreading (boot with 'noht') will also probably help.
Are you running with something like 'mpirun C ...'?
Locality is important for most parallel codes, and the C option puts
neighbouring processes on the same node.
Also, the -O option is important for lam-6.5.9, but lam-7 auto-detects
this, I believe.
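Putting those flags together, a LAM/MPI session might look something like
the following (hostfile name and binary path are placeholders, and flag
order may vary between LAM versions; check mpirun(1) for your release):

```shell
lamboot -v hostfile     # start the LAM daemons on every node listed
mpirun -O C ./xhpl      # 'C' = one process per CPU, placing neighbouring
                        #       ranks on the same node (good locality);
                        # '-O' = treat the cluster as homogeneous (lam-6.5.x)
lamhalt                 # shut the daemons down when you are done
```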
cheers,
robin