On Sat, Jun 14, 2003 at 04:40:01PM -0500, Brian W. Barrett wrote:
>There are a couple of flaws with your estimates. The theoretical peak
>performance for a modern CPU isn't exactly 1Ghz == 1GFlop. Intel doesn't
For a Xeon it's 2 floating-point ops per Hz, courtesy of the SSE/SSE2 units.
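As a worked example (the 2.4 GHz clock here is purely illustrative, not
a measurement of any particular box):

    peak  = clock rate * flops/cycle = 2.4 GHz * 2  = 4.8 GFlop/s
    Rmax ~= 0.70 * peak                            ~= 3.4 GFlop/s

with the 0.70 factor explained below.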
>make their peak performance estimates readily available, so I'm not
>exactly sure what peak would be for a single CPU. But you are seeing
In reality with HPL you get at most about 1.4 floating-point ops/Hz (70% of
peak), which you can see from a bit of analysis of the Intel machines on
the top500.org website. Using icc/ifc, MKL (usually disabling OpenMP)
and http://www.cs.utexas.edu/users/flame/goto/ gives you the best results.
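For what it's worth, the BLAS gets wired in via HPL's Make.<arch> file;
a sketch of the relevant lines (all the paths and the library name are
guesses for your setup, adjust to taste):

    # hypothetical fragment of Make.Xeon -- paths are examples only
    MPdir    = /usr/local/lam
    LAdir    = $(HOME)/goto
    LAlib    = $(LAdir)/libgoto.a -lpthread
    CC       = $(MPdir)/bin/mpicc
    CCFLAGS  = $(HPL_DEFS) -O3
    LINKER   = $(CC)

then build with 'make arch=Xeon'.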
>improved a bit more by using a good compiler instead of GCC. Then (as has
Pretty much all of HPL's time is spent in dgemm, so the compiler doesn't
actually matter much. Latency and bandwidth do, though, so stick with LAM :)
Remember the -O option to mpirun too!
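i.e. something like this (the boot schema file and process count are
made up):

    lamboot ./lamhosts
    mpirun -O -np 16 ./xhpl

where -O tells LAM the nodes are homogeneous, so it can skip data
conversion on every message.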
>(and note that the parameters that give best realized performance on a
>single node most likely will not be the same that give best performance
>across the cluster). But you still aren't going to get great scalability
>with HPL.
I think HPL scales well, and it certainly loves lots of memory per node.
But yes, you are correct: many of the tweakables in HPL only matter in
parallel mode, and because of the complex interactions between parameters
(which means O(week) of trying semi-random combinations) the serial speed
doesn't tell you all that much.
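To make "loves lots of memory" concrete: the usual rule of thumb is to
pick the problem size N so the matrix fills most (say 80%) of aggregate
RAM, since the O(N^3) flops grow faster than the O(N^2) data. A rough
sizing, with purely illustrative numbers:

    N ~= sqrt( 0.80 * total_mem_in_bytes / 8 )    (8 bytes per double)

    e.g. 16 nodes x 1 GB each:  sqrt(0.80 * 16e9 / 8) ~= 40000

N, the block size NB and the PxQ process grid then all go into HPL.dat.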
>However, if your applications all are going to be sensitive to latency and
>bandwidth constraints, this is when you take the HPL numbers, wave them
>around, and beg for money for a faster interconnect.
:-)
cheers,
robin