
LAM/MPI General User's Mailing List Archives


From: Brian W. Barrett (brbarret_at_[hidden])
Date: 2003-06-14 16:40:01


Hello -

There are a couple of flaws in your estimates. The theoretical peak
performance of a modern CPU isn't simply 1 GHz == 1 GFlops. Intel doesn't
make its peak performance figures readily available, so I'm not exactly
sure what peak would be for a single CPU. But against a naive 1 GFlops
per GHz estimate, you are seeing about 60% on a single CPU (1.58 out of
2.6 GFlops). This can probably be improved by using the Intel math
libraries instead of ATLAS, and improved a bit more by using a good
compiler instead of GCC. Then (as has been mentioned recently on the
list) there are a whole bunch of tuning parameters in HPL that you can
play with to get better performance (see the HPL.dat sketch below).
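
For concreteness, the single-node knobs live in the HPL.dat input file
that ships with HPL; the problem size (N) and block size (NB) lines are
the ones to start playing with. A minimal sketch of just those lines
(the values are placeholders taken from your runs, not a tuned
recommendation):

  1            # of problem sizes (N)
  8000         Ns
  1            # of NBs
  128          NBs

You can list several values on the Ns and NBs lines (with the count on
the line above each) and HPL will sweep all the combinations in one run.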

So, once you have decent performance on a single node, you have the
problem of scaling to multiple nodes. Let's assume for the moment that
1.58 GFlops is peak for your CPU. Then peak for your cluster is 25.28
GFlops (1.58 GFlops * 16). You saw about 9.64 GFlops, which is about 40%
of our assumed peak. The tuning parameters in HPL will make a difference
here as well (and note that the parameters that give the best realized
performance on a single node will most likely not be the same ones that
give the best performance across the cluster; the process grid sketched
below is the big one). But you still aren't going to get great
scalability with HPL.
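
The big cross-node knob is the process grid (P x Q), also set in
HPL.dat. For 16 processes you can ask HPL to try several grid shapes in
a single run; again, a sketch with placeholder values rather than a
recommendation:

  3            # of process grids (P x Q)
  1  2  4      Ps
  16 8  4      Qs

P times Q has to match the number of MPI processes you launch, and as I
recall the HPL tuning notes suggest keeping P <= Q and the grid close to
square, but on 100Mb ethernet it is worth measuring the shapes rather
than guessing.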

On your cluster, the speed of the network (in both bandwidth and latency)
is dwarfed by the speed of your processors. HPL will expose the latency
problems in your cluster (which are just inherent in a 100Mb ethernet
interconnect), greatly reducing the scalability of your benchmark. This
may or may not be a problem - your cluster isn't going to make it in the
top of the Top500, so unless the applications running on your cluster
mimic HPL's performance characteristics, none of this is really an issue.
However, if your applications are all going to be sensitive to latency and
bandwidth constraints, this is when you take the HPL numbers, wave them
around, and beg for money for a faster interconnect.
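
If you want to put rough numbers on the latency and bandwidth of your
interconnect before asking for money, a quick ping-pong test between two
nodes will do it. This is just a sketch typed from memory (the
repetition counts and the 1 MB message size are arbitrary choices);
compile it with mpicc and run it across two nodes with something like
"mpirun -np 2 ./pingpong":

  /* Minimal MPI ping-pong sketch: rank 0 and rank 1 bounce messages back
   * and forth; any other ranks just sit in the barriers and exit. */
  #include <stdio.h>
  #include <stdlib.h>
  #include <mpi.h>

  int main(int argc, char **argv)
  {
      const int lat_reps = 1000;        /* round trips for the latency test */
      const int bw_reps = 50;           /* round trips for the bandwidth test */
      const int nbytes = 1024 * 1024;   /* 1 MB message for the bandwidth test */
      int rank, i;
      char *buf;
      double t0, t1;
      MPI_Status status;

      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      buf = (char *) malloc(nbytes);

      /* Latency: bounce an empty message back and forth. */
      MPI_Barrier(MPI_COMM_WORLD);
      t0 = MPI_Wtime();
      for (i = 0; i < lat_reps; i++) {
          if (rank == 0) {
              MPI_Send(buf, 0, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
              MPI_Recv(buf, 0, MPI_CHAR, 1, 0, MPI_COMM_WORLD, &status);
          } else if (rank == 1) {
              MPI_Recv(buf, 0, MPI_CHAR, 0, 0, MPI_COMM_WORLD, &status);
              MPI_Send(buf, 0, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
          }
      }
      t1 = MPI_Wtime();
      if (rank == 0)
          printf("one-way latency : %8.1f usec\n",
                 (t1 - t0) / (2.0 * lat_reps) * 1.0e6);

      /* Bandwidth: bounce a 1 MB message back and forth. */
      MPI_Barrier(MPI_COMM_WORLD);
      t0 = MPI_Wtime();
      for (i = 0; i < bw_reps; i++) {
          if (rank == 0) {
              MPI_Send(buf, nbytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
              MPI_Recv(buf, nbytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD, &status);
          } else if (rank == 1) {
              MPI_Recv(buf, nbytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD, &status);
              MPI_Send(buf, nbytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
          }
      }
      t1 = MPI_Wtime();
      if (rank == 0)
          printf("bandwidth       : %8.2f MB/s\n",
                 2.0 * bw_reps * nbytes / (t1 - t0) / 1.0e6);

      free(buf);
      MPI_Finalize();
      return 0;
  }

On 100Mb ethernet you should expect something in the rough neighborhood
of 10-12 MB/s and latencies on the order of 100 microseconds, which is
exactly the mismatch against those CPUs.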

Hope this helps,

Brian

On Sat, 14 Jun 2003, Gaurav Jain wrote:

> Hi all,
>
> I tried to build an 8-node cluster with the following configuration:
>
> Dual-cpu Intel Xeon 2.6 GHz (hyperthreading disabled)
> 1 GB RAM
> 100 Mbps LAN
>
> and
>
> RedHat Linux 8.0 with kernel 2.4.20 (with OpenMosix patch)
> LAM-MPI 6.5.9
> ATLAS
>
> When I run HPL (www.netlib.org/benchmark/hpl) with N=(5000-25000) and
> different combinations of options,
> over 1 CPU (with SMP disabled), I get peak performance of 1.58 GFlops
> (N=8000, NB=128)
> over 2 CPUs (with SMP enabled), I get peak performance of 2.85 GFlops
> (N=10000, NB=128)
> over 16 CPUs (8-node cluster), I get peak performance of 9.64 GFlops
> (N=10000, NB=128)
>
> As I understand it,
> with one CPU, the peak performance is 2.6 GHz * 2 = 5.2 GFlops
> with one node, the peak performance is 2.6 GHz * 2 * 2 = 10.4 GFlops
> with eight nodes, the peak performance is 2.6 GHz * 2 * 2 * 8 = 83.2 GFlops
>
> But I am getting only about 11% of peak performance. Any pointers on what
> I may be doing wrong?
>
> Regards,
> Gaurav Jain
>
>
>
> _______________________________________________
> This list is archived at http://www.lam-mpi.org/MailArchives/lam/
>

-- 
  Brian Barrett
  LAM/MPI developer and all around nice guy
  Have a LAM/MPI day: http://www.lam-mpi.org/