
LAM/MPI General User's Mailing List Archives


From: Bussoletti, John E (john.e.bussoletti_at_[hidden])
Date: 2004-05-24 12:34:29


Richard,
You might look at CFL3D, a NASA/LaRC code. It's a multiblock structured
grid code and has some tools for estimating work to help balance the
workload among multiple CPUs, recommending how many to use for best
performance. OVERFLOW is another possibility. It's known to scale
extremely well on shared memory machines such as the SGI Origin or
Altix. On distributed systems it is reported to scale less well.

In general, parallel codes become less efficient as more CPUs are used.
The ideal of a linear speedup is part of the elusive "perfect scaling"
target, but there is overhead in communication, and that almost always
leads to something less than perfect scaling. "Superlinear" behavior is
occasionally observed on some systems, when splitting the data gives a
better cache fit on each CPU and gets past some single-CPU bottleneck,
but that's rare.

There is also a non-trivial dependence on problem size. For a fixed
problem size, running in parallel leaves a smaller and smaller amount of
work on each CPU. Ultimately the computation becomes trivial compared to
the communication, and scaling becomes poor.

I've used the following approach to help size a case to a parallel
system: Assume Twall is the wall time required to solve a problem using
some number of CPUs, C. Then we can model the performance of a system
by the equation:
        Twall = Ts + Tp/C
Now run the case with C1 CPUs and C2 CPUs and you can solve for the two
coefficients. Then you can predict how the case will perform for an
arbitrary number of CPUs. You can get more complicated and say:
        Ts = Ts0 + Tsi * I
        Tp = Tp0 + Tpi * I
where I is the number of iterations, multigrid cycles, or whatever
measure you like to use to characterize the units of work expended in
solving the problem. Then run each of the two previous cases with a
varying number of iterations to determine the coefficients Ts0, Tsi,
Tp0, and Tpi.
Now you need a minimum of four runs, varying C and I and measuring
Twall.
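
As a rough sketch of that calibration in Python with NumPy (the run
numbers below are made up for illustration, and the helper names
fit_two_runs and predict are arbitrary), both models are linear in
their coefficients, so a small least-squares fit does the job:

        import numpy as np

        # Basic two-run model: Twall = Ts + Tp/C.
        # Two runs (C1, T1) and (C2, T2) determine Ts and Tp exactly.
        def fit_two_runs(C1, T1, C2, T2):
            Tp = (T1 - T2) / (1.0 / C1 - 1.0 / C2)
            Ts = T1 - Tp / C1
            return Ts, Tp

        # Extended model: Twall = Ts0 + Tsi*I + (Tp0 + Tpi*I)/C.
        # Linear in [Ts0, Tsi, Tp0, Tpi], so four or more runs varying
        # C and I can be fit by least squares.  (Made-up numbers below.)
        runs = [            # (C = CPUs, I = iterations, Twall in seconds)
            (2, 100, 250.0),
            (8, 100,  85.0),
            (2, 200, 450.0),
            (8, 200, 150.0),
        ]
        A = np.array([[1.0, I, 1.0 / C, I / C] for C, I, _ in runs])
        b = np.array([Twall for _, _, Twall in runs])
        (Ts0, Tsi, Tp0, Tpi), *_ = np.linalg.lstsq(A, b, rcond=None)

        def predict(C, I):
            # Predicted wall time for C CPUs and I iterations.
            return Ts0 + Tsi * I + (Tp0 + Tpi * I) / C

        print(predict(16, 100))   # extrapolate to an untested CPU count

With more than the minimum number of runs, the least-squares residuals
also give a feel for how noisy the calibration is.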

Then you can get even more complicated and make the various terms also
depend on a polynomial in the number of cells, edges, or nodes, whatever
makes sense for your problem. Now you vary the grid size as well, look
over a sequence of runs, and calibrate the coefficients.
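
Sketched the same way (assuming, just for illustration, that each
coefficient varies linearly with a grid-size measure N), the design row
for the least-squares fit simply grows:

        # Assume each coefficient varies linearly with grid size N:
        #   Twall = (s00 + s01*N) + (s10 + s11*N)*I
        #         + ((p00 + p01*N) + (p10 + p11*N)*I)/C
        # Still linear in its eight coefficients, so the same
        # least-squares fit applies, with runs varying C, I, and N.
        def design_row(C, I, N):
            return [1.0, N, I, N * I,
                    1.0 / C, N / C, I / C, N * I / C]

Each added dependence doubles the number of coefficients, so the minimum
number of calibration runs grows to match (eight here).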

These calibrations are, of course, subject to variation, and you should
only trust them for modest extrapolations. But this sort of scheme can
be useful for sizing a system for various classes of problems and
getting some idea, say within 20% to 40%, of the run times required
with various combinations of CPUs.

Perfect scaling is achieved when Ts=0.

John Bussoletti

-----Original Message-----
From: Richard Brown [mailto:richardlbj_at_[hidden]]
Sent: Monday, May 24, 2004 7:19 AM
To: lam_at_[hidden]
Subject: LAM: Looking for demo CFD code

Hi, everyone:

I am looking for a free/demo/eval CFD code to demonstrate
LAM and a Linux cluster. I have tried nast3dgp and ccse;
both run slower as more CPUs are used (that's not what
I want to show). duns had too much trouble even
getting compiled. Any information would be appreciated.

Thanks,
Richard

        
                
_______________________________________________
This list is archived at http://www.lam-mpi.org/MailArchives/lam/