LAM/MPI General User's Mailing List Archives


From: Brian Barrett (brbarret_at_[hidden])
Date: 2005-03-30 13:29:16


On Mar 30, 2005, at 12:48 PM, Kumar, Ravi Ranjan wrote:
>
> Results from the parallel code agree quite well with the results
> from the serial code. However, the scalability is poor. When I
> increase the number of processes or nodes, there is not much
> improvement in the turnaround time. The serial code takes 225
> seconds, whereas for the same conditions the parallel code with:
>
> 3 processes on a single node takes 167 seconds,
> 3 processes on 3 nodes takes 115 seconds,
> 4 processes on 4 nodes takes 111 seconds,
> 10 processes on 5 nodes takes 88 seconds.
>
> Is this much scalability acceptable? I was expecting better
> performance, at least 10 times. Is there any way I can improve the
> speedup? I am using LAM-MPI on a Linux (Linux k00 2.4.17) cluster
> connected via Ethernet (TCP/IP - I do not know much about
> networking). Below are the cluster details:

On all but the most trivial programs (and yours is not trivial), you
should not expect linear speedup. Amdahl's law (http://www.google.com/search?q=amdahl's+law)
pretty much guarantees that the serial fraction of your code limits
the speedup you can get, so linear speedup is out of reach (barring
things like randomized algorithms or cache effects, which can
occasionally produce superlinear behavior). So a 10x speedup is out
of the question.
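
To put rough (made-up) numbers on it: if 90% of your runtime
parallelizes perfectly and 10% stays serial, Amdahl's law gives

   speedup(N) = 1 / ((1 - P) + P/N) = 1 / (0.1 + 0.9/10) ~= 5.3

on N = 10 processes, no matter how fast the network is. The 90%
figure is just an illustration, not a measurement of your code.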

With 10 processes, you are only seeing about a 2.5x speedup
(225 s / 88 s), which is pretty bad. I doubt your code has enough
serial sections to explain that on its own. It does, however, appear
to have a lot of communication overhead, which can drastically reduce
scalability. The one that pops right out is the MPI_Allreduce() in
your inner loop. Allreduce is one of the most expensive operations in
MPI - it is essentially a gather, some math, and then a broadcast.
Short of ignoring strong recommendations in the MPI standard, there
is not much an implementation can do to optimize the operation,
especially for floating-point data. Depending on your algorithm, it
may be cheaper to run some small number of iterations between
convergence checks (see the sketch below), or to find a more
distributed way to determine convergence. Whether that pays off
depends on how long the rest of your inner loop takes. The other
communication that you don't show in your pseudo-code may also be
part of the problem. I wouldn't recommend any changes to your code
until you know where your performance problems are actually coming
from.
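
As a concrete (but hypothetical) example of the "check every few
iterations" idea - the function and variable names below are made up,
since I only have your pseudo-code to go on - something along these
lines trades slightly later detection of convergence for far fewer
Allreduce calls:

    #include <mpi.h>

    /* Placeholders for whatever your real inner loop does. */
    void do_one_iteration(void);        /* advance the solution    */
    double local_residual(void);        /* this rank's residual    */

    void solve(double tolerance)
    {
        const int check_interval = 10;  /* tune for your problem   */
        int iter = 0, converged = 0;

        while (!converged) {
            do_one_iteration();
            ++iter;

            /* Pay for the Allreduce once every check_interval
               iterations instead of once per iteration. */
            if (iter % check_interval == 0) {
                double local = local_residual(), global;
                MPI_Allreduce(&local, &global, 1, MPI_DOUBLE,
                              MPI_MAX, MPI_COMM_WORLD);
                converged = (global < tolerance);
            }
        }
    }

The cost is that you may run up to check_interval - 1 extra
iterations after you have already converged, which is usually cheap
compared to the Allreduce traffic you save over TCP.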

I would recommend using MPE from Argonne National Lab to generate a
trace of your MPI communication. The Jumpshot tool included with MPE
provides a nice way to visualize the communication overhead of your
application and can help you find the areas where you should focus
your optimization efforts.
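
If you want a quick first look before setting up MPE, even plain
MPI_Wtime() timers around the suspect calls will tell you something.
This is not the MPE API, just a minimal hand-rolled sketch (the
wrapper and variable names are made up):

    #include <mpi.h>
    #include <stdio.h>

    static double t_allreduce = 0.0;   /* time spent in Allreduce */

    /* Drop-in replacement for the Allreduce in the inner loop that
       accumulates how long each call takes on this rank. */
    void timed_allreduce(double *local, double *global)
    {
        double t0 = MPI_Wtime();
        MPI_Allreduce(local, global, 1, MPI_DOUBLE,
                      MPI_MAX, MPI_COMM_WORLD);
        t_allreduce += MPI_Wtime() - t0;
    }

    /* Call this just before MPI_Finalize(). */
    void report_timing(void)
    {
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        printf("rank %d spent %.2f s in MPI_Allreduce\n",
               rank, t_allreduce);
    }

If a large fraction of your 88 seconds shows up there, you know where
to start; MPE/Jumpshot will then give you the per-rank, per-call
picture.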

Hope this helps,

Brian

-- 
   Brian Barrett
   LAM/MPI developer and all around nice guy
   Have a LAM/MPI day: http://www.lam-mpi.org/