
LAM/MPI General User's Mailing List Archives


From: Brian Barrett (brbarret_at_[hidden])
Date: 2004-01-06 12:31:33


On Jan 6, 2004, at 9:03 AM, jess michelsen wrote:

> However, when I increase # CPUs from 42 to 84, the job stalls
> completely. Time increases from around 60 seconds to almost 13,000
> seconds (!). A closer look at the times for individual parts of the job
> reveals that a limited number of calls (approximately 120 calls) to
> MPI_ALLGATHERV is responsible for the entire growth of time
> consumption.
> I double-checked this conclusion by leaving out these calls (this
> changes the computed results slightly), and the time was again around
> 60 seconds.

Unfortunately, MPI_ALLGATHERV is a rather expensive operation: each
processor sends every other processor those 43 KB of data. So while you
only doubled the number of nodes, you drastically increased the amount
of data going out on the network. The MPI_ALL* functions are always
going to be expensive, so you may want to see if there is a way to
remove those calls from your program's inner loops. If you can factor
your application so that data is only sent to nearest neighbors or
something like that, you will find your application scales much
better - global operations just don't scale :(.

We are working on the LAM collective operations to improve performance,
especially on large numbers of nodes and on SMP machines. LAM 7.0
provided better performance for many of the basic collectives. The
more complex operations are on their way :).

Hope this helps,

Brian

-- 
   Brian Barrett
   LAM/MPI developer and all around nice guy
   Have a LAM/MPI day: http://www.lam-mpi.org/