On 2005-06-16 07:20 (-0400), Jeff Squyres had pondered:
> On Jun 15, 2005, at 10:06 PM, Tim Prince wrote:
>
> >>> [lots snipped]
> >>> As I mentioned, I've noticed a dramatic decrease in performance when I
> >>> use both processors in a 2-proc node. And we're talking: a simple MPI
> >>> (toy) program that has a message passing component. And it's on AVIDD..
> >>> not much change in behavior if I use Myrinet/Ethernet; Static/Dynamic
> >>> linking; etc.
> >>>
> >>> Let's say the serial program takes 4 mins..running the parallel code on
> >>> 4 processors on 4 different nodes takes 1 min, whereas running on 4
> >>> processors on 2 nodes takes almost 2 mins.
> > [lots snipped]
> > In over-simplified terms, it is possible for a single process to use up
> > all the effective memory bandwidth. It may happen even with simple
> > memset() operations. In such operations, on Intel CPUs, performance
> > might be gained by disabling hardware prefetch, if it were feasible to
> > do so just for that operation.
>
> Just to echo this -- it does sound like you're running into limitations
> of the node somehow. A simple thing to check is just to look at the
> process size of each of these 4 processors -- is twice that size
> exceeding physical memory?
Jeff -
When I was doing those tests, I tried all kinds of problem sizes, and most
of the time the problem size did not exceed physical memory (yet the
performance hit was still there).
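For anyone who wants to repeat the memory check, here is a minimal sketch in Python. It is only an illustration, not the script used for the tests above: the function names and the 1 GiB per-process figure are made up, and the sysconf names assume a Linux-like node.

```python
import os


def physical_memory_bytes():
    """Total physical memory of this node, in bytes.

    Assumes a Linux-like system where these sysconf names exist.
    """
    return os.sysconf("SC_PHYS_PAGES") * os.sysconf("SC_PAGE_SIZE")


def fits_in_memory(per_process_bytes, procs_per_node, physical_bytes):
    """True if all processes placed on one node fit in its physical memory."""
    return per_process_bytes * procs_per_node <= physical_bytes


if __name__ == "__main__":
    GiB = 1024 ** 3
    per_process = 1 * GiB  # hypothetical process size; read the real one from top/ps
    total = physical_memory_bytes()
    for procs in (1, 2):
        verdict = "fits" if fits_in_memory(per_process, procs, total) else "does NOT fit"
        print(f"{procs} process(es) per node: {verdict} in {total / GiB:.1f} GiB")
```

If the 2-per-node case fits comfortably and the slowdown is still there, swapping is an unlikely explanation.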
> A better test of this might be to remove the message passing component
> (e.g., instead of having your data inputs come from MPI_RECV, hard code
> them, or have them be read from a file, or something equivalent). This
> will remove MPI and message passing from the test scenario. You should
> be able to run your same test and see what happens (timing of 4 procs
> on 4 nodes vs. 4 procs on 2 dual SMP nodes). If you are running into a
> node limitation, you'll see the same performance characteristics (~1
> min for the 4 node run, ~2 mins for the 2 node run).
Oh, I did this too - but to no avail. That's why we pretty much concluded
that it was a node-level limitation -- memory bandwidth or the like.
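A rough way to probe the memory-bandwidth theory without any MPI at all is a standalone memory-bound kernel. The sketch below is an assumption-laden stand-in, not the toy program discussed in this thread: the names time_memory_sweep and bandwidth_gb_per_s and the 256 MiB buffer size are invented for illustration. The copy itself runs in C inside the interpreter, so Python overhead should be small, but a plain C memcpy loop would be closer to the real code.

```python
import time


def time_memory_sweep(n_bytes=256 * 1024 * 1024, repeats=5):
    """Best wall-clock time (seconds) for copying one n_bytes buffer into another."""
    src = bytearray(n_bytes)  # zero-filled source buffer
    dst = bytearray(n_bytes)  # destination buffer of the same size
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        dst[:] = src  # same-length slice assignment: a bulk in-place copy
        best = min(best, time.perf_counter() - start)
    return best


def bandwidth_gb_per_s(n_bytes, seconds):
    """Approximate traffic in GB/s, counting one read and one write per byte."""
    return 2 * n_bytes / seconds / 1e9


if __name__ == "__main__":
    n = 256 * 1024 * 1024
    t = time_memory_sweep(n)
    print(f"best copy time: {t:.4f} s, ~{bandwidth_gb_per_s(n, t):.2f} GB/s")
```

Run one copy alone on a node, then two copies at the same time on the same dual-processor node. If the per-copy time goes up noticeably in the second case, the two processes are competing for memory bandwidth, which would match the ~1 min vs ~2 min pattern above.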
> And other tools like gprof will definitely help as well.
Yeah, I'll give it a shot when I can find time.
Thanks for your responses, Tim and Jeff!
Cheers, Arvind
PS: It's just a bit harder to explain to AVIDD users who ask me why they
should parallelize their code :)
_____________________________________________________________________
Arvind Gopu | High Performance Computing Group| (UITS-RAC-HPC) @ IU
HPC website: http://www.indiana.edu/~rac/hpc | Work: (812) 856-0187
My website: http://cs.indiana.edu/~agopu | Cell: (812) 361-4054