
LAM/MPI General User's Mailing List Archives


From: Tim Prince (tprince_at_[hidden])
Date: 2005-06-15 21:06:07


Arvind Gopu wrote:

>On 2005-06-15 05:50 (-0700), Tim Prince had pondered:
>
>
>
>>>It's another issue, though, that the performance takes a big hit when I
>>>use both processors on the same node. That behavior is common to LAM and
>>>MPICH (on our system) and we've thought about a whole bunch of possible
>>>reasons.
>>>
>>>
>>>
>>Are you raising a new topic, with a top post on a thread? It's entirely
>>possible for the first process on a node to tie up the memory system, leaving
>>no gain for the 2nd process. If you don't have enough RAM, or (since yours
>>appears to be a 32-bit system) enough address space for 2 processes, this is
>>nearly guaranteed.
>>
>>
>>
>
>Tim - Well, I thought I'd make a rough mention of it. But yeah, it's not
>related to John M's problem; it's an entirely different issue. And no,
>it's not a case of our code being memory/address-space starved. We've
>discussed this some (with my colleagues, etc.) and come up with
>various possible reasons, including the one you've mentioned.
>
>Here's a note I sent to Jeff yesterday - should've CCed this group on it,
>my bad! (Adding to it... Profiling the code did not help; I didn't look at
>cache misses and stuff like that though)
>
>
>+++++
>
>On 2005-06-14 13:06 (-0500), Arvind Gopu had pondered:
>
>
>
>>Hello Jeff-
>>
>>Thought I'd ping you with a follow-up to my note sent to John at Caltech.
>>
>>As I mentioned, I've noticed a dramatic decrease in performance when I
>>use both processors in a 2-proc node. And we're talking about a simple MPI
>>(toy) program that has a message-passing component. And it's on AVIDD;
>>there's not much change in behavior whether I use Myrinet or Ethernet,
>>static or dynamic linking, etc.
>>
>>Let's say the serial program takes 4 mins: running the parallel code on 4
>>processors on 4 different nodes takes 1 min, whereas running on 4
>>processors on 2 nodes takes almost 2 mins. I tried searching for similar
>>experiences. I did find one relevant MPICH/Myrinet webpage where they
>>talked about mem-copy issues and advised using Myrinet for intra-node
>>communication too, but that did not solve the problem (with MPICH, nor
>>with LAM when I tried different SSI RPIs).
>>
>>I've discussed this with a few people within UITS and we could think of
>>cache-thrashing type issues, bus-related limitations, IO pipe bottlenecks,
>>etc. (and a few other things I can't remember off the top of my head). But
>>I'm not sure if we're missing something more important or something obvious.
>>
>>If I had to explain to Joe User who asks (someone did) "why is my almost
>>trivially parallelizable program not showing the expected speedup?", would
>>this (silly) analogy make sense: "Subway might have a bunch of people
>>working, more than one knife, etc., but since they have only one toaster,
>>it might take longer if you want your sub toasted" :-)
>>
>>Sorry about that silly analogy, but a bit of humor usually does not hurt!
>>
>>
>>

If Myrinet makes this much difference to performance on fewer than 12
nodes, you are spending excessive time on communications. I haven't
heard of it being done with lam-mpi, but use of the communication
capabilities of on-board Ethernet cards has been reported to give
performance equal to Myrinet's, without the reliability problems, on
clusters of 6 or more nodes. InfiniBand should give better performance
and reliability at lower cost, but it shouldn't be needed on small
clusters.
We have seen memory-system-dependent problems on single nodes, where
the 2nd lam-mpi process gains no performance. In over-simplified
terms, it is possible for a single process to use up all of the
effective memory bandwidth. This can happen even with simple memset()
operations; for such operations, on Intel CPUs, performance might be
gained by disabling hardware prefetch, if it were feasible to do so
just for that operation.
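A quick way to check that on your nodes is a toy along these lines. This
is only a rough sketch of my own (the name bw_probe, the buffer size and
the repetition count are arbitrary), assuming mpicc and a working LAM
install. Each rank memsets a buffer much larger than cache and reports
its own throughput; if the per-rank figure drops sharply going from 1 to
2 ranks on one node, the memory system is the bottleneck.

/* bw_probe.c -- rough memory-bandwidth probe (name and sizes made up).
 * Build:  mpicc -O2 -o bw_probe bw_probe.c
 * Run:    mpirun -np 1 bw_probe    then    mpirun -np 2 bw_probe
 * Each rank memsets a 64 MB buffer repeatedly and reports MB/s. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(int argc, char **argv)
{
    const size_t bytes = 64 * 1024 * 1024;   /* well past any cache */
    const int reps = 50;
    int rank, i;
    char *buf;
    double t0, t1, mbps;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    buf = malloc(bytes);
    if (buf == NULL) {
        fprintf(stderr, "rank %d: malloc failed\n", rank);
        MPI_Abort(MPI_COMM_WORLD, 1);
    }

    MPI_Barrier(MPI_COMM_WORLD);             /* start all ranks together */
    t0 = MPI_Wtime();
    for (i = 0; i < reps; i++)
        memset(buf, i & 0xff, bytes);
    t1 = MPI_Wtime();

    mbps = (double)bytes * reps / (1024.0 * 1024.0) / (t1 - t0);
    printf("rank %d: %.0f MB/s memset\n", rank, mbps);

    free(buf);
    MPI_Finalize();
    return 0;
}

If both ranks still report close to the single-rank number, the memory
system probably isn't the limit and I'd look elsewhere.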
gprof profiling of a lam-mpi run with 1 and 2 processes should show
which functions do and do not get satisfactory performance gains. From
there, you could make educated guesses about the reasons, or use a
hardware event profiler such as oprofile or VTune. It's likely not to
be specifically an MPI problem.
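For the gprof comparison, something along these lines would do; again
only a sketch under my own assumptions (GNU toolchain, mpicc -pg, a
made-up file name and kernel), with the build, run and profile commands
in the header comment:

/* toy_scaling.c -- skeleton of a compute + message-passing toy for the
 * gprof comparison (file name, sizes and kernel are placeholders).
 * Build:  mpicc -pg -O2 -o toy_scaling toy_scaling.c
 * Run:    mpirun -np 1 toy_scaling ; gprof toy_scaling gmon.out > prof.1
 *         mpirun -np 2 toy_scaling ; gprof toy_scaling gmon.out > prof.2
 * With several ranks in one directory the gmon.out files overwrite each
 * other; set GMON_OUT_PREFIX (where glibc honors it) or run each rank in
 * its own directory.  Compare the flat profiles for 1 vs. 2 processes. */
#include <mpi.h>
#include <stdio.h>

#define N (4 * 1024 * 1024)

static double a[N], b[N];

/* memory-bound kernel: streams through two large arrays */
static void compute(void)
{
    int i;
    for (i = 0; i < N; i++)
        a[i] = 0.5 * a[i] + b[i];
}

/* minimal message-passing component: pass a small halo around a ring */
static void exchange(int rank, int size)
{
    double halo[1024];
    int right = (rank + 1) % size;
    int left  = (rank + size - 1) % size;
    MPI_Status st;
    MPI_Sendrecv(a, 1024, MPI_DOUBLE, right, 0,
                 halo, 1024, MPI_DOUBLE, left, 0,
                 MPI_COMM_WORLD, &st);
}

int main(int argc, char **argv)
{
    int rank, size, iter;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    for (iter = 0; iter < 200; iter++) {
        compute();
        if (size > 1)
            exchange(rank, size);
    }

    if (rank == 0)
        printf("done\n");
    MPI_Finalize();
    return 0;
}

If the flat profile shows compute() taking roughly twice as long per call
with 2 ranks on one node, that points back at the memory system rather
than at the MPI layer.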