On Nov 21, 2005, at 6:50 AM, Angel Tsankov wrote:
> Yesterday, I ran a program to solve a linear system of equations using
> the CG method. I ran the program several times, each time solving a
> bigger system. I noticed that little systems are solved faster on two
> processors in the same node than on two processors in different nodes.
> This should come as no surprise, since shared memory is used for
> intra-node communications. However, large systems are solved faster on
> two processors in different nodes (communicating over 100BASE-T local
> area Ethernet) than on two processors in the same node. This did
> somewhat surprise me, although in the case of Ethernet communications
> there is significant overlapping of computation and communication.
> The volume of data transferred in either direction on each CG
> iteration is:
> - 1K x 8 B = 8 KB for a middle-sized system; in this case the running
>   times are roughly the same whether shared memory or Ethernet is used;
> - 16K x 8 B = 128 KB for the largest system.
There can be a lot of reasons for this -- the general rule of thumb is
"every application is different."
How much memory is your application using? If the sum of the memory
used by your two processes exceeds the amount of available physical
memory, you can cause performance degradation (i.e., the cost of
virtual memory swapping can outweigh the gains of faster communication
via shared memory). Although I certainly can't say for sure that this
is what is happening, it is a relatively common cause.
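As a back-of-the-envelope check, you can estimate whether two ranks on one node would exceed physical memory. The sketch below is hypothetical and not from this thread: it assumes a dense N x N double-precision matrix split row-wise across ranks plus a handful of length-N work vectors per rank (the actual solver may store the matrix differently, e.g. sparse).

```python
# Hypothetical sketch: would two CG ranks on one node exceed physical RAM?
# Assumes dense N x N storage in doubles, split row-wise across ranks,
# plus a few length-N work vectors (x, r, p, Ap, b) per rank.

def cg_footprint_bytes(n, ranks=2, work_vectors=5):
    """Per-rank bytes: local matrix slice (n/ranks rows of n doubles)
    plus work_vectors vectors of n doubles each."""
    matrix = (n // ranks) * n * 8
    vectors = work_vectors * n * 8
    return matrix + vectors

def fits_in_ram(n, ranks, phys_bytes):
    """True if all ranks together fit in physical memory."""
    return ranks * cg_footprint_bytes(n, ranks) <= phys_bytes

# Example: 16K unknowns (dense) on a node with 1 GiB of RAM --
# each rank alone already needs about 1 GiB, so two ranks won't fit.
print(fits_in_ram(16 * 1024, ranks=2, phys_bytes=1 << 30))  # False
```

If the combined footprint is near or above physical memory, splitting the two ranks across nodes halves each node's load, which can more than pay for the slower interconnect.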
Does this help?
--
{+} Jeff Squyres
{+} The Open MPI Project
{+} http://www.open-mpi.org/