> On Nov 21, 2005, at 6:50 AM, Angel Tsankov wrote:
>
>> Yesterday, I ran a program to solve a linear system of equations using
>> the CG method. I ran the program several times, each time solving a
>> bigger system. I noticed that small systems are solved faster on two
>> processors in the same node than on two processors in different nodes.
>> This should come as no surprise, since shared memory is used for
>> intra-node communications. However, large systems are solved faster on
>> two processors in different nodes (communicating over 100BASE-T local
>> area Ethernet) than on two processors in the same node. This did
>> somewhat surprise me, although in the case of Ethernet communications
>> there is significant overlapping of computations and communications.
>> The volume of data transferred in either direction on each CG
>> iteration is:
>> 1K values x 8B = 8KB in the case of a middle-sized system; in this case
>> the running times are roughly the same no matter whether shared memory
>> or Ethernet is used;
>> 16K values x 8B = 128KB in the case with the largest system.
>
> There can be a lot of reasons for this -- the general rule of thumb is
> "every application is different."
>
> How much memory is your application using? If the sum of the memory
> used by your two processes exceeds the amount of available physical
> memory, you can cause performance degradation (i.e., the cost of
> virtual memory swapping can outweigh the gains of faster communication
> via shared memory). Although I certainly can't say for sure that this
> is what is happening, it is a relatively common cause.
>
> Does this help?
>
This is a good point, as is the one that John Robinson suggested.
However, the amount of memory used by the program (on a single CPU) is
as follows:
4-5MB for the small system
30-31MB for the middle-sized system
250-252MB for the large system
The amount of available physical memory per node (2 CPUs) is 512MB.
In fact, swapping was the first thing that came to my mind as
well. However, given these figures, I doubt that this is the reason.
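
As a rough sanity check on the swapping hypothesis, the figures above can be compared with the 512MB available per node. This is only a sketch: the function name and the two scenarios are assumptions, because the per-process footprint of the two-process runs is not given above. "split" assumes the data is divided between the two processes, so the total stays near the single-CPU figure; "replicated" is a pessimistic bound in which each process holds the full footprint. Memory used by the OS and by other applications is ignored.

```python
# Rough memory-headroom check for two solver processes on one node.
NODE_MEMORY_MB = 512  # physical memory per node (2 CPUs), from the figures above

# Upper ends of the single-CPU footprints quoted above, in MB.
SINGLE_CPU_FOOTPRINT_MB = {"small": 5, "middle": 31, "large": 252}


def headroom_mb(footprint_mb, scenario):
    """Return the physical memory (MB) left on the node under a scenario.

    "split":      the two processes share the problem, total ~= footprint_mb
    "replicated": each process holds the full footprint (pessimistic bound)
    Both scenarios are assumptions; neither has been measured.
    """
    if scenario == "split":
        total = footprint_mb
    elif scenario == "replicated":
        total = 2 * footprint_mb
    else:
        raise ValueError(f"unknown scenario: {scenario}")
    return NODE_MEMORY_MB - total


if __name__ == "__main__":
    for name, mb in SINGLE_CPU_FOOTPRINT_MB.items():
        print(name, headroom_mb(mb, "split"), headroom_mb(mb, "replicated"))
```

Under the "split" assumption the large system leaves about 260MB free. Only the pessimistic "replicated" bound comes close to the 512MB limit, so measuring the actual per-process footprint would settle the question.
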
It seems to me that other activities in the system could contribute to
the longer execution times when both copies of the program execute
on the same node. In this case both CPUs are busy executing not only
the CG solver but other applications as well. In the other case one of
the CPUs in each node executes the CG solver while the other CPU is
free to execute other applications.
I've noticed, however, that the activity of other applications is very
low. Moreover, the solver has been executed 5 times for each matrix
size (and each type of communication) and the times are almost the
same for each of the 5 runs, so I am confident that the measurements
are reproducible.
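
One way to put a number on "almost the same" is the relative spread of the repeated timings. The sketch below is illustrative only: the function names and the 5% tolerance are assumptions, and no measured times are included.

```python
from statistics import mean, stdev


def relative_spread(times_s):
    """Return the sample standard deviation divided by the mean."""
    if len(times_s) < 2:
        raise ValueError("need at least two timings")
    return stdev(times_s) / mean(times_s)


def is_repeatable(times_s, tolerance=0.05):
    """True if the relative spread is within the (assumed) tolerance."""
    return relative_spread(times_s) <= tolerance
```

It would be called once per matrix size and communication type, with the five wall-clock times of that configuration as the argument.
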
Another possible reason is that more communications are
overlapped with computations in the case of larger systems.
Of course, other ideas are welcome.
And finally, I just wonder if communication time depends on the amount
of data to be transferred when shared memory is used.
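
One way to frame this question is the usual latency-plus-bandwidth approximation, T(n) = latency + n / bandwidth. The sketch below applies it to the message sizes mentioned earlier in this thread. The function names are hypothetical, and reading the figures as 1024 and 16384 8-byte values is an assumption. The Ethernet rate is the nominal 100 Mbit/s with protocol overhead ignored. The latency and bandwidth of the shared-memory path are not known here and would have to be measured.

```python
BYTES_PER_VALUE = 8  # 8-byte values, as in the "x 8B" figures quoted above


def message_bytes(n_values):
    """Size in bytes of a message carrying n_values 8-byte values."""
    return n_values * BYTES_PER_VALUE


def transfer_time_s(n_bytes, latency_s, bandwidth_bytes_per_s):
    """Latency-plus-bandwidth model: T(n) = latency + n / bandwidth."""
    return latency_s + n_bytes / bandwidth_bytes_per_s


# Nominal 100BASE-T rate: 100 Mbit/s = 12.5e6 bytes/s (protocol overhead ignored).
ETHERNET_BYTES_PER_S = 100e6 / 8

if __name__ == "__main__":
    for n_values in (1024, 16 * 1024):
        n = message_bytes(n_values)
        # latency_s=0.0 isolates the size-dependent term of the model
        t = transfer_time_s(n, 0.0, ETHERNET_BYTES_PER_S)
        print(f"{n} bytes: {t * 1e3:.2f} ms on the wire at the nominal rate")
```

At the nominal rate the size-dependent term is about 0.66 ms for the 8KB message and about 10.5 ms for the 128KB message. The same function can be evaluated with measured shared-memory parameters to see how strongly that path depends on message size.
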
Angel