Hello!
I'm using LAM/MPI 7.1.1 on a cluster of four dual-processor G4
PowerPCs running Yellow Dog Linux. LAM has been configured as follows:
LAM/MPI: 7.1.1
Prefix: /home/angel/Install/LAM
Architecture: powerpc-unknown-linux-gnu
Configured by: angel
Configured on: Sat Jul 2 11:05:30 EEST 2005
Configure host: parmac3.bas.bg
Memory manager: ptmalloc2
C bindings: yes
C++ bindings: yes
Fortran bindings: yes
C compiler: gcc
C++ compiler: g++
Fortran compiler: g77
Fortran symbols: double_underscore
C profiling: yes
C++ profiling: yes
Fortran profiling: yes
C++ exceptions: yes
Thread support: yes
ROMIO support: yes
IMPI support: no
Debug support: no
Purify clean: no
SSI boot: globus (API v1.1, Module v0.6)
SSI boot: rsh (API v1.1, Module v1.1)
SSI boot: slurm (API v1.1, Module v1.0)
SSI boot: tm (API v1.1, Module v1.1)
SSI coll: lam_basic (API v1.1, Module v7.1)
SSI coll: shmem (API v1.1, Module v1.0)
SSI coll: smp (API v1.1, Module v1.2)
SSI rpi: crtcp (API v1.1, Module v1.1)
SSI rpi: lamd (API v1.0, Module v7.1)
SSI rpi: sysv (API v1.0, Module v7.1)
SSI rpi: tcp (API v1.0, Module v7.1)
SSI rpi: usysv (API v1.0, Module v7.1)
SSI cr: self (API v1.0, Module v1.0)
Yesterday, I ran a program that solves a linear system of equations
with the CG method. I ran it several times, each time on a larger
system. I noticed that small systems are solved faster on two
processors in the same node than on two processors in different nodes.
This is no surprise, since shared memory is used for intra-node
communication. However, large systems are solved faster on two
processors in different nodes (communicating over 100BASE-T Ethernet)
than on two processors in the same node. This did surprise me
somewhat, although with Ethernet communication there is significant
overlap of computation and communication.
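One thing I suppose I could try is forcing the RPI module on the
mpirun command line, so that the same pair of processors on one node
is timed over both transports. If I read the LAM 7.1 SSI
documentation correctly (the module names below come from the laminfo
output above, and ./cg is just a stand-in for my solver binary), that
would be something like:

    mpirun -ssi rpi usysv -np 2 ./cg     (shared-memory RPI)
    mpirun -ssi rpi tcp   -np 2 ./cg     (TCP RPI, even within one node)

Please correct me if the placement flags need to be different for my
boot schema.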
The volume of data transferred in each direction per CG iteration is:
- 1K doubles x 8 bytes = 8 KB for a medium-sized system; at this size
the running times are roughly the same whether shared memory or
Ethernet is used;
- 16K doubles x 8 bytes = 128 KB for the largest system.
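To separate raw transfer time from the solver itself, I am thinking of
timing a plain ping-pong at exactly these two message sizes. This is
only a minimal sketch, not my CG code; the buffer contents and the
iteration count are arbitrary:

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

/* Ping-pong between ranks 0 and 1 at the two message sizes used
   per CG iteration: 1K doubles (8 KB) and 16K doubles (128 KB). */
int main(int argc, char **argv)
{
    int rank, size, i, k;
    int counts[2] = { 1024, 16384 };   /* doubles per message */
    int iters = 200;
    double *buf, t0, t1;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    if (size < 2) {
        if (rank == 0) fprintf(stderr, "needs at least 2 processes\n");
        MPI_Finalize();
        return 1;
    }

    buf = (double *) malloc(counts[1] * sizeof(double));
    for (i = 0; i < counts[1]; i++) buf[i] = 0.0;

    for (k = 0; k < 2; k++) {
        int n = counts[k];
        MPI_Barrier(MPI_COMM_WORLD);
        t0 = MPI_Wtime();
        for (i = 0; i < iters; i++) {
            if (rank == 0) {
                MPI_Send(buf, n, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, n, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD, &status);
            } else if (rank == 1) {
                MPI_Recv(buf, n, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, &status);
                MPI_Send(buf, n, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);
            }
        }
        t1 = MPI_Wtime();
        if (rank == 0)
            printf("%d bytes: %.1f microseconds per one-way message\n",
                   (int)(n * sizeof(double)),
                   (t1 - t0) * 1e6 / (2.0 * iters));
    }

    free(buf);
    MPI_Finalize();
    return 0;
}

Comparing these numbers under the shared-memory and TCP RPIs should
show whether the raw shared-memory path on the G4s really falls behind
100 Mbit Ethernet at 128 KB, or whether the difference comes from the
overlap in my solver.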
Can someone help me explain this anomaly?