Sounds like the bigger problem is limited by main-memory bandwidth.
The two processors in one node share a single memory bus, so two boxes
have higher total memory bandwidth. The smaller problems probably run
from processor cache, where the shared bus hardly matters.
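
One quick way to check is a little bandwidth probe like the one below.
It is only a rough sketch I typed up, not a proper benchmark (see STREAM
for the real thing), and the array size and repetition count are guesses
picked to be well beyond the G4's caches. Run it once with both ranks on
one node and once with the ranks on two nodes (e.g. mpirun -np 2 ./membw
with a suitable boot schema); if the per-rank number roughly halves in
the single-node run, the shared memory bus is the bottleneck for your
big CG runs.

/* membw.c -- rough per-process main-memory bandwidth probe (a sketch,
 * not a real benchmark).  Compile with the mpicc wrapper:
 *     mpicc membw.c -o membw
 * The array size and repetition count below are guesses, chosen only
 * to be much larger than the G4's caches. */
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

#define N    (4 * 1024 * 1024)   /* 4M doubles = 32 MB per array */
#define REPS 10

int main(int argc, char **argv)
{
    int rank;
    long i, r;
    double *a, *b, t0, t1, rate;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    a = malloc(N * sizeof(double));
    b = malloc(N * sizeof(double));
    if (a == NULL || b == NULL) {
        fprintf(stderr, "rank %d: out of memory\n", rank);
        MPI_Abort(MPI_COMM_WORLD, 1);
    }
    for (i = 0; i < N; i++) { a[i] = 0.0; b[i] = 1.0; }

    MPI_Barrier(MPI_COMM_WORLD);          /* start both ranks together */
    t0 = MPI_Wtime();
    for (r = 0; r < REPS; r++)            /* plain copy: one read and  */
        for (i = 0; i < N; i++)           /* one write per element     */
            a[i] = b[i];
    t1 = MPI_Wtime();

    /* counted traffic: REPS * (read b + write a); write-allocate
     * traffic is ignored, so this underestimates the real load */
    rate = (double)REPS * 2.0 * N * sizeof(double) / (t1 - t0) / 1e6;
    printf("rank %d: about %.0f MB/s (a[0]=%g)\n", rank, rate, a[0]);

    free(a);
    free(b);
    MPI_Finalize();
    return 0;
}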
/jr
---
Angel Tsankov wrote:
> Hello!
>
> I'm using LAM/MPI 7.1.1 on a cluster of four dual-processor G4
> PowerPCs running Yellow Dog Linux. LAM has been configured as follows:
> LAM/MPI: 7.1.1
> Prefix: /home/angel/Install/LAM
> Architecture: powerpc-unknown-linux-gnu
> Configured by: angel
> Configured on: Sat Jul 2 11:05:30 EEST 2005
> Configure host: parmac3.bas.bg
> Memory manager: ptmalloc2
> C bindings: yes
> C++ bindings: yes
> Fortran bindings: yes
> C compiler: gcc
> C++ compiler: g++
> Fortran compiler: g77
> Fortran symbols: double_underscore
> C profiling: yes
> C++ profiling: yes
> Fortran profiling: yes
> C++ exceptions: yes
> Thread support: yes
> ROMIO support: yes
> IMPI support: no
> Debug support: no
> Purify clean: no
> SSI boot: globus (API v1.1, Module v0.6)
> SSI boot: rsh (API v1.1, Module v1.1)
> SSI boot: slurm (API v1.1, Module v1.0)
> SSI boot: tm (API v1.1, Module v1.1)
> SSI coll: lam_basic (API v1.1, Module v7.1)
> SSI coll: shmem (API v1.1, Module v1.0)
> SSI coll: smp (API v1.1, Module v1.2)
> SSI rpi: crtcp (API v1.1, Module v1.1)
> SSI rpi: lamd (API v1.0, Module v7.1)
> SSI rpi: sysv (API v1.0, Module v7.1)
> SSI rpi: tcp (API v1.0, Module v7.1)
> SSI rpi: usysv (API v1.0, Module v7.1)
> SSI cr: self (API v1.0, Module v1.0)
>
> Yesterday, I ran a program to solve a linear system of equations using
> the CG method. I ran the program several times, each time solving a
> bigger system. I noticed that small systems are solved faster on two
> processors in the same node than on two processors in different nodes.
> This should come as no surprise, since shared memory is used for
> intra-node communication. However, large systems are solved faster on
> two processors in different nodes (communicating over 100BASE-T
> Ethernet) than on two processors in the same node. This did surprise
> me somewhat, although in the Ethernet case there is significant
> overlap of computation and communication.
> The volume of data transferred in either direction on each CG
> iteration is:
> - about 1K values x 8 bytes = 8 KB for the medium-sized system; here
>   the running times are roughly the same whether shared memory or
>   Ethernet is used;
> - about 16K values x 8 bytes = 128 KB for the largest system.
>
> Can someone help me explain this anomaly?
>
> _______________________________________________
> This list is archived at http://www.lam-mpi.org/MailArchives/lam/
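
On the overlap with computation that Angel mentions for the Ethernet
runs: the usual pattern for that is to post the boundary exchange with
nonblocking calls, do the interior work, and only then wait (how much
really overlaps depends on the MPI implementation and the message
size). A rough sketch of the pattern is below; the buffer size, the
names, and the two-rank setup are made up for illustration and are not
taken from his CG code.

/* overlap.c -- sketch of overlapping a halo exchange with interior
 * work.  Everything here (HALO, the buffers, the two-rank setup) is
 * illustrative, not from the original CG program.
 * Run with exactly two ranks: mpirun -np 2 ./overlap */
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

#define HALO 16384                 /* ~16K doubles = 128 KB, the big case */

int main(int argc, char **argv)
{
    int rank, peer;
    double *sendbuf, *recvbuf;
    MPI_Request req[2];
    MPI_Status  st[2];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    peer = 1 - rank;               /* the other of the two ranks */

    sendbuf = calloc(HALO, sizeof(double));
    recvbuf = calloc(HALO, sizeof(double));

    /* post the exchange first ... */
    MPI_Irecv(recvbuf, HALO, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD, &req[0]);
    MPI_Isend(sendbuf, HALO, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD, &req[1]);

    /* ... then do the interior part of the matrix-vector product here,
     * i.e. everything that does not need the neighbour's values ... */

    MPI_Waitall(2, req, st);       /* ... and only then touch recvbuf */

    printf("rank %d: exchange of %lu bytes complete\n",
           rank, (unsigned long)(HALO * sizeof(double)));

    free(sendbuf);
    free(recvbuf);
    MPI_Finalize();
    return 0;
}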