LAM/MPI General User's Mailing List Archives

From: John Robinson (jr_at_[hidden])
Date: 2005-11-21 08:20:29


Sounds like the bigger problem is hitting main memory. Two boxes have
higher total memory bandwidth than two CPUs sharing one bus, while the
smaller problems probably run from processor cache. Two quick tests
would confirm this; rough sketches below and after the quoted message.
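
For the memory side, a standalone triad loop in the spirit of the STREAM
benchmark shows the effect without MPI getting involved at all. This is
only a sketch, not taken from the CG program, and the array sizes are
guesses aimed at a G4's caches rather than anything measured:

/* Rough bandwidth probe, not from the CG program: times a STREAM-style
 * triad over a working set that either fits in cache or spills out to
 * main memory.  Sizes are guesses for a G4. */
#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>

static double now(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec + 1e-6 * tv.tv_usec;
}

static double triad_mbs(size_t n, int reps)
{
    double *a = malloc(n * sizeof *a);
    double *b = malloc(n * sizeof *b);
    double *c = malloc(n * sizeof *c);
    double t;
    size_t i;
    int r;

    for (i = 0; i < n; i++) { b[i] = 1.0; c[i] = 2.0; }

    t = now();
    for (r = 0; r < reps; r++)
        for (i = 0; i < n; i++)
            a[i] = b[i] + 3.0 * c[i];       /* three streams per element */
    t = now() - t;

    {
        volatile double sink = a[n / 2];    /* keep the loop from being optimized away */
        (void)sink;
    }
    free(a); free(b); free(c);

    /* bytes moved: 3 arrays x n doubles x reps */
    return 3.0 * n * sizeof(double) * reps / t / 1e6;
}

int main(void)
{
    /* ~96 KB total working set: should stay in cache */
    printf("small: %.0f MB/s\n", triad_mbs(4 * 1024, 4000));
    /* ~24 MB total working set: forced out to main memory */
    printf("large: %.0f MB/s\n", triad_mbs(1024 * 1024, 20));
    return 0;
}

Build it with plain gcc -O2 and compare the two rates. If main memory
really is the limit, running one copy on each node should hold up much
better than running two copies on one node.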

/jr

---
Angel Tsankov wrote:
> Hello!
> 
> I'm using LAM/MPI 7.1.1 on a cluster of four dual-processor G4 
> PowerPCs running Yellow Dog Linux. LAM has been configured as follows:
>              LAM/MPI: 7.1.1
>               Prefix: /home/angel/Install/LAM
>         Architecture: powerpc-unknown-linux-gnu
>        Configured by: angel
>        Configured on: Sat Jul  2 11:05:30 EEST 2005
>       Configure host: parmac3.bas.bg
>       Memory manager: ptmalloc2
>           C bindings: yes
>         C++ bindings: yes
>     Fortran bindings: yes
>           C compiler: gcc
>         C++ compiler: g++
>     Fortran compiler: g77
>      Fortran symbols: double_underscore
>          C profiling: yes
>        C++ profiling: yes
>    Fortran profiling: yes
>       C++ exceptions: yes
>       Thread support: yes
>        ROMIO support: yes
>         IMPI support: no
>        Debug support: no
>         Purify clean: no
>             SSI boot: globus (API v1.1, Module v0.6)
>             SSI boot: rsh (API v1.1, Module v1.1)
>             SSI boot: slurm (API v1.1, Module v1.0)
>             SSI boot: tm (API v1.1, Module v1.1)
>             SSI coll: lam_basic (API v1.1, Module v7.1)
>             SSI coll: shmem (API v1.1, Module v1.0)
>             SSI coll: smp (API v1.1, Module v1.2)
>              SSI rpi: crtcp (API v1.1, Module v1.1)
>              SSI rpi: lamd (API v1.0, Module v7.1)
>              SSI rpi: sysv (API v1.0, Module v7.1)
>              SSI rpi: tcp (API v1.0, Module v7.1)
>              SSI rpi: usysv (API v1.0, Module v7.1)
>               SSI cr: self (API v1.0, Module v1.0)
> 
> Yesterday, I ran a program that solves a linear system of equations 
> using the CG method. I ran the program several times, each time solving 
> a bigger system. I noticed that small systems are solved faster on two 
> processors in the same node than on two processors in different nodes. 
> This should come as no surprise, since shared memory is used for 
> intra-node communication. However, large systems are solved faster on 
> two processors in different nodes (communicating over 100BASE-T 
> Ethernet) than on two processors in the same node. This surprised me 
> somewhat, although in the Ethernet case there is significant overlap of 
> computation and communication. 
> The volume of data transferred in each direction on every CG iteration 
> is:
> 1K values x 8 bytes = 8 KB for a medium-sized system; in this case the 
> running times are roughly the same whether shared memory or Ethernet 
> is used;
> 16K values x 8 bytes = 128 KB for the largest system.
> 
> Can someone help me explain this anomaly? 
> 
> _______________________________________________
> This list is archived at http://www.lam-mpi.org/MailArchives/lam/
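
To separate the raw message-passing cost from the solver, a bare
ping-pong at the two message sizes quoted above (8 KB and 128 KB), run
once over each RPI, would also be telling. Again just a sketch with
made-up repetition counts, not the CG code; it assumes exactly two ranks
doing the exchange:

/* Ping-pong between ranks 0 and 1 at the two message sizes mentioned
 * above.  Sketch only -- not the CG code. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

static void pingpong(int rank, int nvals, int reps)
{
    double *buf = malloc(nvals * sizeof *buf);
    MPI_Status st;
    double t0, t1;
    int i;

    for (i = 0; i < nvals; i++)
        buf[i] = (double)i;

    MPI_Barrier(MPI_COMM_WORLD);
    t0 = MPI_Wtime();
    for (i = 0; i < reps; i++) {
        if (rank == 0) {
            MPI_Send(buf, nvals, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, nvals, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD, &st);
        } else if (rank == 1) {
            MPI_Recv(buf, nvals, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, &st);
            MPI_Send(buf, nvals, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);
        }
    }
    t1 = MPI_Wtime();

    if (rank == 0)
        printf("%7d bytes: %8.1f us one-way, %6.1f MB/s\n",
               (int)(nvals * sizeof(double)),
               (t1 - t0) / reps / 2.0 * 1e6,
               nvals * sizeof(double) / ((t1 - t0) / reps / 2.0) / 1e6);
    free(buf);
}

int main(int argc, char **argv)
{
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    pingpong(rank, 1024, 1000);    /* 1K values  x 8 bytes =   8 KB */
    pingpong(rank, 16384, 200);    /* 16K values x 8 bytes = 128 KB */

    MPI_Finalize();
    return 0;
}

Compile with mpicc and run it twice, selecting the transport on the
mpirun command line -- something like
"mpirun -ssi rpi usysv n0 n0 ./pingpong" for two CPUs in one node and
"mpirun -ssi rpi tcp n0 n1 ./pingpong" for one CPU per node. If the
128 KB exchange is not clearly cheaper over shared memory than over
100BASE-T, the interconnect is probably not what decides the
large-system runtimes.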