LAM/MPI General User's Mailing List Archives

From: Angel Tsankov (fn42551_at_[hidden])
Date: 2005-11-21 06:50:59


Hello!

I'm using LAM/MPI 7.1.1 on a cluster of four dual-processor G4
PowerPCs running Yellow Dog Linux. LAM has been configured as follows:
             LAM/MPI: 7.1.1
              Prefix: /home/angel/Install/LAM
        Architecture: powerpc-unknown-linux-gnu
       Configured by: angel
       Configured on: Sat Jul 2 11:05:30 EEST 2005
      Configure host: parmac3.bas.bg
      Memory manager: ptmalloc2
          C bindings: yes
        C++ bindings: yes
    Fortran bindings: yes
          C compiler: gcc
        C++ compiler: g++
    Fortran compiler: g77
     Fortran symbols: double_underscore
         C profiling: yes
       C++ profiling: yes
   Fortran profiling: yes
      C++ exceptions: yes
      Thread support: yes
       ROMIO support: yes
        IMPI support: no
       Debug support: no
        Purify clean: no
            SSI boot: globus (API v1.1, Module v0.6)
            SSI boot: rsh (API v1.1, Module v1.1)
            SSI boot: slurm (API v1.1, Module v1.0)
            SSI boot: tm (API v1.1, Module v1.1)
            SSI coll: lam_basic (API v1.1, Module v7.1)
            SSI coll: shmem (API v1.1, Module v1.0)
            SSI coll: smp (API v1.1, Module v1.2)
             SSI rpi: crtcp (API v1.1, Module v1.1)
             SSI rpi: lamd (API v1.0, Module v7.1)
             SSI rpi: sysv (API v1.0, Module v7.1)
             SSI rpi: tcp (API v1.0, Module v7.1)
             SSI rpi: usysv (API v1.0, Module v7.1)
              SSI cr: self (API v1.0, Module v1.0)

Yesterday, I ran a program that solves a linear system of equations
using the CG method. I ran the program several times, each time
solving a larger system. I noticed that small systems are solved
faster on two processors in the same node than on two processors in
different nodes. This is no surprise, since shared memory is used for
intra-node communication. However, large systems are solved faster on
two processors in different nodes (communicating over 100BASE-T
Ethernet) than on two processors in the same node. This did surprise
me somewhat, although in the Ethernet case there is significant
overlap of computation and communication.
The volume of data transferred in either direction on each CG
iteration is:
- 1K values of 8 bytes each = 8 KB for a medium-sized system; in this
  case the running times are roughly the same no matter whether shared
  memory or Ethernet is used;
- 16K values of 8 bytes each = 128 KB for the largest system.
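In case it helps narrow things down, below is a minimal ping-pong
sketch (not my actual solver; the 1K and 16K buffer sizes are simply
chosen to match the per-iteration message sizes above) that could be
used to time a single transfer of each size over whichever RPI is in
use:

/* pingpong.c -- time round trips of roughly the per-iteration CG
 * message sizes (8 KB and 128 KB) between exactly two ranks. */
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    const int sizes[] = { 1024, 16 * 1024 };  /* 8-byte values: 8 KB, 128 KB */
    const int reps = 1000;
    int rank, nprocs, s, i;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    if (nprocs != 2) {
        if (rank == 0)
            fprintf(stderr, "run with exactly 2 processes\n");
        MPI_Finalize();
        return 1;
    }

    for (s = 0; s < 2; ++s) {
        double *buf = malloc(sizes[s] * sizeof(double));
        double t0, t1;
        MPI_Status st;

        for (i = 0; i < sizes[s]; ++i)
            buf[i] = 0.0;

        MPI_Barrier(MPI_COMM_WORLD);
        t0 = MPI_Wtime();
        for (i = 0; i < reps; ++i) {
            if (rank == 0) {
                MPI_Send(buf, sizes[s], MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, sizes[s], MPI_DOUBLE, 1, 0, MPI_COMM_WORLD, &st);
            } else {
                MPI_Recv(buf, sizes[s], MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, &st);
                MPI_Send(buf, sizes[s], MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);
            }
        }
        t1 = MPI_Wtime();

        if (rank == 0)
            printf("%d bytes: %f us per one-way message\n",
                   (int)(sizes[s] * sizeof(double)),
                   (t1 - t0) / (2.0 * reps) * 1e6);
        free(buf);
    }

    MPI_Finalize();
    return 0;
}

If I read the SSI documentation correctly, the same binary can be
pinned to a particular transport on the mpirun command line, e.g.
"mpirun -ssi rpi usysv ..." versus "mpirun -ssi rpi tcp ...", which
should show whether the shared-memory RPI itself slows down at the
128 KB message size.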

Can someone help me explain this anomaly?