LAM/MPI General User's Mailing List Archives


From: Anthony J. Ciani (aciani1_at_[hidden])
Date: 2005-04-20 12:23:54


Hello Users,

An SMP Opteron system IS a shared memory system. There is no particular
difference visible to user-level programs, and thus nothing the LAM
developers can do about it, with one exception: on SGI's shared memory
machines there is something like CPU ownership of memory, and it might be
possible to implement something similar for Opterons so that NUMA
placement works properly. Of course, the kernel is supposed to spread the
ownership out on its own (if NUMA is enabled).
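
At the user level you can approximate this kind of placement by hand. A
rough sketch, assuming the numactl utility is installed (the application
name is a placeholder, and exact option spellings vary between numactl
versions):

    # Spread this process's memory allocations round-robin over all nodes
    $ numactl --interleave=all ./my_app

    # Or pin a process to node 0's CPUs and node 0's memory
    $ numactl --cpunodebind=0 --membind=0 ./my_app

Interleaving is a quick experiment: if it noticeably improves your
scaling, the default placement was piling everything onto one node. For an
MPI job you would wrap each process, e.g. with a small script that picks a
node per rank; I have not tested how cleanly that passes through LAM's
mpirun.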

At the kernel level, make certain that you have enabled NUMA (Non-Uniform
Memory Access) support. If you are seeing such poor scaling with NUMA
enabled, then you may want to contact the maintainer of that kernel
feature.
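
To check whether the running kernel was built with NUMA support (the paths
below are the usual Linux locations; adjust for your distribution):

    # Look for CONFIG_NUMA (and CONFIG_K8_NUMA on Opterons) in the kernel config
    $ grep -i numa /boot/config-$(uname -r)

    # The kernel also reports the NUMA nodes it found at boot
    $ dmesg | grep -i numa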

At the hardware level, make certain that your main memory is evenly
distributed amongst your CPUs (more precisely, their embedded memory
controllers). This is probably the most critical part for Opterons. If
the memory is not evenly distributed, then CPUs will be forced to share
memory bandwidth, and scaling will suffer.
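
You can see how much memory the kernel found on each node with numactl
(assuming the numactl package is installed):

    # Prints one entry per node with its CPUs and memory size
    $ numactl --hardware

If one node reports nearly all of the memory and the others report little
or none, the DIMMs are not populated evenly across the memory controllers.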

Your numbers indicate that all of the memory is being allocated on a
single CPU, either because it isn't evenly distributed, or because NUMA
support is not enabled in your kernel.
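
You can confirm this while the job is running (assuming a 2.6 kernel with
sysfs mounted and the numactl tools installed):

    # Per-node memory usage
    $ cat /sys/devices/system/node/node*/meminfo

    # Per-node allocation counters; large numa_miss / numa_foreign counts
    # mean processes are allocating memory from remote nodes
    $ numastat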

On Sat, 16 Apr 2005, Eugene de Villiers wrote:
> Hello all,
>
> I have been testing the performance of an IWILL H8501 8-way opteron system
> and have been getting some surprisingly bad results. Running the same code on
> different numbers of nodes produces the following speed-ups:
> 2 cpus - 2.25 (yes, super-linear)
> 4 cpus - 3.3
> 8 cpus - 5.5
>
> After the 2-node run I was quite impressed, but the latter two calculations
> put a damper on things, especially considering that we have seen super-linear
> scaling on the Infiniband opteron cluster out to 16 cpus using similar
> software. I ran all the tests using both LAM/MPI-7.1.1 (*) and mpich-1.2.6
> with shmem enabled (the LAM setup wins by about 5%). The inter-process
> communications load in all the above cases would have been fairly similar, so
> my guess is the large difference in performance is probably down to a
> combination of Hyper Transport limitations and shared memory being a not
> quite accurate description of the opteron architecture.
>
> 1. Has anyone else encountered this problem?
> 2. Are there SSI settings I can change or workarounds to improve the
> situation?
> 3. Would a hyper transport specific RPI improve matters or is this more
> likely to be a capacity/latency issue?
>
> I just find it hard to swallow that a 6.4 GB/s 800 MHz interconnect produces
> worse scaling than gigabit ethernet. I would be glad to perform and post more
> tests if it would help.
>
> Thanks for your time,
> Eugene
>
> * ./configure \
> --prefix=$LAM_ARCH_PATH \
> --enable-shared \
> --disable-static \
> --without-romio \
> --without-mpi2cpp \
> --without-profiling \
> --without-fc
>
>
>

------------------------------------------------------------
               Anthony Ciani (aciani1_at_[hidden])
            Computational Condensed Matter Physics
    Department of Physics, University of Illinois, Chicago
               http://ciani.phy.uic.edu/~tony
------------------------------------------------------------