Hello all,
I have been testing the performance of an IWILL H8501 8-way Opteron
system and have been getting some surprisingly bad results. Running the
same code on different numbers of cpus produces the following speed-ups:
2 cpus - 2.25 (yes, super-linear)
4 cpus - 3.3
8 cpus - 5.5
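In terms of parallel efficiency that works out to roughly 113% on 2
cpus, 83% on 4 cpus and 69% on 8 cpus.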
After the 2-cpu run I was quite impressed, but the latter two
results put a damper on things, especially considering that we have
seen super-linear scaling on the InfiniBand Opteron cluster out to 16
cpus using similar software. I ran all the tests using both
LAM/MPI-7.1.1 (*) and mpich-1.2.6 with shmem enabled (the LAM setup wins
by about 5%). The inter-process communication load in all the above
cases would have been fairly similar, so my guess is that the large
difference in performance comes down to a combination of HyperTransport
limitations and the fact that "shared memory" is not an entirely
accurate description of the Opteron architecture, which is NUMA
underneath.
1. Has anyone else encountered this problem?
2. Are there SSI settings I can change (for example forcing a
particular rpi module, as sketched below the questions) or workarounds
to improve the situation?
3. Would a HyperTransport-specific RPI improve matters, or is this more
likely to be a capacity/latency issue?
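To be concrete about question 2, the sort of thing I have in mind is
selecting the shared-memory rpi module explicitly, along these lines
(hostfile and ./myapp are just placeholders for my actual boot schema
and binary):

lamboot hostfile
mpirun -ssi rpi usysv -np 8 ./myapp
# or "-ssi rpi sysv" to use the SysV-semaphore-based module instead

If there are other SSI parameters worth tuning on this sort of machine
I would be happy to try them.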
I just find it hard to swallow that a 6.4GB/s, 800MHz interconnect
produces worse scaling than gigabit ethernet. I would be glad to perform
and post more tests if it would help.
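For instance, one test I could run next is pinning each rank to its own
core to rule out bad process placement, with a small wrapper along these
lines (this assumes LAM exports the rank in LAMRANK; if it does not, I
can pick the rank up some other way):

#!/bin/sh
# pin.sh - bind each MPI process to the core matching its rank
exec taskset -c $LAMRANK "$@"

and then launching with "mpirun -np 8 pin.sh ./myapp".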
Thanks for your time,
Eugene
* ./configure \
--prefix=$LAM_ARCH_PATH \
--enable-shared \
--disable-static \
--without-romio \
--without-mpi2cpp \
--without-profiling \
--without-fc
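(i.e. shared libraries only, with ROMIO, the C++ bindings, the
profiling layer and the Fortran bindings left out)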