On Wed, 4 Jun 2003, Peter McLachlan wrote:
> I'm trying to run HPL on a large cluster and I'm hitting performance
> problems I'm trying to track down. I'm looking for suggestions on
> probable cause for these problems.
>
> Our setup:
> I'm currently trying to run it across 196 dual processor x86 (P4)
> homogeneous nodes using LAM 6.5.9 compiled with gcc 3.2.2. HPL and ATLAS
> 3.4.1 were also compiled with gcc 3.2.2.
Just a quick note - you might want to try out the LAM/MPI 7.0 betas (soon
to be non-beta, we promise!). Much of what I'm going to describe below
can be tweaked at run-time rather than compile time, which will make your
life much easier. Heck, you could probably get away with doing your
tuning with the 7.0 beta and then using those parameters with 6.5.9
until 7.0 goes stable.
> The LAM, HPL, gcc & ATLAS libraries and executables are shared to the
> nodes over NFS mounts. All nodes are connected and communicate over
> gigabit ethernet links using a jumbo MTU of 9000.
So, I notice you mention using netperf to get TCP/UDP numbers. While this
gives you a decent idea of what your TCP stack is capable of doing, it
gives you no indication of what the MPI implementation is capable of
doing. Unfortunately, with many of the Linux GigE drivers, performance
can get kind of hairy once you start going through the entire MPI stack.
I don't know if netperf has an MPI interface, but you might want to play
with that to get some bandwidth numbers when running under LAM. NetPIPE
from Ames Lab has both TCP and MPI interfaces and has proven very useful
to us in the past.
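If you don't feel like building NetPIPE right away, even a crude
ping-pong will give you a rough idea of the bandwidth LAM is getting
between two nodes. A minimal sketch (just an illustration, with fixed
1 MB messages and none of the message-size sweep NetPIPE does for you):

  #include <stdio.h>
  #include <stdlib.h>
  #include <mpi.h>

  /* crude ping-pong between ranks 0 and 1; rank 0 reports MB/s */
  int main(int argc, char **argv)
  {
      int rank, i, iters = 100, len = 1 << 20;   /* 1 MB messages */
      char *buf;
      double t0, t1;
      MPI_Status status;

      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      buf = malloc(len);

      MPI_Barrier(MPI_COMM_WORLD);
      t0 = MPI_Wtime();
      for (i = 0; i < iters; ++i) {
          if (rank == 0) {
              MPI_Send(buf, len, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
              MPI_Recv(buf, len, MPI_CHAR, 1, 0, MPI_COMM_WORLD, &status);
          } else if (rank == 1) {
              MPI_Recv(buf, len, MPI_CHAR, 0, 0, MPI_COMM_WORLD, &status);
              MPI_Send(buf, len, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
          }
      }
      t1 = MPI_Wtime();

      if (rank == 0)
          printf("%.1f MB/s\n",
                 2.0 * iters * len / (t1 - t0) / (1024.0 * 1024.0));

      free(buf);
      MPI_Finalize();
      return 0;
  }

Run two copies on two different nodes and compare the number against what
netperf gave you for raw TCP; if they are wildly different, the problem
is in the MPI/driver interaction rather than in your application.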
> When I run the xhpl binary it is using only a small amount of the CPU
> power on the nodes. If I ssh to a given node it will be between 30-80% of
> the available CPU time. Often at the lower end of that range. In fact my
> GFLOPs seems to suffer the more I increase the number of nodes. This to
> me suggests a communication problem of some sort.
I know nothing about your application, but there are usually two reasons
for scalability issues - not enough bandwidth or latency sensitivity. If
your application is doing tons of small send/recv operations, then latency
is probably the issue. There isn't much I can recommend here, other than
rewriting the app to be less sensitive to latency. Well, or use a network
with lower latency, like Myrinet.
If your application is using medium-sized messages (~64KB - 128KB), you
might be able to get better performance by increasing the short
message/long message cross-over point in LAM. I believe there have been
reports that raising this above the default 64KB helps performance on
GigE with large MTUs, but that might not be the case all the time.
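Purely as a hypothetical example of what that looks like under 7.0 (I'm
writing the parameter name from memory, so check the 7.0 User's Guide
for the real spelling), you'd pass the crossover on the mpirun command
line instead of rebuilding LAM:

  mpirun -ssi rpi tcp -ssi rpi_tcp_short 131072 C ./xhpl

With 6.5.9 the same knob is a compile-time setting, so changing it means
rebuilding LAM.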
On the bandwidth side, LAM should be able to get pretty close to peak
TCP. NetPIPE can help expose problems if you aren't getting close. There
are some GigE drivers that just perform abysmally when running MPI
applications, so you might want to try some other versions of the driver.
Unexpected receives (messages that arrive before the matching receive has
been posted) also carry a heavy bandwidth penalty. An unexpected receive
causes LAM to memcpy() data around, which will always kill performance.
It's even worse on Linux, because the memcpy() in most glibc
implementations has some massive performance issues if the data isn't
aligned just so. We have a workaround in LAM 7.0, but not in 6.5.9.
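The usual application-level fix is to post receives before the matching
sends show up, so the data lands directly in your buffer instead of in
LAM's unexpected-message queue. Just to illustrate the pattern (this is
a sketch; do_local_work() is a stand-in for whatever computation you can
do before the data is needed):

  #include <mpi.h>

  void do_local_work(void);   /* stand-in for useful computation */

  /* Post the receive early, overlap with local work, then wait.
     The message arrives "expected" and skips the staging copy. */
  void receive_preposted(double *buf, int count, int partner, int tag)
  {
      MPI_Request req;
      MPI_Status  status;

      MPI_Irecv(buf, count, MPI_DOUBLE, partner, tag,
                MPI_COMM_WORLD, &req);

      do_local_work();              /* anything that doesn't touch buf */

      MPI_Wait(&req, &status);      /* data is now valid in buf */
  }

You obviously can't restructure HPL's communication easily, but this is
the pattern to look for if you end up staring at traces of your own
codes.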
> When I run mpirun with the -lamd option performance improves somewhat but
> still not to expected levels. Interestingly the lamd daemon in this mode
> uses 30-40% CPU and the xhpl binaries each use 30-40% cpu.
This makes me think you are seeing application blocking issues rather
than bandwidth issues. The lamd mode allows some asynchronous behavior
in the MPI: the message is handed off over a fast local Unix domain
socket and then sent in the background through the lamd while the MPI
app keeps plugging away. Because the lamd does the slow, off-node
communication, the MPI application can get back to work while the
transfer is in flight. The problem is that the lamd isn't very efficient
at sending messages and burns a fair amount of CPU time doing it.
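If it really is blocking sends that hurt, you can get some of the same
effect inside the application with nonblocking sends, nudging the
library along with MPI_Test while you compute. A rough sketch of the
idea (compute_one_block() is just a stand-in for a slice of your real
work):

  #include <mpi.h>

  void compute_one_block(int i);   /* stand-in for a chunk of real work */

  /* Start the send, then call MPI_Test periodically so the transfer
     makes progress while we keep computing. */
  void send_with_overlap(double *buf, int count, int dest, int tag,
                         int nblocks)
  {
      MPI_Request req;
      MPI_Status  status;
      int done = 0, i;

      MPI_Isend(buf, count, MPI_DOUBLE, dest, tag,
                MPI_COMM_WORLD, &req);

      for (i = 0; i < nblocks; ++i) {
          compute_one_block(i);
          if (!done)
              MPI_Test(&req, &done, &status);
      }

      if (!done)
          MPI_Wait(&req, &status);
  }

Whether that buys you much depends on how much computation there is to
hide the transfer behind, but it avoids paying the lamd's CPU overhead.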
At this point, I'd start by playing with the TCP short/long message
crossover and with NetPIPE, and make sure LAM is behaving as you expect.
If that
doesn't help, I'd use a profiling tool or XMPI to look at your
communication patterns and see if there is anything you can do in your
application to improve things. If that doesn't work, it is possible that
your application just won't scale that far on your particular hardware.
By the way, there is a really good User's Guide in PDF format in the LAM
7.0beta releases (you might actually want to grab the latest version of
the PDF out of the nightly CVS build) in the doc/ directory. It has
everything you could possibly want to know about tuning LAM/MPI. However,
remember that in 6.5.9, the tuning parameters are all compile-time only.
The environment variables were only introduced in LAM 7.0. Most of the
User's Guide is 7.0-specific, but it might still be helpful for 6.5.9.
Hope this helps,
Brian
--
Brian Barrett
LAM/MPI developer and all around nice guy
Have a LAM/MPI day: http://www.lam-mpi.org/