On Jan 13, 2005, at 1:42 PM, Yu Chen wrote:
>> It's hard to say without more detail about your application; this
>> could simply be the communication pattern of your application, that
>> it causes blocking and makes processes wait for message passing to
>> complete, etc.
>
> But that program worked in the previous setup, and it never got changed
> (the only difference is the FORTRAN compiler: PGI vs. GNU)
I wish I had a better answer, but "sometimes this just happens" --
there are a *lot* of differences between the 6.x and 7.x series in LAM,
any number of which could (and did!) expose bugs in user applications.
Not that I'm claiming that LAM is 100% bug-free -- no software ever is!
But it's pretty darn stable and lots of people are running production
codes with it. Of course, that being said, if we do find a genuine bug
that your application exposes in LAM, I'll be the first to a) eat crow,
and b) fix the little bugger in LAM.
>> Can you attach a debugger to any of the processes and see what they
>> are doing?
>
> I really don't know how to do it -- could you help me with this?
When the processes are running on your nodes, log in to any of the nodes
and run "ps" to find the PIDs of the two processes on that node (I
assume you're launching 2 processes per node). Then run "gdb --pid
<PID>", replacing <PID> with one of the PIDs of your processes.
This will attach to the process and show you where it currently is in
its execution (most helpful if you compiled your application with -g). It
will show you a stack trace of where the application is currently
executing. From there, you can do all the normal things that you do in
gdb (step, next, examine variables, go up and down the stack trace,
etc.).
You might want to do this simultaneously on several different processes
to see where they are all blocked.
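As a rough sketch of the steps above (the executable name "my_app" is a
placeholder for your own program, the PID is made up, and the
non-interactive "-batch -ex" invocation assumes a reasonably recent gdb;
with an older gdb, just run "gdb --pid <PID>" and type "bt" at the
prompt):

```shell
# Find the PIDs of your application's processes on this node
# ("my_app" is a placeholder -- substitute your executable's name).
ps -C my_app -o pid=

# Attach to one process and print a stack trace, then exit gdb
# (which detaches and lets the process continue).
# Replace 12345 with an actual PID from the "ps" output above.
gdb --pid 12345 -batch -ex "bt"
```

Repeating the "gdb" step in several terminals, one per process, lets you
compare the stack traces and see whether everyone is blocked in the same
MPI call.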
I also strongly recommend running your application through a
memory-checking debugger such as the most recent version of valgrind
(http://valgrind.kde.org). Even if you think your application is
running properly, valgrind can illuminate all kinds of hidden bugs that
you weren't even aware were there (we use Valgrind and other
memory-checking debuggers in developing LAM, for example). Note that
with the default install of LAM on OSCAR clusters, you'll unfortunately
get a lot of false positive reports from valgrind about reads from
uninitialized memory deep within LAM. These are all actually OK; to
make a long story short, they come from a safe optimization that we use
in LAM that Valgrind is unaware of. When you
compile LAM from source, you can use the configure switch --with-purify
to eliminate these false positive reports, but there is a *slight*
performance hit for doing this, so we don't enable it by default (i.e.,
it removes the optimization).
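A minimal sketch of launching an MPI job under valgrind with mpirun (the
executable name "my_app" and the process count are placeholders, and the
exact logging options vary between valgrind versions -- check your
version's manual):

```shell
# Launch each MPI process under valgrind (memcheck is the default tool).
# Writing logs to a file keeps valgrind's output from interleaving with
# the application's output on stdout/stderr.
mpirun -np 4 valgrind --log-file=valgrind.out ./my_app
```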
See the debugging section of the LAM FAQ for a few more hints:
http://www.lam-mpi.org/faq/
--
{+} Jeff Squyres
{+} jsquyres_at_[hidden]
{+} http://www.lam-mpi.org/