
LAM/MPI General User's Mailing List Archives


From: Yu Chen (chen_at_[hidden])
Date: 2005-01-13 14:37:06


Thanks a lot, Jeff, I will try that.

Chen

On Thu, 13 Jan 2005, Jeff Squyres wrote:

> On Jan 13, 2005, at 1:42 PM, Yu Chen wrote:
>
>>> It's hard to say without more detail about your application; this could
>>> simply be the communication pattern of your application, that it causes
>>> blocking and makes processes wait for message passing to complete, etc.
>>
>> But that program worked in the previous setup, and it never got changed (the
>> only difference is the FORTRAN compiler, PGI vs. GNU)
>
> I wish I had a better answer, but "sometimes this just happens" -- there are
> a *lot* of differences between the 6.x and 7.x series in LAM, any number of
> which could (and did!) expose bugs in user applications.
>
> Not that I'm claiming that LAM is 100% bug-free -- no software ever is! But
> it's pretty darn stable and lots of people are running production codes with
> it. Of course, that being said, if we do find a genuine bug that your
> application exposes in LAM, I'll be the first to a) eat crow, and b) fix the
> little bugger in LAM.
>
>>> Can you attach a debugger to any of the processes and see what they are
>>> doing?
>>
>> I really don't know how to do that; could you help me with this?
>
> When the processes are running on your nodes, login to any of the nodes and
> run "ps" to find the PIDs of the two processes on that node (I assume you're
> launching 2 processes per node). Then run "gdb --pid <PID>", replacing <PID>
> with one of the PIDs of your processes.
>
> This will attach to the process (it's most helpful if you have compiled your
> application with -g) and show you a stack trace of where the application is
> currently executing. From there, you can do all the normal things that you
> do in gdb (step, next, examine variables, go up and down the stack trace,
> etc.).
>
> You might want to do this simultaneously on several different processes to
> see where they are all blocked.
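
The steps described above can be sketched as a few shell commands; here "my_app" is a hypothetical executable name, so substitute your own:

```shell
# Hedged sketch of attaching gdb to a running MPI process on one node.
# "my_app" is a hypothetical binary name, not from the original mail.
pids=$(pgrep my_app || true)              # PIDs of the app's processes here
echo "PIDs on this node: ${pids:-none found}"
# Attach interactively to one of them (run by hand, not from a script):
#   gdb --pid <PID>
# Useful gdb commands once attached:
#   bt          -- print the current stack trace
#   up / down   -- move along the stack frames
#   step / next -- single-step through the source
#   continue    -- let the process keep running
```

Repeating this on two or three nodes at once makes it easy to spot where all the ranks are blocked.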
>
> I also strongly recommend running your application through a memory-checking
> debugger such as the most recent version of valgrind
> (http://valgrind.kde.org). Even if you think your application is running
> properly, valgrind can illuminate all kinds of hidden bugs that you weren't
> even aware were there (we use Valgrind and other memory-checking debuggers in
> developing LAM, for example). Note that with the default install of LAM on
> OSCAR clusters, you'll unfortunately get a lot of false positive reports from
> valgrind about reads from uninitialized memory deep within LAM. These are
> all actually ok; to avoid a long story, suffice it to say that it's actually
> a safe optimization that we use in LAM that Valgrind is unaware of. When you
> compile LAM from source, you can use the configure switch --with-purify to
> eliminate these false positive reports, but there is a *slight* performance
> hit for doing this, so we don't enable it by default (i.e., it removes the
> optimization).
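
Running every rank under valgrind can be sketched like this; the "mpirun C" invocation (LAM's one-process-per-CPU syntax) and the "./my_app" binary name are assumptions, not taken from the original mail, and the line is commented out because it needs a running LAM cluster:

```shell
# Hedged sketch: run each MPI rank under valgrind's memcheck tool.
# "mpirun C" and ./my_app are assumptions; adapt to your own setup.
# Needs a live LAM universe (lamboot), so it is left commented out:
#   mpirun C valgrind ./my_app
# Keep the sketch runnable stand-alone by just printing the command:
cmd="mpirun C valgrind ./my_app"
echo "$cmd"
```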
>
> See the LAM FAQ for debugging for a few more hints:
>
> http://www.lam-mpi.org/faq/
>

===========================================
Yu Chen
Howard Hughes Medical Institute
Chemistry Building, Rm 182
University of Maryland at Baltimore County
1000 Hilltop Circle
Baltimore, MD 21250

phone: (410)455-6347 (primary)
       (410)455-2718 (secondary)
fax: (410)455-1174
email: chen_at_[hidden]
===========================================