LAM/MPI General User's Mailing List Archives

From: Yu Chen (chen_at_[hidden])
Date: 2005-01-14 09:46:38


Hi, Jeff

I just tried your suggestions; here is some output from "gdb", but I don't
know how to continue from here. With lamd and tcp it just sits there; with
sysv, a source file seems to be missing?

------- running with "-ssi rpi lamd" and "-ssi rpi tcp" ------------
Reading symbols from
/raid1/p12/hhmi/software/Cyana/gnu-lam/cyanaexe.gnu-lam...done.
Using host libthread_db library "/lib/tls/libthread_db.so.1".
Reading symbols from /lib/libutil.so.1...done.
Loaded symbols for /lib/libutil.so.1
Reading symbols from /usr/lib/libg2c.so.0...done.
Loaded symbols for /usr/lib/libg2c.so.0
Reading symbols from /lib/tls/libm.so.6...done.
Loaded symbols for /lib/tls/libm.so.6
Reading symbols from /lib/libgcc_s.so.1...done.
Loaded symbols for /lib/libgcc_s.so.1
Reading symbols from /lib/tls/libpthread.so.0...done.
[Thread debugging using libthread_db enabled]
[New Thread -1218517440 (LWP 28311)]
Loaded symbols for /lib/tls/libpthread.so.0
Reading symbols from /lib/tls/libc.so.6...done.
Loaded symbols for /lib/tls/libc.so.6
Reading symbols from /lib/ld-linux.so.2...done.
Loaded symbols for /lib/ld-linux.so.2
Reading symbols from /lib/libnss_files.so.2...done.
Loaded symbols for /lib/libnss_files.so.2
0x007329ee in __read_nocancel () from /lib/tls/libpthread.so.0
(gdb) step
Single stepping until exit from function __read_nocancel,
which has no line number information.
=============== then (gdb) just stops here =============

------------- running with "-ssi rpi sysv" ------------------
Reading symbols from
/raid1/p12/hhmi/software/Cyana/gnu-lam/cyanaexe.gnu-lam...done.
Using host libthread_db library "/lib/tls/libthread_db.so.1".
Reading symbols from /lib/libutil.so.1...done.
Loaded symbols for /lib/libutil.so.1
Reading symbols from /usr/lib/libg2c.so.0...done.
Loaded symbols for /usr/lib/libg2c.so.0
Reading symbols from /lib/tls/libm.so.6...done.
Loaded symbols for /lib/tls/libm.so.6
Reading symbols from /lib/libgcc_s.so.1...done.
Loaded symbols for /lib/libgcc_s.so.1
Reading symbols from /lib/tls/libpthread.so.0...done.
[Thread debugging using libthread_db enabled]
[New Thread -1218484672 (LWP 28743)]
Loaded symbols for /lib/tls/libpthread.so.0
Reading symbols from /lib/tls/libc.so.6...done.
Loaded symbols for /lib/tls/libc.so.6
Reading symbols from /lib/ld-linux.so.2...done.
Loaded symbols for /lib/ld-linux.so.2
Reading symbols from /lib/libnss_files.so.2...done.
Loaded symbols for /lib/libnss_files.so.2
0x00ed5726 in semop () from /lib/tls/libc.so.6
(gdb) step
Single stepping until exit from function semop,
which has no line number information.
[Switching to Thread -1218484672 (LWP 28743)]
lam_ssi_rpi_sysv_readlock (p=0x4037ef48) at ssi_rpi_sysv_shm.c:296
296 ssi_rpi_sysv_shm.c: No such file or directory.
         in ssi_rpi_sysv_shm.c
===================================================================

Any clue? Thanks in advance!

Regards,
Chen

On Thu, 13 Jan 2005, Jeff Squyres wrote:

> On Jan 13, 2005, at 1:42 PM, Yu Chen wrote:
>
>>> It's hard to say without more detail about your application; this could
>>> simply be the communication pattern of your application, that it causes
>>> blocking and makes processes wait for message passing to complete, etc.
>>
>> But that program worked in the previous setup, and it never got changed (the
>> only difference is the FORTRAN compiler, PGI vs GNU)
>
> I wish I had a better answer, but "sometimes this just happens" -- there are
> a *lot* of differences between the 6.x and 7.x series in LAM, any number of
> which could (and did!) expose bugs in user applications.
>
> Not that I'm claiming that LAM is 100% bug-free -- no software ever is! But
> it's pretty darn stable and lots of people are running production codes with
> it. Of course, that being said, if we do find a genuine bug that your
> application exposes in LAM, I'll be the first to a) eat crow, and b) fix the
> little bugger in LAM.
>
>>> Can you attach a debugger to any of the processes and see what they are
>>> doing?
>>
>> I really don't know how to do it, could you help me with this.
>
> When the processes are running on your nodes, log in to any of the nodes and
> run "ps" to find the PIDs of the two processes on that node (I assume you're
> launching 2 processes per node). Then run "gdb --pid <PID>", replacing <PID>
> with one of the PIDs of your processes.
>
> This will attach to the process and show you where it is in the process (it's
> most helpful if you have compiled your application with -g). It will show
> you a stack trace of where the application is currently executing. From
> there, you can do all the normal things that you do in gdb (step, next,
> examine variables, go up and down the stack trace, etc.).
>
> You might want to do this simultaneously on several different processes to
> see where they are all blocked.
>
> I also strongly recommend running your application through a memory-checking
> debugger such as the most recent version of valgrind
> (http://valgrind.kde.org). Even if you think your application is running
> properly, valgrind can illuminate all kinds of hidden bugs that you weren't
> even aware were there (we use Valgrind and other memory-checking debuggers in
> developing LAM, for example). Note that with the default install of LAM on
> OSCAR clusters, you'll unfortunately get a lot of false positive reports from
> valgrind about reads from uninitialized memory deep within LAM. These are
> all actually ok; to avoid a long story, suffice it to say that it's actually
> a safe optimization that we use in LAM that Valgrind is unaware of. When you
> compile LAM from source, you can use the configure switch --with-purify to
> eliminate these false positive reports, but there is a *slight* performance
> hit for doing this, so we don't enable it by default (i.e., it removes the
> optimization).
>
> See the LAM FAQ for debugging for a few more hints:
>
> http://www.lam-mpi.org/faq/
>
>
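
The attach-and-inspect steps quoted above can be sketched as a small shell
helper. This is only a rough sketch under some assumptions: the function
names bt_command and dump_backtraces are made up for illustration, pgrep is
assumed to be installed on the node, and the installed gdb is assumed to
accept the --batch and -ex options. It asks gdb for a backtrace ("bt")
instead of "step", since a backtrace shows which calls led to the blocking
read or semop, while "step" just waits for the blocked call to return.

```shell
# Hypothetical helper: print the gdb invocation that dumps the stack of
# every thread in one process. --batch makes gdb exit after running the
# -ex commands instead of dropping to the (gdb) prompt.
bt_command() {
    printf 'gdb --batch -ex "thread apply all bt" --pid %s\n' "$1"
}

# Hypothetical helper: dump backtraces for every process on this node whose
# command line matches $1. Assumes pgrep is available; "ps" works as well.
dump_backtraces() {
    for pid in $(pgrep -f "$1"); do
        echo "=== PID $pid ==="
        sh -c "$(bt_command "$pid")"
    done
}

# Example, run on a compute node while the job appears to hang:
#   dump_backtraces cyanaexe.gnu-lam
```

Running this on several nodes at about the same time gives one stack trace
per process, which makes it easier to compare where each one is blocked.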

===========================================
Yu Chen
Howard Hughes Medical Institute
Chemistry Building, Rm 182
University of Maryland at Baltimore County
1000 Hilltop Circle
Baltimore, MD 21250

phone: (410)455-6347 (primary)
         (410)455-2718 (secondary)
fax: (410)455-1174
email: chen_at_[hidden]
===========================================