Hello everybody,

I am writing this email to tell you about the results of attaching a debugger to the processes of an MPMD application. We followed your advice (referring to Jeff) and ran the MPMD application under valgrind's memcheck, which revealed a serious problem in memory handling that was not detectable when running on a single machine. Thanks again for your invaluable advice.
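
In case it is useful to anyone else, the invocation looked roughly like the sketch below; the executable names, arguments and valgrind options are placeholders rather than our exact command line. Each line of the app schema is simply prefixed with valgrind:

   -np <number-of-procs-a> valgrind --tool=memcheck --leak-check=yes A <some arguments>
   -np <number-of-procs-b> valgrind --tool=memcheck --leak-check=yes B <some other arguments>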

The funny part is that after correcting the memory problem, the processes blocked while trying to construct intercommunicators. After many, many hours of debugging, nothing seemed to be wrong: the intercommunicators should have been constructed, but they were not. Then we put the -ssi coll lam_basic option on the mpirun command line and, just like that, everything ran perfectly! Previously we did not use any SSI flag, under the impression that LAM would take care of selecting the appropriate coll module (available: lam_basic, smp, shmem). In previous installations, on other architectures and with older LAM versions, we did not have this problem (is this normal?). We now use LAM 7.1.1 (with boot: rsh, rpi: lamd|crtcp|tcp and, of course, coll: lam_basic) on a cluster of i686-pc-linux-gnu machines running Scientific Linux 3.0.6.
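
For completeness, the working invocation now looks roughly like this (the app schema file name is a placeholder, and tcp is just one of the rpi modules we use):

   mpirun -ssi rpi tcp -ssi coll lam_basic appschema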

Anyway, our application is now running, and you have contributed a lot to that. We plan to port it to Open MPI next. Thanks again and keep up the good work.
--- Jeff Squyres <jsquyres_at_[hidden]> wrote:
> On Jan 5, 2006, at 9:57 PM, p p wrote:
>
> > i) We have an MPMD application with 2 executables,
> > let's say A and B. When we want to run the
> > application, we use mpirun with an appschema file, in
> > order to put all processes under the same
> > MPI_COMM_WORLD. In this file we have:
> > -np <number-of-procs-a> A <some arguments>
> > -np <number-of-procs-b> B <some other arguments>
> > The question is this: how can we attach a debugger to
> > those processes (to process 0 or even to all of them)?
> > I have tried many methods, based on the debugging info
> > in the lam documentation, but nothing seemed to work.
>
> Do you have Totalview? It *should* just work if you:
>
> mpirun -tv appschema
>
> If you do not have Totalview, you should be able to
> add "xterm -e gdb ..." or "valgrind ..." or whatever
> debugger you want before each application in the app
> schema. Something like this:
>
> -np <npA> xterm -e gdb A
> -np <npB> xterm -e gdb B
>
> This will bring up an xterm running "gdb A" and "gdb B",
> respectively. Variations on this are also possible
> (only limited by scripting imagination here).
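
[A small note from me: since our executables take arguments, something like

   -np <npA> xterm -e gdb --args A <some arguments>

can be used in the app schema, relying on gdb's --args option to keep the program arguments; this line is only a sketch.]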
>
> Let me know if this works.
>
> > ii) The reason that I want to use the debugger in
> > this particular scenario (MPMD application) is that
> > our MPMD application runs without any errors on a
> > single machine, but fails to run on two (or more)
> > machines.
>
> Yikes. Sounds like a classic case of a race condition
> causing badness...
>
> > The error is located when the MPI_Comm_split function
> > is called. However, we have tested this function
> > thoroughly on small test programs and on both single
> > and multiple machine execution and everything worked
> > just fine. So, most probably the problem occurs
> > because of some memory leak. Is there any possibility
> > that the nature of this problem is not connected to
> > memory leaks?
>
> Memory leaks are not typically harmful in this way --
> they just leak memory (i.e., eat up resources). There
> might be some other kind of memory badness, like a
> buffer overflow or somesuch.
>
> Using a memory-checking debugger such as valgrind can
> be really helpful to find such things -- check out
> www.valgrind.org. Be sure to see the LAM FAQ (if you
> haven't already) for some important caveats about using
> LAM with memory-checking debuggers.
>
> [combining all your mails into one]
>
> > Perhaps a useful clue to the problem I sent you
> > before is that when I run all of the processes on
> > the machine that has initially booted the LAM
> > environment via lamboot, the application is executed
> > without any errors. However, when I run all of the
> > processes on the same machine, but not the one that
> > has booted the LAM environment via lamboot, I get the
> > same error as if I run the processes on all of the
> > available machines!!!
>
> Yuk -- these kinds of errors are the worst. Without
> more information, it sounds like a subtle timing / race
> condition that manifests on some of your platforms and
> not others. There is nothing special about the origin
> node in a LAM universe (i.e., the node that executed
> lamboot). We only record which node ran lamboot by
> setting an additional bit on a flag deep in the lamd
> for historical reasons -- that information is not used
> anywhere except for reporting in the lamnodes command.
>
> > The error message is the following:
>
> This indicates that your process seg faulted. Are you
> getting a core dump? If so, you should be able to load
> it up in a debugger and at least examine the state of
> the process when it died.
>
> You might want to look at your shell limits -- ensure
> that coredumpsize is unlimited (you may need to set
> this in your shell startup files such as .bashrc or
> .cshrc to ensure that it takes effect on remote nodes).
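
[For the record, the usual commands for this are the ones below; the executable and core file names are placeholders:

   ulimit -c unlimited              # bash/sh, e.g. in .bashrc
   limit coredumpsize unlimited     # csh/tcsh, e.g. in .cshrc
   gdb ./A core                     # examine the resulting core file

]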
>
> > I have tried to call many MPI functions before
> > calling the MPI_Comm_split function (this is where
> > the errors occur). It seems that the only family of
> > functions that causes the problem is the communicator
> > construction one, meaning MPI_Comm_split,
> > MPI_Comm_create, MPI_Comm_dup, MPI_Comm_spawn,
> > MPI_Intercomm_create, MPI_Intercomm_merge (I have not
> > tested the last two, since I could not create two
> > intracommunicators from MPI_COMM_WORLD). All the
> > other functions (p2p, collective communication, ...)
> > work correctly. Furthermore, when I test the
> > communicator construction functions on small test
> > programs, they work just fine.
>
> All of these functions should be working fine --
> they've been passing our internal regression tests for
> quite a long time.
>
> You might want to double and triple check the
> parameters that are getting passed in to those
> functions. If those all look ok, run the application
> through valgrind and see if it reports anything
> interesting.
>
> Let us know what happens.
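
[In case it helps later readers, this is roughly the shape of the MPI_Comm_split call in question. It is only a minimal sketch: the color/key rule shown (splitting by even/odd rank) is an illustrative assumption, not the rule our application actually uses.

   #include <mpi.h>

   int main(int argc, char **argv)
   {
       int world_rank, color, key;
       MPI_Comm newcomm;

       MPI_Init(&argc, &argv);
       MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

       color = world_rank % 2;  /* which new communicator this rank joins */
       key   = world_rank;      /* rank ordering inside the new communicator */

       /* Collective over MPI_COMM_WORLD: every process must call it. */
       MPI_Comm_split(MPI_COMM_WORLD, color, key, &newcomm);

       /* ... use newcomm ... */

       MPI_Comm_free(&newcomm);
       MPI_Finalize();
       return 0;
   }
]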
>
> What version of LAM/MPI are you using, and what RPI
> are you using?
>
> --
> {+} Jeff Squyres
> {+} The Open MPI Project
> {+} http://www.open-mpi.org/
>
>
> _______________________________________________
> This list is archived at
> http://www.lam-mpi.org/MailArchives/lam/
>