
LAM/MPI General User's Mailing List Archives


From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2006-01-06 18:01:30


On Jan 5, 2006, at 9:57 PM, p p wrote:

> i) We have an MPMD application with 2 executables,
> let's say A and B. When we want to run the
> application, we use mpirun with an appschema file, in
> order to put all processes under the same
> MPI_COMM_WORLD. In this file we have:
> -np <number-of-procs-a> A <some arguments>
> -np <number-of-procs-b> B <some other arguments>
> The question is this: how can we attach a debugger to
> those processes (to process 0 or even to all of them)?
> I have tried many methods, based on the debugging info
> in the lam documentation, but nothing seemed to work.

Do you have Totalview? It *should* just work if you:

        mpirun -tv appschema

If you do not have Totalview, you should be able to add "xterm -e
gdb ..." or "valgrind ..." or whatever debugger you want before each
application in the app schema. Something like this:

-np <npA> xterm -e gdb A
-np <npB> xterm -e gdb B

This will bring up one xterm per process, running "gdb A" or "gdb B"
as appropriate. Variations on this are also possible (limited only
by scripting imagination); one such variation is sketched below.
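
One variation on this theme (just a sketch -- the flag name below is
made up, and this is not specific to LAM) is to have each process
print its host and PID and then spin until you attach a debugger and
clear the flag:

#include <stdio.h>
#include <unistd.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank;
    char host[256];
    volatile int wait_for_debugger = 1;  /* made-up name; any flag works */

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    gethostname(host, sizeof(host));
    printf("rank %d has pid %d on %s\n", rank, (int) getpid(), host);
    fflush(stdout);

    /* Attach gdb to the pid printed above, then
       "set var wait_for_debugger = 0" to let the process continue. */
    while (wait_for_debugger)
        sleep(1);

    /* ... rest of the application ... */

    MPI_Finalize();
    return 0;
}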

Let me know if this works.

> ii) The reason that I want to use the debugger in this
> particular scenario (an MPMD application) is that our
> MPMD application runs without any errors on a single
> machine, but fails to run on two (or more) machines.

Yikes. Sounds like a classic case of a race condition causing
badness...

> The error is located when the MPI_Comm_split function
> is called. However, we have tested this function
> thoroughly on small test programs and on both single
> and multiple machine execution and everything worked
> just fine. So, most probably the problem occurs
> because of some memory leak. Is there any possibility
> that the nature of this problem is not connected to
> memory leaks?

Memory leaks are not typically harmful in this way -- they just leak
memory (i.e., eat up resources). There might be some other kind of
memory badness, like a buffer overflow or somesuch.

Using a memory-checking debugger such as valgrind can be really
helpful for finding such things -- check out www.valgrind.org. Be
sure to see the LAM FAQ (if you haven't already) for some important
caveats about using LAM with memory-checking debuggers.
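
For example (a sketch only -- adjust the options to taste; depending
on your valgrind version, it may append the PID to the log file name
automatically), you could put something like this in the app schema
to get one valgrind log per process:

-np <npA> valgrind --log-file=vg-A A <some arguments>
-np <npB> valgrind --log-file=vg-B B <some other arguments>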

[combining all your mails into one]

> Perhaps a useful clue to the problem I sent you
> before is that when I run all of the processes on the
> machine that initially booted the LAM environment via
> lamboot, the application executes without any errors.
> However, when I run all of the processes on a single
> machine that is not the one that booted the LAM
> environment, I get the same error as when I run the
> processes on all of the available machines!

Yuk -- these kinds of errors are the worst. Without more
information, it sounds like a subtle timing / race condition that
manifests on some of your platforms and not others. There is nothing
special about the origin node in a LAM universe (i.e., the node that
executed lamboot). We only record which node ran lamboot by setting
an additional bit on a flag deep in the lamd for historical reasons
-- that information is not used anywhere except for reporting in the
lamnodes command.

> The error message is the following:

This indicates that your process seg faulted. Are you getting a core
dump? If so, you should be able to load it up in a debugger and at
least examine the state of the process when it died.
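
For example, if process A dumped a core file named "core" in the
working directory (the exact file name varies by system), something
like this will show where it died:

        gdb A core
        (gdb) bt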

You might want to look at your shell limits -- ensure that
coredumpsize is unlimited (you may need to set this in your shell
startup files such as .bashrc or .cshrc to ensure that it takes
effect on remote nodes).
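
The exact syntax depends on your shell; for example:

        # sh/bash-style shells (e.g., in .bashrc)
        ulimit -c unlimited

        # csh/tcsh-style shells (e.g., in .cshrc)
        limit coredumpsize unlimited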

> I have tried to call many MPI functions before
> calling the MPI_Comm_split function (this is where the
> errors occur). It seems that the only family of
> functions that cause the problem is the communicator
> construction one, meaning MPI_Comm_split,
> MPI_Comm_create, MPI_Comm_dup, MPI_Comm_spawn,
> MPI_Intercomm_create, MPI_Intercomm_merge (I have not
> tested the last two, since I could not create two
> intracommunicators from MPI_COMM_WORLD). All the other
> functions (p2p, collective communication, ...) work
> correctly. Furthermore, when I test the communicator
> construction functions on small test programs they
> work just fine.

All of these functions should be working fine -- they've been passing
our internal regression tests for quite a long time.

You might want to double and triple check the parameters that are
getting passed in to those functions. If those all look ok, run the
application through valgrind and see if it reports anything interesting.
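
As a point of comparison, here is a minimal MPI_Comm_split program
(just a sketch) that should run under any MPI implementation. Note
that every process in the parent communicator must make the call,
the color must be a non-negative integer (or MPI_UNDEFINED), and the
key can be any integer:

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int world_rank, new_rank;
    MPI_Comm newcomm;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

    /* Split MPI_COMM_WORLD into two communicators: even and odd ranks */
    MPI_Comm_split(MPI_COMM_WORLD, world_rank % 2, world_rank, &newcomm);

    MPI_Comm_rank(newcomm, &new_rank);
    printf("world rank %d -> new rank %d\n", world_rank, new_rank);

    MPI_Comm_free(&newcomm);
    MPI_Finalize();
    return 0;
}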

Let us know what happens.

What version of LAM/MPI are you using, and what RPI are you using?

--
{+} Jeff Squyres
{+} The Open MPI Project
{+} http://www.open-mpi.org/