
LAM/MPI General User's Mailing List Archives


From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2006-01-28 10:23:56


It looks like you are running LAM's mpirun itself through valgrind,
not your application. See the LAM FAQ, under the category "Debugging
MPI programs under LAM/MPI", for details on how to run your app
through a memory-checking debugger. Note in particular that LAM
needs to be configured/installed in a specific way to avoid some
false positives with memory-checking debuggers.
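
For example, the memory checker needs to wrap your executable, not
mpirun (the -np count, valgrind options, and ./my_app name below are
just placeholders for your own setup):

  # this checks mpirun itself (probably what produced your output)
  valgrind mpirun -np 4 ./my_app

  # this runs each MPI process under valgrind instead
  mpirun -np 4 valgrind --tool=memcheck --leak-check=yes ./my_app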

If your LAM installation is missing the gm module, then it was
probably not configured with the path to the GM installation when it
was installed. You need to supply
--with-rpi-gm=/path/to/gm/installation on the LAM/MPI configure line.
See the LAM/MPI Installation Guide for more details.
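
Something like this when rebuilding LAM (the prefix and GM path below
are just placeholders for wherever those live on your system):

  ./configure --prefix=/usr/local/lam-7.1.1 --with-rpi-gm=/opt/gm
  make
  make install

After reinstalling, laminfo should show a gm RPI module in its list
of SSI modules.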

Finally, note that using an OS-bypass network like Myrinet will make
valgrind report all kinds of false positives, because the GM module in
the kernel will be reading, writing, and potentially allocating memory
for your application that valgrind has no visibility into. It is
probably better to use valgrind with the tcp RPI. Once you have your
application running properly, you can then switch to the gm RPI for
better performance.
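
Something along these lines, if you want to switch between the two at
run time (again, the process count and program name are placeholders):

  # debug with the tcp RPI under valgrind
  mpirun -ssi rpi tcp -np 4 valgrind --tool=memcheck ./my_app

  # production runs over Myrinet, once the gm module is installed
  mpirun -ssi rpi gm -np 4 ./my_app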

On Jan 28, 2006, at 2:22 AM, Ton Oguara wrote:

> Hi,
>
> OK, my code reads and processes details of 32 trace files (text
> files) of about 14KB each, but bombs when the size of each trace
> file is about 100KB, giving the following error message.
>
> MPI_Recv: process in local group is dead (rank 4, MPI_COMM_WORLD)
> Rank (4, MPI_COMM_WORLD): Call stack within LAM:
> Rank (4, MPI_COMM_WORLD): - MPI_Recv()
> Rank (4, MPI_COMM_WORLD): - MPI_Barrier()
> Rank (4, MPI_COMM_WORLD): - main()
> MPI_Recv: process in local group is dead (rank 8, MPI_COMM_WORLD)
> Rank (8, MPI_COMM_WORLD): Call stack within LAM:
> Rank (8, MPI_COMM_WORLD): - MPI_Recv()
> Rank (8, MPI_COMM_WORLD): - MPI_Barrier()
> Rank (8, MPI_COMM_WORLD): - main()
> MPI_Recv: process in local group is dead (rank 2, MPI_COMM_WORLD)
> Rank (2, MPI_COMM_WORLD): Call stack within LAM:
> Rank (2, MPI_COMM_WORLD): - MPI_Recv()
> Rank (2, MPI_COMM_WORLD): - MPI_Barrier()
> Rank (2, MPI_COMM_WORLD): - main()
> -----------------------------------------------------------------------------
> One of the processes started by mpirun has exited with a nonzero exit
> code. This typically indicates that the process finished in error.
> If your process did not finish in error, be sure to include a "return
> 0" or "exit(0)" in your C code before exiting the application.
>
> PID 5256 failed on node n0 (10.0.0.33) due to signal 6.
> -----------------------------------------------------------------------------
> MPI_Recv: process in local group is dead (rank 1, MPI_COMM_WORLD)
> Rank (1, MPI_COMM_WORLD): Call stack within LAM:
> Rank (1, MPI_COMM_WORLD): - MPI_Recv()
> Rank (1, MPI_COMM_WORLD): - MPI_Barrier()
> Rank (1, MPI_COMM_WORLD): - main()
>
>
> Next I ran my code through Valgrind and got the following leaks...
> Could this be the reason why?
> I am using LAM 7.1.1 on a Linux environment with Myrinet hardware,
> although I am not running my apps with the gm RPI module, as the
> module is not installed...
>
> ==4678== ERROR SUMMARY: 413 errors from 10 contexts (suppressed: 19
> from 1)
> ==4678== malloc/free: in use at exit: 1,272 bytes in 42 blocks.
> ==4678== malloc/free: 291 allocs, 249 frees, 26,946 bytes allocated.
> ==4678== For counts of detected errors, rerun with: -v
> ==4678== searching for pointers to 42 not-freed blocks.
> ==4678== checked 117,864 bytes.
> ==4678==
> ==4678==
> ==4678== 13 bytes in 4 blocks are definitely lost in loss record 5
> of 13
> ==4678== at 0x401A639: malloc (vg_replace_malloc.c:149)
> ==4678== by 0x804D2C4: sfh_argv_add (in /usr/bin/mpirun)
> ==4678== by 0x804FC19: asc_compat (in /usr/bin/mpirun)
> ==4678== by 0x804A38B: main (in /usr/bin/mpirun)
> ==4678==
> ==4678==
> ==4678== 100 (52 direct, 48 indirect) bytes in 3 blocks are definitely
> lost in loss record 10 of 13
> ==4678== at 0x401B9F1: realloc (vg_replace_malloc.c:306)
> ==4678== by 0x804D324: sfh_argv_add (in /usr/bin/mpirun)
> ==4678== by 0x804D848: sfh_argv_break_quoted (in /usr/bin/mpirun)
> ==4678== by 0x804F484: parseline (in /usr/bin/mpirun)
> ==4678== by 0x804F2FD: asc_bufparse (in /usr/bin/mpirun)
> ==4678== by 0x804B1D0: build_app (in /usr/bin/mpirun)
> ==4678== by 0x804A6AC: main (in /usr/bin/mpirun)
> ==4678==
> ==4678== LEAK SUMMARY:
> ==4678== definitely lost: 65 bytes in 7 blocks.
> ==4678== indirectly lost: 48 bytes in 8 blocks.
> ==4678== possibly lost: 0 bytes in 0 blocks.
> ==4678== still reachable: 1,159 bytes in 27 blocks.
> ==4678== suppressed: 0 bytes in 0 blocks.
> ==4678== Reachable blocks (those to which a pointer was found) are not
> shown.
> ==4678== To see them, rerun with: --show-reachable=yes
>
> Many thanks
> Ton
> _______________________________________________
> This list is archived at http://www.lam-mpi.org/MailArchives/lam/

-- 
{+} Jeff Squyres
{+} The Open MPI Project
{+} http://www.open-mpi.org/