Hi,
My code reads and processes 32 trace files (text files) of about 14 KB
each without problems, but it crashes when each trace file is about
100 KB, giving the following error message:
MPI_Recv: process in local group is dead (rank 4, MPI_COMM_WORLD)
Rank (4, MPI_COMM_WORLD): Call stack within LAM:
Rank (4, MPI_COMM_WORLD): - MPI_Recv()
Rank (4, MPI_COMM_WORLD): - MPI_Barrier()
Rank (4, MPI_COMM_WORLD): - main()
MPI_Recv: process in local group is dead (rank 8, MPI_COMM_WORLD)
Rank (8, MPI_COMM_WORLD): Call stack within LAM:
Rank (8, MPI_COMM_WORLD): - MPI_Recv()
Rank (8, MPI_COMM_WORLD): - MPI_Barrier()
Rank (8, MPI_COMM_WORLD): - main()
MPI_Recv: process in local group is dead (rank 2, MPI_COMM_WORLD)
Rank (2, MPI_COMM_WORLD): Call stack within LAM:
Rank (2, MPI_COMM_WORLD): - MPI_Recv()
Rank (2, MPI_COMM_WORLD): - MPI_Barrier()
Rank (2, MPI_COMM_WORLD): - main()
-----------------------------------------------------------------------------
One of the processes started by mpirun has exited with a nonzero exit
code. This typically indicates that the process finished in error.
If your process did not finish in error, be sure to include a "return
0" or "exit(0)" in your C code before exiting the application.
PID 5256 failed on node n0 (10.0.0.33) due to signal 6.
-----------------------------------------------------------------------------
MPI_Recv: process in local group is dead (rank 1, MPI_COMM_WORLD)
Rank (1, MPI_COMM_WORLD): Call stack within LAM:
Rank (1, MPI_COMM_WORLD): - MPI_Recv()
Rank (1, MPI_COMM_WORLD): - MPI_Barrier()
Rank (1, MPI_COMM_WORLD): - main()
I am using LAM 7.1.1 on Linux with Myrinet hardware, although I am not
running my applications with the gm RPI module, as that module is not
installed.
Next I ran my code through Valgrind and got the following leak report.
Could these leaks be the cause of the crash?
==4678== ERROR SUMMARY: 413 errors from 10 contexts (suppressed: 19 from 1)
==4678== malloc/free: in use at exit: 1,272 bytes in 42 blocks.
==4678== malloc/free: 291 allocs, 249 frees, 26,946 bytes allocated.
==4678== For counts of detected errors, rerun with: -v
==4678== searching for pointers to 42 not-freed blocks.
==4678== checked 117,864 bytes.
==4678==
==4678==
==4678== 13 bytes in 4 blocks are definitely lost in loss record 5 of 13
==4678== at 0x401A639: malloc (vg_replace_malloc.c:149)
==4678== by 0x804D2C4: sfh_argv_add (in /usr/bin/mpirun)
==4678== by 0x804FC19: asc_compat (in /usr/bin/mpirun)
==4678== by 0x804A38B: main (in /usr/bin/mpirun)
==4678==
==4678==
==4678== 100 (52 direct, 48 indirect) bytes in 3 blocks are definitely lost in loss record 10 of 13
==4678== at 0x401B9F1: realloc (vg_replace_malloc.c:306)
==4678== by 0x804D324: sfh_argv_add (in /usr/bin/mpirun)
==4678== by 0x804D848: sfh_argv_break_quoted (in /usr/bin/mpirun)
==4678== by 0x804F484: parseline (in /usr/bin/mpirun)
==4678== by 0x804F2FD: asc_bufparse (in /usr/bin/mpirun)
==4678== by 0x804B1D0: build_app (in /usr/bin/mpirun)
==4678== by 0x804A6AC: main (in /usr/bin/mpirun)
==4678==
==4678== LEAK SUMMARY:
==4678== definitely lost: 65 bytes in 7 blocks.
==4678== indirectly lost: 48 bytes in 8 blocks.
==4678== possibly lost: 0 bytes in 0 blocks.
==4678== still reachable: 1,159 bytes in 27 blocks.
==4678== suppressed: 0 bytes in 0 blocks.
==4678== Reachable blocks (those to which a pointer was found) are not shown.
==4678== To see them, rerun with: --show-reachable=yes
Many thanks
Ton