LAM/MPI General User's Mailing List Archives

From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2006-02-02 07:50:07


On Jan 31, 2006, at 1:41 AM, Angel Tsankov wrote:

> From time to time I get these messages on stderr from LAM 7.1.1
> running on a cluster of 4x dual G4 PowerPCs:
>
> -------------------------------------------------------------------------
> One of the processes started by mpirun has exited with a nonzero exit
> code. This typically indicates that the process finished in error.
> If your process did not finish in error, be sure to include a "return
> 0" or "exit(0)" in your C code before exiting the application.
>
> PID 26972 failed on node n2 (<IP address omitted>) due to signal 4.
> -------------------------------------------------------------------------

Signal #4 on both Linux and OS X is SIGILL -- illegal instruction.
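If you want to confirm that mapping on a particular node, something like the following Python sketch should do it. The helper name `signal_name` is made up for illustration; only the standard `signal` module is assumed.

```python
import signal


def signal_name(signo: int) -> str:
    """Return the symbolic name of a signal number on this platform."""
    try:
        # signal.Signals is an enum of the signals defined on this OS.
        return signal.Signals(signo).name
    except ValueError:
        # The number does not correspond to a signal known to this platform.
        return f"unknown signal {signo}"


if __name__ == "__main__":
    # The number reported by mpirun in the message quoted above.
    print(signal_name(4))
```

Running it on the node that reported the failure shows how that machine names the signal number that mpirun printed.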

> I also get this on stdout:
>
> MPI_Recv: process in local group is dead (rank 0, comm 3)
> Rank (0, MPI_COMM_WORLD): Call stack within LAM:
> Rank (0, MPI_COMM_WORLD): - MPI_Recv()
> Rank (0, MPI_COMM_WORLD): - MPI_Gatherv()
> Rank (0, MPI_COMM_WORLD): - MPI_Allgather()
> Rank (0, MPI_COMM_WORLD): - MPI_Allreduce()
> Rank (0, MPI_COMM_WORLD): - main()

This indicates that MPI processes 0 and 6 (I snipped some of your
output) have realized that a peer process died unexpectedly -- 0 and
6 realized this while they were in MPI_Allreduce.

> The above messages probably relate to the same problem, which occurred
> in the same way in two consecutive runs of the same program. After
> those failures the program ran successfully.
>
> This problem seems to happen at random. Does anyone have an idea what
> could be wrong? I think the problem occurs on the same machine every
> time. I also have the error and warning messages from configuring,
> building and installing LAM on this cluster. Could they help find the
> problem?

Can you run this application through a memory-checking debugger? If
you have access to an x86-based machine, you can use Valgrind for
this.

It's hard to speculate on what the cause is with this information --
running your application through a memory-checking debugger is
typically the first step. See the LAM FAQ for information about this.

-- 
{+} Jeff Squyres
{+} The Open MPI Project
{+} http://www.open-mpi.org/