LAM/MPI General User's Mailing List Archives


From: Angel Tsankov (fn42551_at_[hidden])
Date: 2006-01-31 02:41:13


Hello,

From time to time I get the following messages on stderr from LAM 7.1.1
running on a cluster of four dual-G4 PowerPC nodes:

-----------------------------------------------------------------------------
One of the processes started by mpirun has exited with a nonzero exit
code. This typically indicates that the process finished in error.
If your process did not finish in error, be sure to include a "return
0" or "exit(0)" in your C code before exiting the application.
PID 26972 failed on node n2 (<IP address omitted>) due to signal 4.
-----------------------------------------------------------------------------

-----------------------------------------------------------------------------
One of the processes started by mpirun has exited with a nonzero exit
code. This typically indicates that the process finished in error.
If your process did not finish in error, be sure to include a "return
0" or "exit(0)" in your C code before exiting the application.
PID 26974 failed on node n2 (<IP address omitted>) due to signal 4.
-----------------------------------------------------------------------------
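
The message suggests ending the program with "return 0" or "exit(0)". A
minimal MPI program that terminates the way the message asks for would look
roughly like the following (just a sketch, not my actual code):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    printf("Hello from rank %d\n", rank);

    MPI_Finalize();   /* shut down MPI before leaving main() */
    return 0;         /* clean exit code, so mpirun does not complain */
}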

I also get this on stdout:

MPI_Recv: process in local group is dead (rank 0, comm 3)
Rank (0, MPI_COMM_WORLD): Call stack within LAM:
Rank (0, MPI_COMM_WORLD): - MPI_Recv()
Rank (0, MPI_COMM_WORLD): - MPI_Gatherv()
Rank (0, MPI_COMM_WORLD): - MPI_Allgather()
Rank (0, MPI_COMM_WORLD): - MPI_Allreduce()
Rank (0, MPI_COMM_WORLD): - main()
MPI_Wait: process in local group is dead (rank 3, MPI_COMM_WORLD)
MPI_Recv: process in local group is dead (rank 3, comm 3)
Rank (6, MPI_COMM_WORLD): Call stack within LAM:
Rank (6, MPI_COMM_WORLD): - MPI_Recv()
Rank (6, MPI_COMM_WORLD): - MPI_Bcast()
Rank (6, MPI_COMM_WORLD): - MPI_Allgather()
Rank (6, MPI_COMM_WORLD): - MPI_Allreduce()
Rank (6, MPI_COMM_WORLD): - main()
Rank (3, MPI_COMM_WORLD): Call stack within LAM:
Rank (3, MPI_COMM_WORLD): - MPI_Wait()
Rank (3, MPI_COMM_WORLD): - main()

The above messages probably relate to the same problem, which occurred in
the same way in two consecutive runs of the same program. After those two
failures the program ran successfully.
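
For context, the call stacks show ranks 0 and 6 blocked inside
MPI_Allreduce when the failure occurred. A call of that kind looks roughly
like the sketch below (the variable names, datatype, and reduction
operation are placeholders, not taken from my actual code):

#include <mpi.h>

int main(int argc, char *argv[])
{
    double local = 1.0, global = 0.0;

    MPI_Init(&argc, &argv);

    /* Every rank must enter this collective; the call stacks above show
     * it going through MPI_Allgather, MPI_Gatherv/MPI_Bcast and
     * point-to-point calls inside LAM.  If one process dies first, the
     * others block here and report "process in local group is dead". */
    MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    MPI_Finalize();
    return 0;
}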

The problem seems to happen at random, although I think it always occurs on
the same machine. Does anyone have an idea what could be wrong? I also have
the error and warning messages from configuring, building, and installing
LAM on this cluster. Would they help in tracking down the problem?

Regards,

Angel Tsankov