LAM/MPI General User's Mailing List Archives

From: Angel Tsankov (fn42551_at_[hidden])
Date: 2006-02-02 10:41:09


>> From time to time I get these messages on stderr from LAM 7.1.1
>> running on a cluster of 4x dual G4 PowerPCs:
>>
>> -------------------------------------------------------------------------
>> One of the processes started by mpirun has exited with a nonzero exit
>> code. This typically indicates that the process finished in error.
>> If your process did not finish in error, be sure to include a "return
>> 0" or "exit(0)" in your C code before exiting the application.
>> PID 26972 failed on node n2 (<IP address omitted>) due to signal 4.
>> -------------------------------------------------------------------------
>
> Signal #4 on both Linux and OS X is SIGILL -- illegal instruction.
>
>> I also get this on stdout:
>>
>> MPI_Recv: process in local group is dead (rank 0, comm 3)
>> Rank (0, MPI_COMM_WORLD): Call stack within LAM:
>> Rank (0, MPI_COMM_WORLD): - MPI_Recv()
>> Rank (0, MPI_COMM_WORLD): - MPI_Gatherv()
>> Rank (0, MPI_COMM_WORLD): - MPI_Allgather()
>> Rank (0, MPI_COMM_WORLD): - MPI_Allreduce()
>> Rank (0, MPI_COMM_WORLD): - main()
>
> This indicates that MPI processes 0 and 6 (I snipped some of your
> output) have realized that a peer process died unexpectedly -- 0 and
> 6 realized this while they were in MPI_Allreduce.
>

The last MPI function I call before MPI_Finalize is MPI_Allreduce. As
far as I can see, it is implemented in terms of other MPI functions.
With this in mind, is it possible that one of the MPI processes exits
MPI_Allreduce before all the others and calls MPI_Finalize before they
have finished their calls to MPI_Allreduce? If so, one of the remaining
processes could later detect that a peer process is missing. Is it
possible that LAM fails with SIGILL in this situation?
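
To make the call pattern concrete, here is a stripped-down sketch of
how my program ends (the buffer, count and reduction operation here are
placeholders, not the real ones):

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char *argv[])
    {
        int rank, local = 1, global = 0;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* ... earlier computation and MPI calls ... */

        /* Last MPI call before MPI_Finalize; every process enters it. */
        MPI_Allreduce(&local, &global, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);

        /* The question: may a fast process fall through to MPI_Finalize
           here while its peers are still inside MPI_Allreduce, and can
           that make LAM report a dead peer (or die with SIGILL)? */
        MPI_Finalize();

        printf("rank %d done\n", rank);
        return 0;
    }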

> Can you run this application through a memory-checking debugger? If
> you have access to an x86-based machine, you can use the valgrind
> memory-checking debugger.
>

I only have access to a single x86 workstation. Does it make sense to
run multiple MPI processes under Valgrind on a single-CPU machine?
The cluster where LAM 7.1.1 is installed, and where the MPI program
fails as described in my original post, consists of G4 PowerPCs. Can I
run Valgrind on this cluster? The manual mentions that Valgrind can run
on PPC.
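
If running under Valgrind on the workstation does make sense, I assume
the launch would look roughly like this (assuming mpirun simply passes
the wrapper command through to each process; "myprogram" is a
placeholder for my application):

    mpirun -np 4 valgrind --leak-check=full ./myprogram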