Hi,
It is difficult to pinpoint the exact cause of the failure without seeing
your code. However, since the failure appears only when you run on
multiple nodes, the first thing to check is that all of the nodes are
running the same version of LAM.
If that is not the issue, then the error message you got shows that
rank 0 is failing inside a call to MPI_Sendrecv():
MPI_Wait: process in local group is dead (rank 0,MPI_COMM_WORLD)
Rank (0, MPI_COMM_WORLD): Call stack within LAM:
Rank (0, MPI_COMM_WORLD): - MPI_Wait()
Rank (0, MPI_COMM_WORLD): - MPI_Sendrecv()
Rank (0, MPI_COMM_WORLD): - main()
So, it is possible that rank 0 (in MPI_COMM_WORLD) posted an
MPI_Sendrecv() to a peer that had already invoked MPI_Finalize().
Check your code, using a debugger or printf-like statements, to see if
this is the case.
Using a debugger or printf-like statements should give you an idea of
*where* your problem is occurring; then you can figure out *why* it is
happening, and *how* to fix it.
Hope this helps.
--
Sriram Sankaran
email: ssankara_at_[hidden]
http://www.lam-mpi.org/
Thus spake zongenhsu, on Jul 1:
>Date: Tue, 1 Jul 2003 10:54:03 -0700 (PDT)
>From: zongenhsu <zongenhsu_at_[hidden]>
>Reply-To: General LAM/MPI mailing list <lam_at_[hidden]>
>To: lam_at_[hidden]
>Subject: LAM: problem in running mpi program
>
>Hi, everyone
>
>I am doing some computation using 3 CPUs (2 on a
>dual-processor machine and 1 on another
>single-processor machine). There is no problem when I
>run this job with 3 processes on only the 2 CPUs of
>the dual-processor machine, but it fails when I use
>all 3 CPUs. The error message is as follows. Note
>that the first two iterations complete without any
>problem; the failure occurs on the third iteration.
>Does anybody have any suggestions? Thank you.
>
>--------------------------------
>
> 1: 0.26401471785025D+00 : 2 : 63 29 1
>res_ave: 0.34907989208128D-03
> 2: 0.24853339079718D+00 : 2 : 63 29 1
>res_ave: 0.35836744617995D-03
>MPI_Wait: process in local group is dead (rank 0,
>MPI_COMM_WORLD)
>Rank (0, MPI_COMM_WORLD): Call stack within LAM:
>Rank (0, MPI_COMM_WORLD): - MPI_Wait()
>Rank (0, MPI_COMM_WORLD): - MPI_Sendrecv()
>Rank (0, MPI_COMM_WORLD): - main()
>-----------------------------------------------------------------------------
>
>One of the processes started by mpirun has exited with
>a nonzero exit
>code. This typically indicates that the process
>finished in error.
>If your process did not finish in error, be sure to
>include a "return
>0" or "exit(0)" in your C code before exiting the
>application.
>
>PID 29584 failed on node n1 with exit status 1.
>-----------------------------------------------------------------------------
>
>
>
>_______________________________________________
>This list is archived at http://www.lam-mpi.org/MailArchives/lam/
>