Hello,
You are right! I was using MPI_Send/MPI_Recv in the main body of the program. I
changed them to MPI_Isend/MPI_Irecv, see below:
---------------------------------------------------------
if(rank != 0)
{
    local_max_error = max_error_norm();
    MPI_Isend(&local_max_error, 1, MPI_DOUBLE, 0, RedBlackIter,
              MPI_COMM_WORLD, &request);
    MPI_Request_free(&request);
}
if(rank == 0)
{
    GlobalMaxErr = max_error_norm();
    MPI_Irecv(&local_max_error, 1, MPI_DOUBLE, MPI_ANY_SOURCE, RedBlackIter,
              MPI_COMM_WORLD, &request);
    MPI_Wait(&request, &status);
    if(local_max_error > GlobalMaxErr) GlobalMaxErr = local_max_error;
}
MPI_Bcast(&GlobalMaxErr,1,MPI_DOUBLE,0,MPI_COMM_WORLD);
--------------------------------------------------------------------
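(For what it is worth, what I am ultimately computing here is just the maximum of
local_max_error over all ranks. I understand the same reduction could in principle
be written as a single collective call, roughly as in the sketch below -- untested,
using the same variable names -- but I would first like to understand why the
point-to-point version above fails.)
---------------------------------------------------------
/* Sketch only: the same global maximum expressed as one collective.
   Assumes local_max_error and GlobalMaxErr are doubles on every rank. */
local_max_error = max_error_norm();
MPI_Allreduce(&local_max_error, &GlobalMaxErr, 1, MPI_DOUBLE,
              MPI_MAX, MPI_COMM_WORLD);
/* Every rank now holds the global maximum; no MPI_Bcast is needed. */
---------------------------------------------------------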
Even after replacing the blocking calls with non-blocking send/recv, my code
still hangs for the larger array size, now with a different error message:
MPI_Wait: message truncated (rank 0, MPI_COMM_WORLD)
Rank (0, MPI_COMM_WORLD): Call stack within LAM:
Rank (0, MPI_COMM_WORLD): - MPI_Wait()
Rank (0, MPI_COMM_WORLD): - main()
-----------------------------------------------------------------------------
One of the processes started by mpirun has exited with a nonzero exit
code. This typically indicates that the process finished in error.
If your process did not finish in error, be sure to include a "return
0" or "exit(0)" in your C code before exiting the application.
PID 15757 failed on node n0 with exit status 1.
-----------------------------------------------------------------------------
What is the reason for this? Please help me out.
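In the meantime, to narrow this down, I could compare the size of the incoming
message with the one-double buffer posted on rank 0, e.g. by probing before the
receive and printing what actually arrives. A rough sketch (untested; RedBlackIter
is the tag used above, and <stdio.h> is assumed to be included):
---------------------------------------------------------
MPI_Status probe_status;
int incoming_count = 0;
/* Block until a matching message is available, without receiving it yet. */
MPI_Probe(MPI_ANY_SOURCE, RedBlackIter, MPI_COMM_WORLD, &probe_status);
/* Ask how many MPI_DOUBLEs the pending message actually contains. */
MPI_Get_count(&probe_status, MPI_DOUBLE, &incoming_count);
printf("rank 0: message from rank %d, tag %d, %d doubles\n",
       probe_status.MPI_SOURCE, probe_status.MPI_TAG, incoming_count);
---------------------------------------------------------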
Thanks!
Ravi R. Kumar
Quoting Brian Barrett <brbarret_at_[hidden]>:
> On Mar 20, 2005, at 1:22 AM, Kumar, Ravi Ranjan wrote:
>
> > Below is the subroutine I am using for data exchange between different
> > processes. In my code, I need to solve for 101x101x101 points in a 3D
> > domain.
> > For this I defined a 3D array T[101][101][101] dynamically and to
> > parallelize
> > the problem I divided T[Nz][Nx][Ny], along Nz, into several slices.
> > Each
> > processor works on a slice and needs interface data from neighbouring
> > nodes.
> > For exchanging interface data, I am using non-blocking
> > MPI_Isend/MPI_Irecv; see
> > the subroutine below:
>
> Your earlier output indicated that the error was coming from a call to
> MPI_Recv. Your function only calls MPI_Irecv, which suggests the error is
> not coming from this function. So you are going to need to look through
> the rest of your application to find the source of the error.
>
> Hope this helps,
>
> Brian
>
>
> > void exchange_interface_data_T(.....)
> > {
> >     MPI_Status status;
> >     MPI_Request request;
> >
> >     if(rank%2==0 && rank != num_processes-1){
> >         MPI_Isend(&T[local_Nz][0][0], Nx*Ny, MPI_DOUBLE, rank+1,
> >                   comm_tag, MPI_COMM_WORLD, &request);
> >         MPI_Request_free(&request);
> >     }
> >     else if(rank%2==1){
> >         MPI_Irecv(&T[0][0][0], Nx*Ny, MPI_DOUBLE, rank-1,
> >                   comm_tag, MPI_COMM_WORLD, &request);
> >         MPI_Wait(&request, &status);
> >     }
> >
> >     if(rank%2==1){
> >         MPI_Isend(&T[1][0][0], Nx*Ny, MPI_DOUBLE, rank-1,
> >                   comm_tag+51, MPI_COMM_WORLD, &request);
> >         MPI_Request_free(&request);
> >     }
> >     else if(rank%2==0 && rank != num_processes-1){
> >         MPI_Irecv(&T[local_Nz+1][0][0], Nx*Ny, MPI_DOUBLE, rank+1,
> >                   comm_tag+51, MPI_COMM_WORLD, &request);
> >         MPI_Wait(&request, &status);
> >     }
> >
> >     if(rank%2==0 && rank != 0){
> >         MPI_Isend(&T[1][0][0], Nx*Ny, MPI_DOUBLE, rank-1,
> >                   comm_tag+101, MPI_COMM_WORLD, &request);
> >         MPI_Request_free(&request);
> >     }
> >     else if(rank%2==1 && rank != num_processes-1){
> >         MPI_Irecv(&T[local_Nz+1][0][0], Nx*Ny, MPI_DOUBLE, rank+1,
> >                   comm_tag+101, MPI_COMM_WORLD, &request);
> >         MPI_Wait(&request, &status);
> >     }
> >
> >     if(rank%2==1 && rank != num_processes-1){
> >         MPI_Isend(&T[local_Nz][0][0], Nx*Ny, MPI_DOUBLE, rank+1,
> >                   comm_tag+201, MPI_COMM_WORLD, &request);
> >         MPI_Request_free(&request);
> >     }
> >     else if(rank%2==0 && rank != 0){
> >         MPI_Irecv(&T[0][0][0], Nx*Ny, MPI_DOUBLE, rank-1,
> >                   comm_tag+201, MPI_COMM_WORLD, &request);
> >         MPI_Wait(&request, &status);
> >     }
> > }
> >
> > This is how I am approaching the data exchange between neighbouring nodes
> > (slices).
> > Am I doing something wrong in the data exchange? Please advise.
> >
> > Thanks a lot!
> > Ravi R. Kumar
> >
> >
> >
> >
> > Quoting Brian Barrett <brbarret_at_[hidden]>:
> >
> >> On Mar 20, 2005, at 12:12 AM, Kumar, Ravi Ranjan wrote:
> >>
> >>> I wrote a code in C++ using MPI. It works fine and gives the correct
> >>> result for a smaller 3D array, e.g. T[51][51][51]. However, my code
> >>> hangs when I run the same for the larger case, i.e. T[101][101][101],
> >>> with the error message below:
> >>>
> >>> MPI_Recv: message truncated (rank 0, MPI_COMM_WORLD)
> >>> Rank (0, MPI_COMM_WORLD): Call stack within LAM:
> >>> Rank (0, MPI_COMM_WORLD): - MPI_Recv()
> >>> Rank (0, MPI_COMM_WORLD): - main()
> >>
> >> <snip>
> >>
> >>> I read some time ago that this may be due to a mismatch between the
> >>> number of elements sent and the number received in an MPI_Send/MPI_Recv
> >>> pair. I have checked this many times and found no mismatch in the
> >>> amount of data exchanged, yet I still get this error. What can be the
> >>> reason for this? Could anyone please explain?
> >>
> >> The reason is exactly as you surmised. For some reason, a message has
> >> arrived that is bigger than the buffer you posted. It's hard to tell
> >> why this is occurring, but I would look carefully at your send/recv
> >> pairs again. These are hard ones to debug, as LAM is in an error
> >> condition and doesn't give you much information about what happened.
> >> I
> >> notice you are using blocking receives - this helps a little bit, in
> >> that you can print out which messages are being received (and their
> >> sizes) and you can print out the size of the buffer you are providing
> >> to MPI_Recv. If you send a big message and post an ANY_SOURCE recv,
> >> Murphy's law pretty much guarantees it will happen in the worst order
> >> possible.
> >>
> >>
> >> Hope that helps,
> >>
> >> Brian
>
> --
> Brian Barrett
> LAM/MPI developer and all around nice guy
> Have a LAM/MPI day: http://www.lam-mpi.org/
>
> _______________________________________________
> This list is archived at http://www.lam-mpi.org/MailArchives/lam/
>