On Mar 20, 2005, at 3:35 PM, Kumar, Ravi Ranjan wrote:
> You are right! I was using MPI_Send/MPI_Recv in the main body of the
> program. I changed them to MPI_Isend/MPI_Irecv; see below:
That wasn't my suggestion at all. In fact, since you are using the
non-blocking calls exactly like blocking ones (calling
MPI_Isend/MPI_Irecv immediately followed by an MPI_Wait gains you
nothing), there was little chance of that change fixing anything.
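Just to make the point concrete, here is a minimal sketch (buf, count,
peer, and tag are placeholders, not names from your code):

    /* These two forms are interchangeable; nothing is gained by
       switching from one to the other. */
    MPI_Recv (buf, count, MPI_DOUBLE, peer, tag, MPI_COMM_WORLD, &status);

    /* ...behaves the same as... */

    MPI_Irecv(buf, count, MPI_DOUBLE, peer, tag, MPI_COMM_WORLD, &request);
    MPI_Wait (&request, &status);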
Stop and think about what your application is telling you. When the
main body of your program used MPI_Recv, you got an error saying that
an over-sized message was received. When you changed that to
MPI_Irecv/MPI_Wait, the error message changed to say that an over-sized
message was received in MPI_Wait. This would seem to indicate that the
receive in the main body of your application is getting a message you
didn't expect. Changing the form of the receive isn't going to fix
that - you need to figure out why that message is arriving when you
don't expect it.
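If it helps, one way to see what is actually arriving is to probe
before you post the receive and print what is there. This is only a
sketch - I haven't run it against your code, and the variable names are
made up:

    MPI_Status probe_status;
    int        incoming;

    /* Look at the next pending message without receiving it. */
    MPI_Probe(MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, &probe_status);
    MPI_Get_count(&probe_status, MPI_DOUBLE, &incoming);
    printf("rank 0: next message is %d doubles from rank %d, tag %d\n",
           incoming, probe_status.MPI_SOURCE, probe_status.MPI_TAG);
    /* Then post the matching receive, using a buffer of at least
       'incoming' doubles. */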
This isn't an area where I can help you - I don't debug other people's
MPI applications for them. Well, not for free anyway ;). I would
recommend that you re-analyze your MPI communication: who sends which
messages, and when they are received. Use some printf debugging to
verify that what you think is happening actually is happening. Maybe
even use MPE from Argonne to visualize your communication patterns and
verify them. Finally, if you still can't find it, see if you can find
someone in your department with a license for the TotalView debugger
and use that to debug your application.
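As a rough example of the kind of printf debugging I mean - the wrapper
names below are made up, and this is only a sketch, not a drop-in for
your code - log every send and receive so each rank records what it
thinks it is doing, then compare the logs:

    #include <stdio.h>
    #include <mpi.h>

    /* Made-up wrappers, for illustration only. */
    static void logged_send(void *buf, int count, int dest, int tag,
                            int rank, MPI_Comm comm)
    {
        printf("rank %d: send %d doubles -> rank %d, tag %d\n",
               rank, count, dest, tag);
        fflush(stdout);
        MPI_Send(buf, count, MPI_DOUBLE, dest, tag, comm);
    }

    static void logged_recv(void *buf, int count, int src, int tag,
                            int rank, MPI_Comm comm)
    {
        MPI_Status status;
        int        received;

        printf("rank %d: expect %d doubles <- rank %d, tag %d\n",
               rank, count, src, tag);
        fflush(stdout);
        MPI_Recv(buf, count, MPI_DOUBLE, src, tag, comm, &status);
        MPI_Get_count(&status, MPI_DOUBLE, &received);
        printf("rank %d: got %d doubles <- rank %d, tag %d\n",
               rank, received, status.MPI_SOURCE, status.MPI_TAG);
        fflush(stdout);
    }

If the "expect" and "got" lines ever disagree, that disagreement is the
mismatch you are looking for.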
Hope this helps,
Brian
> ---------------------------------------------------------
> if(rank != 0)
> {
>     local_max_error = max_error_norm();
>     MPI_Isend(&local_max_error, 1, MPI_DOUBLE, 0, RedBlackIter,
>               MPI_COMM_WORLD, &request);
>     MPI_Request_free(&request);
> }
>
> if(rank == 0)
> {
>     GlobalMaxErr = max_error_norm();
>     MPI_Irecv(&local_max_error, 1, MPI_DOUBLE, MPI_ANY_SOURCE, RedBlackIter,
>               MPI_COMM_WORLD, &request);
>     MPI_Wait(&request, &status);
>     if(local_max_error > GlobalMaxErr) GlobalMaxErr = local_max_error;
> }
>
>
> MPI_Bcast(&GlobalMaxErr,1,MPI_DOUBLE,0,MPI_COMM_WORLD);
> --------------------------------------------------------------------
>
>
> Even after replacing with non-blocking send/recv, my code still hangs
> for the larger array size, with a different error message:
>
>
>
> MPI_Wait: message truncated (rank 0, MPI_COMM_WORLD)
> Rank (0, MPI_COMM_WORLD): Call stack within LAM:
> Rank (0, MPI_COMM_WORLD): - MPI_Wait()
> Rank (0, MPI_COMM_WORLD): - main()
> -----------------------------------------------------------------------------
>
> One of the processes started by mpirun has exited with a nonzero exit
> code. This typically indicates that the process finished in error.
> If your process did not finish in error, be sure to include a "return
> 0" or "exit(0)" in your C code before exiting the application.
>
> PID 15757 failed on node n0 with exit status 1.
> -----------------------------------------------------------------------------
>
> What is the reason for this? Please help me out.
>
> Quoting Brian Barrett <brbarret_at_[hidden]>:
>
>> On Mar 20, 2005, at 1:22 AM, Kumar, Ravi Ranjan wrote:
>>
>>> Below is the subroutine I am using for data exchange between different
>>> processes. In my code, I need to solve for 101x101x101 points in a 3D
>>> domain. For this I defined a 3D array T[101][101][101] dynamically, and
>>> to parallelize the problem I divided T[Nz][Nx][Ny], along Nz, into
>>> several slices. Each processor works on a slice and needs interface
>>> data from its neighbouring nodes. For exchanging the interface data, I
>>> am using non-blocking MPI_Isend/MPI_Irecv; see the subroutine below:
>>
>> Your earlier output indicated that the error was coming from a call to
>> MPI_Recv. Your function only calls MPI_Irecv, so the error does not
>> appear to be coming from this function. You are going to need to look
>> at the rest of your application to find the source of the error.
>>
>>> void exchange_interface_data_T(.....)
>>> {
>>>     MPI_Status  status;
>>>     MPI_Request request;
>>>
>>>     if(rank%2==0 && rank != num_processes-1){
>>>         MPI_Isend(&T[local_Nz][0][0], Nx*Ny, MPI_DOUBLE, rank+1,
>>>                   comm_tag, MPI_COMM_WORLD, &request);
>>>         MPI_Request_free(&request);
>>>     }
>>>     else if(rank%2==1){
>>>         MPI_Irecv(&T[0][0][0], Nx*Ny, MPI_DOUBLE, rank-1,
>>>                   comm_tag, MPI_COMM_WORLD, &request);
>>>         MPI_Wait(&request, &status);
>>>     }
>>>
>>>     if(rank%2==1){
>>>         MPI_Isend(&T[1][0][0], Nx*Ny, MPI_DOUBLE, rank-1,
>>>                   comm_tag+51, MPI_COMM_WORLD, &request);
>>>         MPI_Request_free(&request);
>>>     }
>>>     else if(rank%2==0 && rank != num_processes-1){
>>>         MPI_Irecv(&T[local_Nz+1][0][0], Nx*Ny, MPI_DOUBLE, rank+1,
>>>                   comm_tag+51, MPI_COMM_WORLD, &request);
>>>         MPI_Wait(&request, &status);
>>>     }
>>>
>>>     if(rank%2==0 && rank != 0){
>>>         MPI_Isend(&T[1][0][0], Nx*Ny, MPI_DOUBLE, rank-1,
>>>                   comm_tag+101, MPI_COMM_WORLD, &request);
>>>         MPI_Request_free(&request);
>>>     }
>>>     else if(rank%2==1 && rank != num_processes-1){
>>>         MPI_Irecv(&T[local_Nz+1][0][0], Nx*Ny, MPI_DOUBLE, rank+1,
>>>                   comm_tag+101, MPI_COMM_WORLD, &request);
>>>         MPI_Wait(&request, &status);
>>>     }
>>>
>>>     if(rank%2==1 && rank != num_processes-1){
>>>         MPI_Isend(&T[local_Nz][0][0], Nx*Ny, MPI_DOUBLE, rank+1,
>>>                   comm_tag+201, MPI_COMM_WORLD, &request);
>>>         MPI_Request_free(&request);
>>>     }
>>>     else if(rank%2==0 && rank != 0){
>>>         MPI_Irecv(&T[0][0][0], Nx*Ny, MPI_DOUBLE, rank-1,
>>>                   comm_tag+201, MPI_COMM_WORLD, &request);
>>>         MPI_Wait(&request, &status);
>>>     }
>>> }
>>>
>>> This is how I am approaching data exchange between neighbouring nodes
>>> (slices).
>>> Am I doing something wrong in the data exchange? Please advise.
>>>
>>> Quoting Brian Barrett <brbarret_at_[hidden]>:
>>>
>>>> On Mar 20, 2005, at 12:12 AM, Kumar, Ravi Ranjan wrote:
>>>>
>>>>> I wrote a code in C++ using MPI. It works fine and gives correct
>>>>> results for smaller 3D array sizes, e.g. T[51][51][51]. However, my
>>>>> code hangs when I try to run the same for the larger case, i.e.
>>>>> T[101][101][101], with the error message below:
>>>>>
>>>>> MPI_Recv: message truncated (rank 0, MPI_COMM_WORLD)
>>>>> Rank (0, MPI_COMM_WORLD): Call stack within LAM:
>>>>> Rank (0, MPI_COMM_WORLD): - MPI_Recv()
>>>>> Rank (0, MPI_COMM_WORLD): - main()
>>>>
>>>> <snip>
>>>>
>>>>> I read some time ago that this may be due to a mismatch between the
>>>>> number of elements sent and the number received in an
>>>>> MPI_Send/MPI_Recv pair. I have checked this many times and found no
>>>>> mismatch in the amount of data exchanged, but I am still getting
>>>>> this error. What can be the reason for this? Could anyone please
>>>>> explain?
>>>>
>>>> The reason is exactly as you surmised. For some reason, a message has
>>>> arrived that is bigger than the buffer you posted. It's hard to tell
>>>> why this is occurring, but I would look carefully at your send/recv
>>>> pairs again. These are hard ones to debug, as LAM is in an error
>>>> condition and doesn't give you much information about what happened.
>>>> I notice you are using blocking receives - this helps a little bit,
>>>> in that you can print out what messages are being received (and their
>>>> sizes) and you can print out the size of the buffer you are providing
>>>> to MPI_Recv. If you send a big message and post an ANY_SOURCE recv,
>>>> Murphy's law pretty much guarantees the messages will arrive in the
>>>> worst possible order.
--
Brian Barrett
LAM/MPI developer and all around nice guy
Have a LAM/MPI day: http://www.lam-mpi.org/