
LAM/MPI General User's Mailing List Archives


From: Neil Storer (Neil.Storer_at_[hidden])
Date: 2005-03-22 04:52:23


Ravi,

You seem to be sending one MPI_DOUBLE value from the MAIN program of each
of your tasks (except rank 0) to rank 0, but only doing a single receive
on rank 0. This leaves the remaining MPI_DOUBLE-sized messages sitting in
the buffer. The MPI_Bcast on rank 0 will then match one of these buffered
messages, and from that point on you are completely out of step with your
SENDs/RECVs.

Having said that, I can't see how it would ever have worked (even with
your smaller problem size), or why the failure appears to be in the MAIN
program, unless you haven't shown us the full MPI code in the MAIN
program.
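
For what it's worth, the usual way to form a global maximum is a
reduction rather than hand-rolled sends and receives. A minimal sketch,
reusing the variable names from the code quoted below (so treat it as
illustrative, not a drop-in patch):

   double local_max_error = max_error_norm(); /* every rank, incl. 0 */
   double GlobalMaxErr;

   /* deliver max(local_max_error) over all ranks to rank 0 ... */
   MPI_Reduce(&local_max_error, &GlobalMaxErr, 1, MPI_DOUBLE,
              MPI_MAX, 0, MPI_COMM_WORLD);

   /* ... then hand it back to everyone; MPI_Allreduce would do
      both steps in a single call */
   MPI_Bcast(&GlobalMaxErr, 1, MPI_DOUBLE, 0, MPI_COMM_WORLD);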

Kumar, Ravi Ranjan wrote:

>Hello,
>
>You are right! I was using MPI_Send/MPI_Recv in the main body of the program. I
>changed them to MPI_Isend/MPI_Irecv, see below:
>---------------------------------------------------------
>if (rank != 0)
>{
>    local_max_error = max_error_norm();
>    MPI_Isend(&local_max_error, 1, MPI_DOUBLE, 0, RedBlackIter,
>              MPI_COMM_WORLD, &request);
>    MPI_Request_free(&request);
>}
>
>if (rank == 0)
>{
>    GlobalMaxErr = max_error_norm();
>    MPI_Irecv(&local_max_error, 1, MPI_DOUBLE, MPI_ANY_SOURCE,
>              RedBlackIter, MPI_COMM_WORLD, &request);
>    MPI_Wait(&request, &status);
>    if (local_max_error > GlobalMaxErr) GlobalMaxErr = local_max_error;
>}
>
>MPI_Bcast(&GlobalMaxErr, 1, MPI_DOUBLE, 0, MPI_COMM_WORLD);
>--------------------------------------------------------------------
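
(If the point-to-point structure above is kept, rank 0 must post one
receive per sending rank before the broadcast. A minimal sketch of the
rank-0 side, assuming num_processes and status are declared as in the
exchange routine further down:)

   if (rank == 0)
   {
       double recv_err;
       int i;
       GlobalMaxErr = max_error_norm();
       /* one matching receive for each of the num_processes-1 senders */
       for (i = 1; i < num_processes; i++) {
           MPI_Recv(&recv_err, 1, MPI_DOUBLE, MPI_ANY_SOURCE,
                    RedBlackIter, MPI_COMM_WORLD, &status);
           if (recv_err > GlobalMaxErr) GlobalMaxErr = recv_err;
       }
   }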
>
>
>Even after replacing them with non-blocking send/recv, my code still
>hangs for the larger array size, now with a different error message:
>
>
>
>MPI_Wait: message truncated (rank 0, MPI_COMM_WORLD)
>Rank (0, MPI_COMM_WORLD): Call stack within LAM:
>Rank (0, MPI_COMM_WORLD): - MPI_Wait()
>Rank (0, MPI_COMM_WORLD): - main()
>-----------------------------------------------------------------------------
>
>One of the processes started by mpirun has exited with a nonzero exit
>code. This typically indicates that the process finished in error.
>If your process did not finish in error, be sure to include a "return
>0" or "exit(0)" in your C code before exiting the application.
>
>PID 15757 failed on node n0 with exit status 1.
>-----------------------------------------------------------------------------
>
>What is the reason for this? Please help me out.
>
>Thanks!
>Ravi R. Kumar
>
>
>
>
>Quoting Brian Barrett <brbarret_at_[hidden]>:
>
>
>>On Mar 20, 2005, at 1:22 AM, Kumar, Ravi Ranjan wrote:
>>
>>
>>>Below is the subroutine I am using for data exchange between different
>>>processes. In my code, I need to solve for 101x101x101 points in a 3D
>>>domain. For this I defined a 3D array T[101][101][101] dynamically,
>>>and to parallelize the problem I divided T[Nz][Nx][Ny], along Nz, into
>>>several slices. Each processor works on one slice and needs interface
>>>data from the neighbouring nodes. For exchanging interface data, I am
>>>using non-blocking MPI_Isend/MPI_Irecv; see the subroutine below:
>>>
>>Your earlier error message indicated that the failure was coming from
>>a call to MPI_Recv. This function only calls MPI_Irecv, so the error
>>does not appear to be coming from here. You are going to need to look
>>at the rest of your application for the source of the error.
>>
>>Hope this helps,
>>
>>Brian
>>
>>
>>
>>>void exchange_interface_data_T(.....)
>>>{
>>>    MPI_Status status;
>>>    MPI_Request request;
>>>
>>>    /* even ranks send their top interior plane up;
>>>       odd ranks receive it into their bottom ghost plane */
>>>    if (rank % 2 == 0 && rank != num_processes - 1) {
>>>        MPI_Isend(&T[local_Nz][0][0], Nx*Ny, MPI_DOUBLE, rank + 1,
>>>                  comm_tag, MPI_COMM_WORLD, &request);
>>>        MPI_Request_free(&request);
>>>    }
>>>    else if (rank % 2 == 1) {
>>>        MPI_Irecv(&T[0][0][0], Nx*Ny, MPI_DOUBLE, rank - 1,
>>>                  comm_tag, MPI_COMM_WORLD, &request);
>>>        MPI_Wait(&request, &status);
>>>    }
>>>
>>>    /* odd ranks send their bottom interior plane down;
>>>       even ranks receive it into their top ghost plane */
>>>    if (rank % 2 == 1) {
>>>        MPI_Isend(&T[1][0][0], Nx*Ny, MPI_DOUBLE, rank - 1,
>>>                  comm_tag + 51, MPI_COMM_WORLD, &request);
>>>        MPI_Request_free(&request);
>>>    }
>>>    else if (rank % 2 == 0 && rank != num_processes - 1) {
>>>        MPI_Irecv(&T[local_Nz + 1][0][0], Nx*Ny, MPI_DOUBLE, rank + 1,
>>>                  comm_tag + 51, MPI_COMM_WORLD, &request);
>>>        MPI_Wait(&request, &status);
>>>    }
>>>
>>>    /* even ranks send their bottom interior plane down;
>>>       odd ranks receive it into their top ghost plane */
>>>    if (rank % 2 == 0 && rank != 0) {
>>>        MPI_Isend(&T[1][0][0], Nx*Ny, MPI_DOUBLE, rank - 1,
>>>                  comm_tag + 101, MPI_COMM_WORLD, &request);
>>>        MPI_Request_free(&request);
>>>    }
>>>    else if (rank % 2 == 1 && rank != num_processes - 1) {
>>>        MPI_Irecv(&T[local_Nz + 1][0][0], Nx*Ny, MPI_DOUBLE, rank + 1,
>>>                  comm_tag + 101, MPI_COMM_WORLD, &request);
>>>        MPI_Wait(&request, &status);
>>>    }
>>>
>>>    /* odd ranks send their top interior plane up;
>>>       even ranks receive it into their bottom ghost plane */
>>>    if (rank % 2 == 1 && rank != num_processes - 1) {
>>>        MPI_Isend(&T[local_Nz][0][0], Nx*Ny, MPI_DOUBLE, rank + 1,
>>>                  comm_tag + 201, MPI_COMM_WORLD, &request);
>>>        MPI_Request_free(&request);
>>>    }
>>>    else if (rank % 2 == 0 && rank != 0) {
>>>        MPI_Irecv(&T[0][0][0], Nx*Ny, MPI_DOUBLE, rank - 1,
>>>                  comm_tag + 201, MPI_COMM_WORLD, &request);
>>>        MPI_Wait(&request, &status);
>>>    }
>>>}
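
(For comparison, this kind of nearest-neighbour exchange is often
written with MPI_Sendrecv, which removes both the even/odd phasing and
the freed requests. A minimal sketch, assuming each z-plane of T is
contiguous in memory and the ghost planes sit at z = 0 and
z = local_Nz + 1:)

   void exchange_interface_data_T(/* ... */)
   {
       MPI_Status status;
       /* MPI_PROC_NULL turns boundary sends/receives into no-ops,
          so the rank == 0 and rank == num_processes-1 special
          cases disappear */
       int up   = (rank == num_processes - 1) ? MPI_PROC_NULL : rank + 1;
       int down = (rank == 0) ? MPI_PROC_NULL : rank - 1;

       /* send top interior plane up; receive bottom ghost plane
          from below */
       MPI_Sendrecv(&T[local_Nz][0][0], Nx*Ny, MPI_DOUBLE, up,   comm_tag,
                    &T[0][0][0],        Nx*Ny, MPI_DOUBLE, down, comm_tag,
                    MPI_COMM_WORLD, &status);

       /* send bottom interior plane down; receive top ghost plane
          from above */
       MPI_Sendrecv(&T[1][0][0],          Nx*Ny, MPI_DOUBLE, down, comm_tag+1,
                    &T[local_Nz+1][0][0], Nx*Ny, MPI_DOUBLE, up,   comm_tag+1,
                    MPI_COMM_WORLD, &status);
   }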
>>>
>>>This is how I am approaching data exchange between neighbouring nodes
>>>(slices). Am I doing something wrong in the data exchange? Please
>>>advise.
>>>
>>>Thanks a lot!
>>>Ravi R. Kumar
>>>
>>>
>>>
>>>
>>>Quoting Brian Barrett <brbarret_at_[hidden]>:
>>>
>>>
>>>>On Mar 20, 2005, at 12:12 AM, Kumar, Ravi Ranjan wrote:
>>>>
>>>>
>>>>>I wrote a code in C++ using MPI. It works fine and gives the correct
>>>>>result for a smaller 3D array size, e.g. T[51][51][51]. However, the
>>>>>code hangs when I try to run the same for a larger size, i.e.
>>>>>T[101][101][101], with the error message below:
>>>>>
>>>>>MPI_Recv: message truncated (rank 0, MPI_COMM_WORLD)
>>>>>Rank (0, MPI_COMM_WORLD): Call stack within LAM:
>>>>>Rank (0, MPI_COMM_WORLD): - MPI_Recv()
>>>>>Rank (0, MPI_COMM_WORLD): - main()
>>>>>
>>>><snip>
>>>>
>>>>>I read some time ago that this may be due to a mismatch between the
>>>>>amount of data sent and the amount of data received in an
>>>>>MPI_Send/MPI_Recv pair. I have checked this many times and found no
>>>>>mismatch in the amount of data exchanged, yet I am still getting
>>>>>this error. What can be the reason? Could anyone please explain?
>>>>>
>>>>The reason is exactly as you surmised: for some reason, a message has
>>>>arrived that is bigger than the buffer you posted. It's hard to tell
>>>>why this is occurring, but I would look carefully at your send/recv
>>>>pairs again. These are hard ones to debug, as LAM is in an error
>>>>condition and doesn't give you much information about what happened.
>>>>I notice you are using blocking receives - this helps a little, in
>>>>that you can print out what messages are being received (and their
>>>>sizes) as well as the size of the buffer you are providing to
>>>>MPI_Recv. If you send a big message and post an ANY_SOURCE recv,
>>>>Murphy's law pretty much guarantees it will happen in the worst
>>>>order possible.
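
(One concrete way to see what is actually arriving when chasing a
"message truncated" error is to probe the message and query its size
before posting the receive. A minimal fragment; the buffer and its
size here are illustrative:)

   MPI_Status status;
   int count;
   double buf[256];   /* illustrative receive buffer */

   /* block until a message is available, without receiving it */
   MPI_Probe(MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, &status);

   /* how many MPI_DOUBLEs did the sender actually send? */
   MPI_Get_count(&status, MPI_DOUBLE, &count);
   printf("incoming: source=%d tag=%d count=%d\n",
          status.MPI_SOURCE, status.MPI_TAG, count);

   /* receive exactly that message (check count <= 256 first in
      real code) */
   MPI_Recv(buf, count, MPI_DOUBLE, status.MPI_SOURCE, status.MPI_TAG,
            MPI_COMM_WORLD, &status);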
>>>>
>>>>
>>>>Hope that helps,
>>>>
>>>>Brian
>>>>
>>--
>> Brian Barrett
>> LAM/MPI developer and all around nice guy
>> Have a LAM/MPI day: http://www.lam-mpi.org/
>>
>
>_______________________________________________
>This list is archived at http://www.lam-mpi.org/MailArchives/lam/
>

-- 
+-----------------+---------------------------------+------------------+
| Neil Storer     |    Head: Systems S/W Section    | Operations Dept. |
+-----------------+---------------------------------+------------------+
| ECMWF,          | email: neil.storer_at_[hidden]    |    //=\\  //=\\  |
| Shinfield Park, | Tel:   (+44 118) 9499353        |   //   \\//   \\ |
| Reading,        |        (+44 118) 9499000 x 2353 | ECMWF            |
| Berkshire,      | Fax:   (+44 118) 9869450        | ECMWF            |
| RG2 9AX,        |                                 |   \\   //\\   // |
| UK              | URL:   http://www.ecmwf.int/    |    \\=//  \\=//  |
+--+--------------+---------------------------------+----------------+-+
   | ECMWF is the European Centre for Medium-Range Weather Forecasts |
   +-----------------------------------------------------------------+