Neil Storer wrote:
> Ravi,
>
> You seem to be sending 1 MPI_DOUBLE value from the MAIN program of each
> of your tasks (except rank 0) to rank 0, but only doing a single receive
> in rank 0. This will leave the remaining MPI_DOUBLE-sized messages in the
> buffer. The MPI_Bcast on rank 0 will get one of these buffered messages
> and you are now totally out of step with your sends/receives.
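>
> With the code as it stands, rank 0 would need one matching receive for
> each sender, something along these lines (just a sketch, reusing your
> own variable names; num_processes is the communicator size, as in your
> exchange routine):
>
> if (rank == 0) {
>     MPI_Status status;
>     GlobalMaxErr = max_error_norm();
>     /* one receive for each of the num_processes-1 senders */
>     for (int i = 1; i < num_processes; i++) {
>         MPI_Recv(&local_max_error, 1, MPI_DOUBLE, MPI_ANY_SOURCE,
>                  RedBlackIter, MPI_COMM_WORLD, &status);
>         if (local_max_error > GlobalMaxErr)
>             GlobalMaxErr = local_max_error;
>     }
> }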
I am not a LAM developer and I have not looked at the LAM source code,
but I doubt very much that LAM does this. It would be a clear violation
of the standard. Something like this may lead to deadlock, but an
MPI_Bcast should never receive data sent with a point-to-point send.
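
That said, the simplest way to avoid the matching problem altogether is
to let a single collective do both the reduction and the broadcast. A
minimal sketch, reusing the variable names from your snippet (every
rank, including rank 0, contributes its local maximum):

    double local_max_error = max_error_norm();
    double GlobalMaxErr;

    /* combine the per-rank maxima and hand the result to every rank */
    MPI_Allreduce(&local_max_error, &GlobalMaxErr, 1, MPI_DOUBLE,
                  MPI_MAX, MPI_COMM_WORLD);

This replaces the whole Isend/Irecv/Bcast sequence, so there is nothing
left to get out of step.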
Dave.
>
> Having said that, I can't see how it would ever have worked (even with
> your smaller problem size), nor why the failure appears to be in the
> MAIN program, unless you haven't shown us the full MPI code in MAIN.
>
>
>
>
> Kumar, Ravi Ranjan wrote:
>
>>Hello,
>>
>>You are right! I was using MPI_Send/MPI_Recv in the main body of the program. I
>>changed them to MPI_Isend/MPI_Irecv, see below:
>>---------------------------------------------------------
>>if(rank != 0)
>>{
>>    local_max_error = max_error_norm();
>>    MPI_Isend(&local_max_error, 1, MPI_DOUBLE, 0, RedBlackIter,
>>              MPI_COMM_WORLD, &request);
>>    MPI_Request_free(&request);
>>}
>>
>>if(rank == 0)
>>{
>>    GlobalMaxErr = max_error_norm();
>>    MPI_Irecv(&local_max_error, 1, MPI_DOUBLE, MPI_ANY_SOURCE,
>>              RedBlackIter, MPI_COMM_WORLD, &request);
>>    MPI_Wait(&request, &status);
>>    if(local_max_error > GlobalMaxErr) GlobalMaxErr = local_max_error;
>>}
>>
>>MPI_Bcast(&GlobalMaxErr, 1, MPI_DOUBLE, 0, MPI_COMM_WORLD);
>>--------------------------------------------------------------------
>>
>>
>>Even after replacing them with non-blocking send/recv, my code still
>>hangs for the larger array size, with a different error message:
>>
>>
>>
>>MPI_Wait: message truncated (rank 0, MPI_COMM_WORLD)
>>Rank (0, MPI_COMM_WORLD): Call stack within LAM:
>>Rank (0, MPI_COMM_WORLD): - MPI_Wait()
>>Rank (0, MPI_COMM_WORLD): - main()
>>-----------------------------------------------------------------------------
>>
>>One of the processes started by mpirun has exited with a nonzero exit
>>code. This typically indicates that the process finished in error.
>>If your process did not finish in error, be sure to include a "return
>>0" or "exit(0)" in your C code before exiting the application.
>>
>>PID 15757 failed on node n0 with exit status 1.
>>-----------------------------------------------------------------------------
>>
>>What is the reason for this? Please help me out.
>>
>>Thanks!
>>Ravi R. Kumar
>>
>>
>>
>>
>>Quoting Brian Barrett <brbarret_at_[hidden]>:
>>
>>
>>>On Mar 20, 2005, at 1:22 AM, Kumar, Ravi Ranjan wrote:
>>>
>>>
>>>>Below is the subroutine I am using for data exchange between different
>>>>processes. In my code, I need to solve for 101x101x101 points in a 3D
>>>>domain. For this I defined a 3D array T[101][101][101] dynamically
>>>>and, to parallelize the problem, I divided T[Nz][Nx][Ny] along Nz into
>>>>several slices. Each processor works on one slice and needs interface
>>>>data from the neighbouring slices. For exchanging this interface data
>>>>I am using non-blocking MPI_Isend/MPI_Irecv, see the subroutine below:
>>>>
>>>Your earlier error message indicated that the error was coming from a
>>>call to MPI_Recv. Your function only calls MPI_Irecv, which suggests
>>>that the error is not coming from this function. So you are going to
>>>need to look at the rest of your application to find the source of the
>>>error.
>>>
>>>Hope this helps,
>>>
>>>Brian
>>>
>>>
>>>
>>>>void exchange_interface_data_T(.....)
>>>>{
>>>>    MPI_Status status;
>>>>    MPI_Request request;
>>>>
>>>>    /* even ranks send their top plane up; odd ranks receive it
>>>>       into their lower ghost plane */
>>>>    if(rank%2==0 && rank != num_processes-1){
>>>>        MPI_Isend(&T[local_Nz][0][0], Nx*Ny, MPI_DOUBLE, rank+1,
>>>>                  comm_tag, MPI_COMM_WORLD, &request);
>>>>        MPI_Request_free(&request);
>>>>    }
>>>>    else if(rank%2==1){
>>>>        MPI_Irecv(&T[0][0][0], Nx*Ny, MPI_DOUBLE, rank-1,
>>>>                  comm_tag, MPI_COMM_WORLD, &request);
>>>>        MPI_Wait(&request, &status);
>>>>    }
>>>>
>>>>    /* odd ranks send their bottom plane down; even ranks receive it
>>>>       into their upper ghost plane */
>>>>    if(rank%2==1){
>>>>        MPI_Isend(&T[1][0][0], Nx*Ny, MPI_DOUBLE, rank-1,
>>>>                  comm_tag+51, MPI_COMM_WORLD, &request);
>>>>        MPI_Request_free(&request);
>>>>    }
>>>>    else if(rank%2==0 && rank != num_processes-1){
>>>>        MPI_Irecv(&T[local_Nz+1][0][0], Nx*Ny, MPI_DOUBLE, rank+1,
>>>>                  comm_tag+51, MPI_COMM_WORLD, &request);
>>>>        MPI_Wait(&request, &status);
>>>>    }
>>>>
>>>>    /* even ranks send their bottom plane down; odd ranks receive it
>>>>       into their upper ghost plane */
>>>>    if(rank%2==0 && rank != 0){
>>>>        MPI_Isend(&T[1][0][0], Nx*Ny, MPI_DOUBLE, rank-1,
>>>>                  comm_tag+101, MPI_COMM_WORLD, &request);
>>>>        MPI_Request_free(&request);
>>>>    }
>>>>    else if(rank%2==1 && rank != num_processes-1){
>>>>        MPI_Irecv(&T[local_Nz+1][0][0], Nx*Ny, MPI_DOUBLE, rank+1,
>>>>                  comm_tag+101, MPI_COMM_WORLD, &request);
>>>>        MPI_Wait(&request, &status);
>>>>    }
>>>>
>>>>    /* odd ranks send their top plane up; even ranks receive it
>>>>       into their lower ghost plane */
>>>>    if(rank%2==1 && rank != num_processes-1){
>>>>        MPI_Isend(&T[local_Nz][0][0], Nx*Ny, MPI_DOUBLE, rank+1,
>>>>                  comm_tag+201, MPI_COMM_WORLD, &request);
>>>>        MPI_Request_free(&request);
>>>>    }
>>>>    else if(rank%2==0 && rank != 0){
>>>>        MPI_Irecv(&T[0][0][0], Nx*Ny, MPI_DOUBLE, rank-1,
>>>>                  comm_tag+201, MPI_COMM_WORLD, &request);
>>>>        MPI_Wait(&request, &status);
>>>>    }
>>>>}
>>>>
>>>>This is how I am approaching the data exchange between neighbouring
>>>>nodes (slices). Am I doing something wrong in the data exchange?
>>>>Please suggest what I should change.
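>>>>
>>>>Or would it be cleaner to drop the even/odd cases and use MPI_Sendrecv
>>>>for each neighbour? A rough sketch of what I have in mind (same T, Nx,
>>>>Ny, local_Nz and comm_tag as above; MPI_PROC_NULL means the end ranks
>>>>need no special handling):
>>>>
>>>>    int up   = (rank == num_processes-1) ? MPI_PROC_NULL : rank+1;
>>>>    int down = (rank == 0)               ? MPI_PROC_NULL : rank-1;
>>>>    MPI_Status status;
>>>>
>>>>    /* send my top plane up, receive my lower ghost plane from below */
>>>>    MPI_Sendrecv(&T[local_Nz][0][0], Nx*Ny, MPI_DOUBLE, up,   comm_tag,
>>>>                 &T[0][0][0],        Nx*Ny, MPI_DOUBLE, down, comm_tag,
>>>>                 MPI_COMM_WORLD, &status);
>>>>
>>>>    /* send my bottom plane down, receive my upper ghost plane from above */
>>>>    MPI_Sendrecv(&T[1][0][0],          Nx*Ny, MPI_DOUBLE, down, comm_tag+51,
>>>>                 &T[local_Nz+1][0][0], Nx*Ny, MPI_DOUBLE, up,   comm_tag+51,
>>>>                 MPI_COMM_WORLD, &status);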
>>>>
>>>>Thanks a lot!
>>>>Ravi R. Kumar
>>>>
>>>>
>>>>
>>>>
>>>>Quoting Brian Barrett <brbarret_at_[hidden]>:
>>>>
>>>>
>>>>>On Mar 20, 2005, at 12:12 AM, Kumar, Ravi Ranjan wrote:
>>>>>
>>>>>
>>>>>>I wrote a code in C++ using MPI. It works fine and gives the correct
>>>>>>result for a smaller 3D array, e.g. T[51][51][51]. However, the code
>>>>>>hangs when I try to run the same program for the larger case, i.e.
>>>>>>T[101][101][101], with the error message below:
>>>>>>
>>>>>>MPI_Recv: message truncated (rank 0, MPI_COMM_WORLD)
>>>>>>Rank (0, MPI_COMM_WORLD): Call stack within LAM:
>>>>>>Rank (0, MPI_COMM_WORLD): - MPI_Recv()
>>>>>>Rank (0, MPI_COMM_WORLD): - main()
>>>>>>
>>>>><snip>
>>>>>
>>>>>>I read some time ago that this may be due to a mismatch between the
>>>>>>amount of data sent and the amount of data received in an
>>>>>>MPI_Send/MPI_Recv pair. I have checked this many times and found no
>>>>>>mismatch in the amount of data exchanged, yet I am still getting
>>>>>>this error. What can be the reason for this? Could anyone please
>>>>>>explain?
>>>>>>
>>>>>The reason is exactly as you surmised. For some reason, a message has
>>>>>arrived that is bigger than the buffer you posted. It's hard to tell
>>>>>why this is occurring, but I would look carefully at your send/recv
>>>>>pairs again. These are hard ones to debug, as LAM is in an error
>>>>>condition and doesn't give you much information about what happened.
>>>>>I notice you are using blocking receives - this helps a little bit,
>>>>>in that you can print out what messages are arriving (and their
>>>>>sizes) and the size of the buffer you are providing to MPI_Recv. If
>>>>>you send a big message and post an MPI_ANY_SOURCE recv, Murphy's law
>>>>>pretty much guarantees the messages will arrive in the worst order
>>>>>possible.
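>>>>>
>>>>>One way to get that information is to probe the incoming message
>>>>>before posting the receive and compare its size with your buffer.
>>>>>Roughly like this, where count and tag stand in for whatever you are
>>>>>actually passing to MPI_Recv:
>>>>>
>>>>>    MPI_Status status;
>>>>>    int incoming;
>>>>>
>>>>>    /* block until a matching message is available, without receiving it */
>>>>>    MPI_Probe(MPI_ANY_SOURCE, tag, MPI_COMM_WORLD, &status);
>>>>>    MPI_Get_count(&status, MPI_DOUBLE, &incoming);
>>>>>    printf("rank %d: %d doubles from %d (tag %d), buffer holds %d\n",
>>>>>           rank, incoming, status.MPI_SOURCE, status.MPI_TAG, count);
>>>>>
>>>>>    /* then do the real MPI_Recv into a buffer of at least "incoming" doubles */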
>>>>>
>>>>>
>>>>>Hope that helps,
>>>>>
>>>>>Brian
>>>>>
>>>--
>>> Brian Barrett
>>> LAM/MPI developer and all around nice guy
>>> Have a LAM/MPI day: http://www.lam-mpi.org/
>>>
>>
>
> --
> +-----------------+---------------------------------+------------------+
> | Neil Storer | Head: Systems S/W Section | Operations Dept. |
> +-----------------+---------------------------------+------------------+
> | ECMWF, | email: neil.storer_at_[hidden] | //=\\ //=\\ |
> | Shinfield Park, | Tel: (+44 118) 9499353 | // \\// \\ |
> | Reading, | (+44 118) 9499000 x 2353 | ECMWF |
> | Berkshire, | Fax: (+44 118) 9869450 | ECMWF |
> | RG2 9AX, | | \\ //\\ // |
> | UK | URL: http://www.ecmwf.int/ | \\=// \\=// |
> +--+--------------+---------------------------------+----------------+-+
> | ECMWF is the European Centre for Medium-Range Weather Forecasts |
> +-----------------------------------------------------------------+
>
>
> ------------------------------------------------------------------------
>
> _______________________________________________
> This list is archived at http://www.lam-mpi.org/MailArchives/lam/
--
Dr. David Cronk, Ph.D. phone: (865) 974-3735
Research Leader fax: (865) 974-8296
Innovative Computing Lab http://www.cs.utk.edu/~cronk
University of Tennessee, Knoxville