
LAM/MPI General User's Mailing List Archives


From: Josh Hursey (jjhursey_at_[hidden])
Date: 2005-02-16 16:51:20


When you call:
   MPI_Wait(&request,&status);

'request' needs to be initialized by a non-blocking send (such as
MPI_Isend) or receive (such as MPI_Irecv). In the code below you have a
basic, blocking MPI_Send, which doesn't interact with MPI_Requests, so
the SIGSEGV is coming from the uninitialized argument to MPI_Wait.

Take a look at the man pages for MPI_Send and MPI_Isend for the
differences. Since you are using only blocking calls, the MPI_Wait can
simply be taken out of the code below to get it working again.
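To illustrate the difference, here is a minimal sketch of the two valid
patterns ('buf', 'count', and 'peer' are hypothetical placeholders, not
names from your code):

```c
#include <mpi.h>

/* Sketch only: 'buf', 'count', and 'peer' are placeholder arguments. */
void demo(double *buf, int count, int peer)
{
    MPI_Status  status;
    MPI_Request request;

    /* Blocking send: the operation is complete when the call returns,
       so there is no request and nothing to MPI_Wait on. */
    MPI_Send(buf, count, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD);

    /* Non-blocking send: MPI_Isend fills in 'request', and it is the
       MPI_Wait on that request that completes the operation. */
    MPI_Isend(buf, count, MPI_DOUBLE, peer, 1, MPI_COMM_WORLD, &request);
    /* ... overlap computation here ... */
    MPI_Wait(&request, &status);
}
```

Calling MPI_Wait on a request that no call ever initialized (as in the
code below) hands the library a garbage pointer, hence the SIGSEGV.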

Josh

On Feb 16, 2005, at 2:12 PM, Kumar, Ravi Ranjan wrote:

> Thanks a lot Jeff!
>
> I could fix the first error (invalid rank) by putting appropriate
> bounds.
> However, the second error (SIGSEGV in MPI_Wait) is still appearing. Is
> it due
> to wrong arguments in MPI_Wait?
>
> This is the subroutine for data exchange:
>
> void exchange_interface_data(int rank, int local_Nz, int comm_tag)
> {
>
> int err;
> MPI_Status status;
> MPI_Request request;
>
> if(rank%2==0 && rank != num_processes-1)
>     MPI_Send(&T[1][1][1+rows_per_process*rank], Nx*Ny, MPI_DOUBLE,
>              rank+1, comm_tag+rank, MPI_COMM_WORLD);
>
> if(rank%2==1 && rank != 0)
>     MPI_Recv(&T[1][1][1+rows_per_process*rank], Nx*Ny, MPI_DOUBLE,
>              rank-1, comm_tag+rank-1, MPI_COMM_WORLD, &status);
>
> MPI_Wait(&request,&status);
>
> if(err==1)
>     cout<<"Error in MPI_Send/Recv"<<endl;
>
> if(rank%2==1 && rank != 0)
>     MPI_Send(&T[1][1][local_Nz], Nx*Ny, MPI_DOUBLE, rank-1,
>              comm_tag+rank+50, MPI_COMM_WORLD);
>
> if(rank%2==0 && rank != num_processes-1)
>     MPI_Recv(&T[1][1][local_Nz], Nx*Ny, MPI_DOUBLE, rank+1,
>              comm_tag+rank+51, MPI_COMM_WORLD, &status);
>
>
> }
>
> Am I using MPI_Wait correctly or not? It would be really great if you
> could help me.
>
> Thanks a lot!
>
> Ravi R. Kumar
> Research Assistant
> 318 RGAN, RTL
> University of Kentucky
> (859) 257-6336 x 80697
>
>
>
>
> Quoting Jeff Squyres <jsquyres_at_[hidden]>:
>
>> It's hard to debug someone else's application, particularly one as
>> complex as this, especially without seeing the entire application.
>> However, I have a few suggestions for you:
>>
>> 1. It seems that the outputs contain two errors: invalid rank in
>> MPI_SEND and segv in MPI_WAIT. The invalid rank should be pretty easy
>> to track down. I notice that one of your MPI_SENDs goes to rank+1,
>> but you don't do any bounds checking to ensure that rank+1 is a valid
>> rank in MPI_COMM_WORLD (e.g., if you had an odd number of processes).
>>
>> 2. A seg fault in MPI_WAIT is typically (but not always) a symptom of
>> memory badness elsewhere in the application (e.g., a buffer overflow).
>> I highly suggest that you run your application through a
>> memory-checking debugger (such as Valgrind, if you're running on
>> x86/Linux) to see what it can find for you. See the LAM FAQ for
>> details on how to do this.
>>
>> As a final suggestion, unless you're simply trying to segregate your
>> debugging output, I'd remove the calls to MPI_BARRIER -- they don't
>> seem to serve any purpose.
>>
>>
>>
>> On Feb 15, 2005, at 2:07 PM, Kumar, Ravi Ranjan wrote:
>>
>>> Hello,
>>>
>>> I have been trying to fix an error that might be due to MPI_Send or
>>> MPI_Recv. I am trying to implement Red-Black SOR for solving a 3-D
>>> heat conduction problem, which requires a parallel solution of the
>>> system of linear equations At=F.
>>>
>>> A is a 7-banded coefficient matrix stored in an N x 7 2-D array; t
>>> represents the temperature field for the 3-D domain and hence is
>>> stored in a 3-D array. F is a vector (single-column matrix).
>>>
>>> I divided the cuboidal piece along its thickness and assigned each
>>> slice to a processor. Within a slice, red & black planes are defined
>>> one after another. Data needs to be exchanged between adjacent slices
>>> to achieve a parallel solution of the above problem.
>>>
>>> Below is the code & subroutine I am using:
>>>
>>> -----------------------------------------------------------
>>> for(n=1; n<=Nt; n++)
>>> {
>>>
>>> cout<<"This is rank number - "<<rank<<endl;
>>>
>>> if(rank != num_processes-1) local_Nz = rows_per_process*(1+rank);
>>> else local_Nz = Nz;
>>>
>>> for(k=1+rank*rows_per_process; k<=local_Nz; k++)
>>>     cout<<"rank = "<<rank<<" Temp = "<<T[temp1][temp2][k]
>>>         <<" local Nz = "<<local_Nz<<endl;
>>>
>>>
>>> calculate_F(rank, local_Nz);
>>>
>>> comm_tag = n;
>>> exchange_interface_data(rank, local_Nz, comm_tag);
>>>
>>> Red_SOR(A, F, T, rank, local_Nz);
>>>
>>> comm_tag = n+1;
>>> exchange_interface_data(rank, local_Nz, comm_tag);
>>>
>>> Black_SOR(A, F, T, rank, local_Nz);
>>>
>>> for(k=1+rank*rows_per_process; k<=local_Nz; k++)
>>>     for(j=1; j<=Ny; j++)
>>>         for(i=1; i<=Nx; i++)
>>>             u[i][j][k] = (1 + 2*Tq/t) * T[i][j][k]
>>>                        + (1 - 2*Tq/t) * old_T[i][j][k] - u[i][j][k];
>>>
>>>
>>> for(k=1+rank*rows_per_process; k<=local_Nz; k++)
>>> for(j=1; j<=Ny; j++)
>>> for(i=1; i<=Nx; i++)
>>> old_T[i][j][k] = T[i][j][k];
>>>
>>>
>>> cout<<rank<<" prints value of F[1] = "<<F[1]<<" m = "<<m<<endl;
>>>
>>> MPI_Barrier(MPI_COMM_WORLD);
>>>
>>>
>>> if(rank == 0)
>>> {
>>> outfile_SOR.precision(20);
>>> outfile_SOR<<setw(20)<<t*n<<" "<<setw(20)<<T[1][1][1]<<" "<<endl;
>>> cout<<"n = "<<n<<" Nt = "<<Nt<<endl;
>>> }
>>>
>>> for(k=1+rank*rows_per_process; k<=local_Nz; k++)
>>>     cout<<"rank = "<<rank<<" Temp = "<<T[temp1][temp2][k]
>>>         <<" local Nz = "<<local_Nz<<endl;
>>>
>>> MPI_Barrier(MPI_COMM_WORLD);
>>>
>>> }
>>>
>>> -------------------------------------------------------
>>>
>>> And the subroutine for data exchange is as follows:
>>>
>>> -------------------------------------------------------------
>>>
>>> void exchange_interface_data(int rank, int local_Nz, int comm_tag)
>>> {
>>>
>>> int err;
>>> MPI_Status status;
>>> MPI_Request request;
>>>
>>> cout<<rank<<" printing from exchange_interface_data subroutine"<<endl;
>>>
>>> if(rank%2==0)
>>>     MPI_Send(&T[1][1][1+rows_per_process*rank], Nx*Ny, MPI_DOUBLE,
>>>              rank+1, comm_tag+rank, MPI_COMM_WORLD);
>>>
>>> if(rank%2==1)
>>>     MPI_Recv(&T[1][1][1+rows_per_process*rank], Nx*Ny, MPI_DOUBLE,
>>>              rank-1, comm_tag+rank-1, MPI_COMM_WORLD, &status);
>>>
>>>
>>> MPI_Wait(&request,&status);
>>>
>>> if(err==1)
>>> {
>>> cout<<"Error in MPI_Send/Recv"<<endl;
>>>
>>> }
>>>
>>> if(rank%2==1)
>>>     MPI_Send(&T[1][1][local_Nz], Nx*Ny, MPI_DOUBLE, rank-1,
>>>              comm_tag+rank+50, MPI_COMM_WORLD);
>>>
>>> if(rank%2==0)
>>>     MPI_Recv(&T[1][1][local_Nz], Nx*Ny, MPI_DOUBLE, rank+1,
>>>              comm_tag+rank+51, MPI_COMM_WORLD, &status);
>>>
>>>
>>> cout<<"end of exchange_interface_data subroutine"<<endl;
>>> }
>>>
>>> ----------------------------------------------
>>>
>>> These are the outputs for two different cases:
>>>
>>> *******************************************************
>>> [rrkuma0_at_kfc1s1 SOR]$ mpirun -np 5 foo
>>> Mon Feb 14 23:37:07 2005
>>>
>>> enter the value of physical time (in pico-seconds)
>>> .01
>>> enter the value of space step size in X dierction (nano-meter)
>>> 20
>>> enter the value of space step size in Y dierction(nano-meter)
>>> 20
>>> enter the value of space step size in Z dierction(nano-meter)
>>> 5
>>> enter the number of rows/planes per processor
>>> 4
>>> Enter the value of time step t =
>>> .01
>>> Nx = 26 Ny = 26 Nz = 21 delta t = 0.01 Nt = 1
>>> a = 2171.39 b = -22.5012 c = -22.5012 d = -360.02
>>> This is rank number - 0
>>> rank = 0 Temp = 300 local Nz = 4
>>> rank = 0 Temp = 300 local Nz = 4
>>> rank = 0 Temp = 300 local Nz = 4
>>> rank = 0 Temp = 300 local Nz = 4
>>> Printing from F - rank = 0
>>> 0 printing from exchange_interface_data subroutine
>>> This is rank number - 2
>>> This is rank number - 1
>>> rank = 1 Temp = 300 local Nz = 8
>>> rank = 1 Temp = 300 local Nz = 8
>>> rank = 1 Temp = 300 local Nz = 8
>>> rank = 1 Temp = 300 local Nz = 8
>>> Printing from F - rank = 1
>>> 1 printing from exchange_interface_data subroutine
>>> end of exchange_interface_data subroutine
>>> Printing from Red SOR: rank = 0
>>> rank 0 max error norm is: 0.00610674
>>> rank 0 max error norm is: 0.00291226
>>> rank 0 max error norm is: 0.00138491
>>> rank 0 max error norm is: 0.000656796
>>> rank = 2 Temp = 300 local Nz = 12
>>> rank = 2 Temp = 300 local Nz = 12
>>> rank = 2 Temp = 300 local Nz = 12
>>> rank = 2 Temp = 300 local Nz = 12
>>> Printing from F - rank = 2
>>> rank 0 max error norm is: 0.000310676
>>> rank 0 max error norm is: 0.000146586
>>> rank 0 max error norm is: 6.89966e-05
>>> rank 0 max error norm is: 3.24e-05
>>> rank 0 max error norm is: 1.51802e-05
>>> rank 0 max error norm is: 7.0967e-06
>>> rank 0 max error norm is: 3.31059e-06
>>> rank 0 max error norm is: 1.54118e-06
>>> 2 printing from exchange_interface_data subroutine
>>> rank 0 max error norm is: 7.16008e-07
>>> RED: rank = 0 Total number of iterations performed: 13
>>> 0 printing from exchange_interface_data subroutine
>>> MPI process rank 0 (n0, p4386) caught a SIGSEGV in MPI_Wait.
>>> Rank (0, MPI_COMM_WORLD): Call stack within LAM:
>>> Rank (0, MPI_COMM_WORLD): - MPI_Wait()
>>> Rank (0, MPI_COMM_WORLD): - main()
>>> This is rank number - 3
>>> rank = 3 Temp = 300 local Nz = 16
>>> rank = 3 Temp = 300 local Nz = 16
>>> rank = 3 Temp = 300 local Nz = 16
>>> rank = 3 Temp = 300 local Nz = 16
>>> Printing from F - rank = 3
>>> end of exchange_interface_data subroutine
>>> Printing from Red SOR: rank = 1
>>> rank 1 max error norm is: 0.00165236
>>> rank 1 max error norm is: 0.000787998
>>> rank 1 max error norm is: 0.000374727
>>> rank 1 max error norm is: 0.000177715
>>> rank 1 max error norm is: 8.40624e-05
>>> rank 1 max error norm is: 3.96633e-05
>>> 3 printing from exchange_interface_data subroutine
>>> rank 1 max error norm is: 1.86691e-05
>>> end of exchange_interface_data subroutine
>>> end of exchange_interface_data subroutine
>>> Printing from Red SOR: rank = 2
>>> rank 1 max error norm is: 8.76678e-06
>>> rank 1 max error norm is: 4.10746e-06
>>> rank 2 max error norm is: 0.000447094
>>> rank 1 max error norm is: 1.92022e-06
>>> rank 1 max error norm is: 8.95779e-07
>>> RED: rank = 1 Total number of iterations performed: 11
>>> 1 printing from exchange_interface_data subroutine
>>> rank 2 max error norm is: 0.000213216
>>> MPI process rank 1 (n0, p4387) caught a SIGSEGV in MPI_Wait.
>>> Rank (1, MPI_COMM_WORLD): Call stack within LAM:
>>> Rank (1, MPI_COMM_WORLD): - MPI_Wait()
>>> Rank (1, MPI_COMM_WORLD): - main()
>>> rank 2 max error norm is: 0.000101393
>>> rank 2 max error norm is: 4.80861e-05
>>> rank 2 max error norm is: 2.27456e-05
>>> ---------------------------------------------------------------------------
>>>
>>> One of the processes started by mpirun has exited with a nonzero exit
>>> code. This typically indicates that the process finished in error.
>>> If your process did not finish in error, be sure to include a "return
>>> 0" or "exit(0)" in your C code before exiting the application.
>>>
>>> PID 4386 failed on node n0 with exit status 1.
>>> ---------------------------------------------------------------------------
>>> rank 2 max error norm is: 1.07321e-05
>>> Printing from Red SOR: rank = 3
>>>
>>> *****************************************************************
>>>
>>> another output is:
>>>
>>> ******************************************
>>>
>>> [rrkuma0_at_kfc1s1 SOR]$ mpirun -np 5 foo
>>> Tue Feb 15 14:06:07 2005
>>>
>>> enter the value of physical time (in pico-seconds)
>>> .01
>>> enter the value of space step size in X dierction (nano-meter)
>>> 10
>>> enter the value of space step size in Y dierction(nano-meter)
>>> 10
>>> enter the value of space step size in Z dierction(nano-meter)
>>> 2
>>> enter the number of rows/planes per processor
>>> 10
>>> Enter the value of time step t =
>>> .01
>>> Nx = 51 Ny = 51 Nz = 51 delta t = 0.01 Nt = 1
>>> a = 6221.61 b = -90.005 c = -90.005 d = -2250.12
>>> This is rank number - 0
>>> rank = 0 Temp = 300 local Nz = 10
>>> rank = 0 Temp = 300 local Nz = 10
>>> rank = 0 Temp = 300 local Nz = 10
>>> rank = 0 Temp = 300 local Nz = 10
>>> rank = 0 Temp = 300 local Nz = 10
>>> rank = 0 Temp = 300 local Nz = 10
>>> rank = 0 Temp = 300 local Nz = 10
>>> rank = 0 Temp = 300 local Nz = 10
>>> rank = 0 Temp = 300 local Nz = 10
>>> rank = 0 Temp = 300 local Nz = 10
>>> Printing from F - rank = 0
>>> This is rank number - 1
>>> rank = 1 Temp = 300 local Nz = 20
>>> rank = 1 Temp = 300 local Nz = 20
>>> rank = 1 Temp = 300 local Nz = 20
>>> rank = 1 Temp = 300 local Nz = 20
>>> rank = 1 Temp = 300 local Nz = 20
>>> rank = 1 Temp = 300 local Nz = 20
>>> rank = 1 Temp = 300 local Nz = 20
>>> rank = 1 Temp = 300 local Nz = 20
>>> rank = 1 Temp = 300 local Nz = 20
>>> rank = 1 Temp = 300 local Nz = 20
>>> Printing from F - rank = 1
>>> This is rank number - 2
>>> rank = 2 Temp = 300 local Nz = 30
>>> rank = 2 Temp = 300 local Nz = 30
>>> rank = 2 Temp = 300 local Nz = 30
>>> rank = 2 Temp = 300 local Nz = 30
>>> rank = 2 Temp = 300 local Nz = 30
>>> rank = 2 Temp = 300 local Nz = 30
>>> rank = 2 Temp = 300 local Nz = 30
>>> rank = 2 Temp = 300 local Nz = 30
>>> rank = 2 Temp = 300 local Nz = 30
>>> rank = 2 Temp = 300 local Nz = 30
>>> Printing from F - rank = 2
>>> This is rank number - 4
>>> rank = 4 Temp = 300 local Nz = 51
>>> rank = 4 Temp = 300 local Nz = 51
>>> rank = 4 Temp = 300 local Nz = 51
>>> rank = 4 Temp = 300 local Nz = 51
>>> rank = 4 Temp = 300 local Nz = 51
>>> rank = 4 Temp = 300 local Nz = 51
>>> rank = 4 Temp = 300 local Nz = 51
>>> rank = 4 Temp = 300 local Nz = 51
>>> rank = 4 Temp = 300 local Nz = 51
>>> rank = 4 Temp = 300 local Nz = 51
>>> rank = 4 Temp = 300 local Nz = 51
>>> Printing from F - rank = 4
>>> 2 printing from exchange_interface_data subroutine
>>> This is rank number - 3
>>> rank = 3 Temp = 300 local Nz = 40
>>> rank = 3 Temp = 300 local Nz = 40
>>> rank = 3 Temp = 300 local Nz = 40
>>> rank = 3 Temp = 300 local Nz = 40
>>> rank = 3 Temp = 300 local Nz = 40
>>> rank = 3 Temp = 300 local Nz = 40
>>> rank = 3 Temp = 300 local Nz = 40
>>> rank = 3 Temp = 300 local Nz = 40
>>> rank = 3 Temp = 300 local Nz = 40
>>> rank = 3 Temp = 300 local Nz = 40
>>> Printing from F - rank = 3
>>> 1 printing from exchange_interface_data subroutine
>>> 3 printing from exchange_interface_data subroutine
>>> end of exchange_interface_data subroutine
>>> Printing from Red SOR: rank = 3
>>> end of exchange_interface_data subroutine
>>> Printing from Red SOR: rank = 2
>>> rank 2 max error norm is: 0.000159936
>>> rank 2 max error norm is: 7.79543e-05
>>> 0 printing from exchange_interface_data subroutine
>>> end of exchange_interface_data subroutine
>>> Printing from Red SOR: rank = 1
>>> end of exchange_interface_data subroutine
>>> rank 2 max error norm is: 3.77884e-05
>>> Printing from Red SOR: rank = 0
>>> rank 2 max error norm is: 1.82237e-05
>>> rank 1 max error norm is: 0.000591088
>>> rank 2 max error norm is: 8.74567e-06
>>> rank 0 max error norm is: 0.00218453
>>> rank 2 max error norm is: 4.17761e-06
>>> rank 1 max error norm is: 0.000288101
>>> rank 2 max error norm is: 1.9867e-06
>>> rank 2 max error norm is: 9.40774e-07
>>> RED: rank = 2 Total number of iterations performed: 8
>>> 2 printing from exchange_interface_data subroutine
>>> MPI process rank 2 (n0, p5201) caught a SIGSEGV in MPI_Wait.
>>> Rank (2, MPI_COMM_WORLD): Call stack within LAM:
>>> Rank (2, MPI_COMM_WORLD): - MPI_Wait()
>>> Rank (2, MPI_COMM_WORLD): - main()
>>> 4 printing from exchange_interface_data subroutine
>>> MPI_Send: invalid rank (rank 4, MPI_COMM_WORLD)
>>> Rank (4, MPI_COMM_WORLD): Call stack within LAM:
>>> Rank (4, MPI_COMM_WORLD): - MPI_Send()
>>> Rank (4, MPI_COMM_WORLD): - main()
>>> rank 0 max error norm is: 0.00106476
>>> rank 3 max error norm is: 4.32755e-05
>>> rank 3 max error norm is: 2.10928e-05
>>> rank 1 max error norm is: 0.000139657
>>> ---------------------------------------------------------------------------
>>>
>>> One of the processes started by mpirun has exited with a nonzero exit
>>> code. This typically indicates that the process finished in error.
>>> If your process did not finish in error, be sure to include a "return
>>> 0" or "exit(0)" in your C code before exiting the application.
>>>
>>> PID 5200 failed on node n0 with exit status 1.
>>> ---------------------------------------------------------------------------
>>>
>>> *********************************************
>>>
>>> I have been trying to fix this but could not. If anyone can shed
>>> some light on this, I will be obliged. Please help me out.
>>>
>>> Thanks!
>>>
>>> Ravi R. Kumar
>>> Research Assistant
>>> 318 RGAN, RTL
>>> University of Kentucky
>>> 859 257-6336 x 80697
>>>
>>>
>>>
>>>
>>>
>>>
>>> _______________________________________________
>>> This list is archived at http://www.lam-mpi.org/MailArchives/lam/
>>>
>>
>> --
>> {+} Jeff Squyres
>> {+} jsquyres_at_[hidden]
>> {+} http://www.lam-mpi.org/
>>
>>
>
>

----
Josh Hursey
jjhursey_at_[hidden]
http://www.lam-mpi.org/