Thanks a lot, Jeff!
I was able to fix the first error (invalid rank) by adding appropriate bounds
checks. However, the second error (SIGSEGV in MPI_Wait) still appears. Is it
due to wrong arguments in MPI_Wait?
This is the subroutine for data exchange:
void exchange_interface_data(int rank, int local_Nz, int comm_tag)
{
    int err;
    MPI_Status status;
    MPI_Request request;

    if(rank%2==0 && rank != num_processes-1)
        MPI_Send(&T[1][1][1+rows_per_process*rank], Nx*Ny, MPI_DOUBLE, rank+1,
                 comm_tag+rank, MPI_COMM_WORLD);

    if(rank%2==1 && rank != 0)
        MPI_Recv(&T[1][1][1+rows_per_process*rank], Nx*Ny, MPI_DOUBLE, rank-1,
                 comm_tag+rank-1, MPI_COMM_WORLD, &status);

    MPI_Wait(&request,&status);

    if(err==1)
        cout<<"Error in MPI_Send/Recv"<<endl;

    if(rank%2==1 && rank != 0)
        MPI_Send(&T[1][1][local_Nz], Nx*Ny, MPI_DOUBLE, rank-1,
                 comm_tag+rank+50, MPI_COMM_WORLD);

    if(rank%2==0 && rank != num_processes-1)
        MPI_Recv(&T[1][1][local_Nz], Nx*Ny, MPI_DOUBLE, rank+1,
                 comm_tag+rank+51, MPI_COMM_WORLD, &status);
}
Am I using MPI_Wait correctly? It would be really great if you could
help me.
Thanks a lot!
Ravi R. Kumar
Research Assistant
318 RGAN, RTL
University of Kentucky
(859) 257-6336 x 80697
Quoting Jeff Squyres <jsquyres_at_[hidden]>:
> It's hard to debug someone else's application, particularly one as
> complex as this, especially without seeing the entire application.
> However, I have a few suggestions for you:
>
> 1. It seems that the outputs contain two errors: invalid rank in
> MPI_SEND and segv in MPI_WAIT. The invalid rank should be pretty easy
> to track down. I notice that one of your MPI_SEND's goes to rank+1,
> but you don't do any bounds checking to ensure that rank+1 is a valid
> rank in MPI_COMM_WORLD (e.g., if you had an odd number of processes).
>
> 2. A seg fault in MPI_WAIT is typically (but not always) a symptom of
> memory badness elsewhere in the application (e.g., a buffer overflow).
> I highly suggest that you run your application through a
> memory-checking debugger (such as Valgrind, if you're running on
> x86/Linux) to see what it can find for you. See the LAM FAQ for
> details on how to do this.
>
> As a final suggestion, unless you're simply trying to segregate your
> debugging output, I'd remove the calls to MPI_BARRIER -- they don't
> seem to serve any purpose.
>
>
>
> On Feb 15, 2005, at 2:07 PM, Kumar, Ravi Ranjan wrote:
>
> > Hello,
> >
> > I have been trying to fix an error that might be due to MPI_Send or
> > MPI_Recv. I am trying to implement Red-Black SOR for solving a 3-D
> > heat conduction problem, which requires a parallel solution of the
> > system of linear equations At=F.
> >
> > A is a 7-banded coefficient matrix stored in an N x 7 2-D array; t
> > represents the temperature field for the 3-D domain and is therefore
> > stored in a 3-D array. F is a vector (single-column matrix).
> >
> > I divided the cuboidal piece along its thickness and assigned each
> > slice to a processor. Within a slice, red and black planes are defined
> > one after another. Data needs to be exchanged between adjacent slices
> > to achieve a parallel solution of the above problem.
> >
> > Below is the code and subroutine I am using:
> >
> > -----------------------------------------------------------
> > for(n=1; n<=Nt; n++)
> > {
> >
> > cout<<"This is rank number - "<<rank<<endl;
> >
> > if(rank != num_processes-1) local_Nz = rows_per_process*(1+rank);
> > else local_Nz = Nz;
> >
> > for(k=1+rank*rows_per_process; k<=local_Nz; k++)
> > cout<<"rank = "<<rank<<" Temp = "<<T[temp1][temp2][k]
> >     <<" local Nz = "<<local_Nz<<endl;
> >
> >
> > calculate_F(rank, local_Nz);
> >
> > comm_tag = n;
> > exchange_interface_data(rank, local_Nz, comm_tag);
> >
> > Red_SOR(A, F, T, rank, local_Nz);
> >
> > comm_tag = n+1;
> > exchange_interface_data(rank, local_Nz, comm_tag);
> >
> > Black_SOR(A, F, T, rank, local_Nz);
> >
> > for(k=1+rank*rows_per_process; k<=local_Nz; k++)
> > for(j=1; j<=Ny; j++)
> > for(i=1; i<=Nx; i++)
> > u[i][j][k] = (1 + 2*Tq/t) * T[i][j][k] + (1 - 2*Tq/t) * old_T[i][j][k]
> >            - u[i][j][k];
> >
> >
> > for(k=1+rank*rows_per_process; k<=local_Nz; k++)
> > for(j=1; j<=Ny; j++)
> > for(i=1; i<=Nx; i++)
> > old_T[i][j][k] = T[i][j][k];
> >
> >
> > cout<<rank<<" prints value of F[1] = "<<F[1]<<" m = "<<m<<endl;
> >
> > MPI_Barrier(MPI_COMM_WORLD);
> >
> >
> > if(rank == 0)
> > {
> > outfile_SOR.precision(20);
> > outfile_SOR<<setw(20)<<t*n<<" "<<setw(20)<<T[1][1][1]<<" "<<endl;
> > cout<<"n = "<<n<<" Nt = "<<Nt<<endl;
> > }
> >
> > for(k=1+rank*rows_per_process; k<=local_Nz; k++)
> > cout<<"rank = "<<rank<<" Temp = "<<T[temp1][temp2][k]
> >     <<" local Nz = "<<local_Nz<<endl;
> >
> > MPI_Barrier(MPI_COMM_WORLD);
> >
> > }
> >
> > -------------------------------------------------------
> >
> > and the subroutine for data exchange is as follows:
> >
> > -------------------------------------------------------------
> >
> > void exchange_interface_data(int rank, int local_Nz, int comm_tag)
> > {
> >
> > int err;
> > MPI_Status status;
> > MPI_Request request;
> >
> > cout<<rank<<" printing from exchange_interface_data subroutine"<<endl;
> > if(rank%2==0)
> > MPI_Send(&T[1][1][1+rows_per_process*rank], Nx*Ny, MPI_DOUBLE, rank+1,
> >          comm_tag+rank, MPI_COMM_WORLD);
> >
> > if(rank%2==1)
> > MPI_Recv(&T[1][1][1+rows_per_process*rank], Nx*Ny, MPI_DOUBLE, rank-1,
> >          comm_tag+rank-1, MPI_COMM_WORLD, &status);
> >
> >
> > MPI_Wait(&request,&status);
> >
> > if(err==1)
> > {
> > cout<<"Error in MPI_Send/Recv"<<endl;
> >
> > }
> >
> > if(rank%2==1)
> > MPI_Send(&T[1][1][local_Nz], Nx*Ny, MPI_DOUBLE, rank-1,
> > comm_tag+rank+50, MPI_COMM_WORLD);
> >
> > if(rank%2==0)
> > MPI_Recv(&T[1][1][local_Nz], Nx*Ny, MPI_DOUBLE, rank+1,
> > comm_tag+rank+51, MPI_COMM_WORLD, &status);
> >
> >
> > cout<<"end of exchange_interface_data subroutine"<<endl;
> > }
> >
> > ----------------------------------------------
> >
> > these are the outputs for two different cases:
> >
> > *******************************************************
> > [rrkuma0_at_kfc1s1 SOR]$ mpirun -np 5 foo
> > Mon Feb 14 23:37:07 2005
> >
> > enter the value of physical time (in pico-seconds)
> > .01
> > enter the value of space step size in X dierction (nano-meter)
> > 20
> > enter the value of space step size in Y dierction(nano-meter)
> > 20
> > enter the value of space step size in Z dierction(nano-meter)
> > 5
> > enter the number of rows/planes per processor
> > 4
> > Enter the value of time step t =
> > .01
> > Nx = 26 Ny = 26 Nz = 21 delta t = 0.01 Nt = 1
> > a = 2171.39 b = -22.5012 c = -22.5012 d = -360.02
> > This is rank number - 0
> > rank = 0 Temp = 300 local Nz = 4
> > rank = 0 Temp = 300 local Nz = 4
> > rank = 0 Temp = 300 local Nz = 4
> > rank = 0 Temp = 300 local Nz = 4
> > Printing from F - rank = 0
> > 0 printing from exchange_interface_data subroutine
> > This is rank number - 2
> > This is rank number - 1
> > rank = 1 Temp = 300 local Nz = 8
> > rank = 1 Temp = 300 local Nz = 8
> > rank = 1 Temp = 300 local Nz = 8
> > rank = 1 Temp = 300 local Nz = 8
> > Printing from F - rank = 1
> > 1 printing from exchange_interface_data subroutine
> > end of exchange_interface_data subroutine
> > Printing from Red SOR: rank = 0
> > rank 0 max error norm is: 0.00610674
> > rank 0 max error norm is: 0.00291226
> > rank 0 max error norm is: 0.00138491
> > rank 0 max error norm is: 0.000656796
> > rank = 2 Temp = 300 local Nz = 12
> > rank = 2 Temp = 300 local Nz = 12
> > rank = 2 Temp = 300 local Nz = 12
> > rank = 2 Temp = 300 local Nz = 12
> > Printing from F - rank = 2
> > rank 0 max error norm is: 0.000310676
> > rank 0 max error norm is: 0.000146586
> > rank 0 max error norm is: 6.89966e-05
> > rank 0 max error norm is: 3.24e-05
> > rank 0 max error norm is: 1.51802e-05
> > rank 0 max error norm is: 7.0967e-06
> > rank 0 max error norm is: 3.31059e-06
> > rank 0 max error norm is: 1.54118e-06
> > 2 printing from exchange_interface_data subroutine
> > rank 0 max error norm is: 7.16008e-07
> > RED: rank = 0 Total number of iterations performed: 13
> > 0 printing from exchange_interface_data subroutine
> > MPI process rank 0 (n0, p4386) caught a SIGSEGV in MPI_Wait.
> > Rank (0, MPI_COMM_WORLD): Call stack within LAM:
> > Rank (0, MPI_COMM_WORLD): - MPI_Wait()
> > Rank (0, MPI_COMM_WORLD): - main()
> > This is rank number - 3
> > rank = 3 Temp = 300 local Nz = 16
> > rank = 3 Temp = 300 local Nz = 16
> > rank = 3 Temp = 300 local Nz = 16
> > rank = 3 Temp = 300 local Nz = 16
> > Printing from F - rank = 3
> > end of exchange_interface_data subroutine
> > Printing from Red SOR: rank = 1
> > rank 1 max error norm is: 0.00165236
> > rank 1 max error norm is: 0.000787998
> > rank 1 max error norm is: 0.000374727
> > rank 1 max error norm is: 0.000177715
> > rank 1 max error norm is: 8.40624e-05
> > rank 1 max error norm is: 3.96633e-05
> > 3 printing from exchange_interface_data subroutine
> > rank 1 max error norm is: 1.86691e-05
> > end of exchange_interface_data subroutine
> > end of exchange_interface_data subroutine
> > Printing from Red SOR: rank = 2
> > rank 1 max error norm is: 8.76678e-06
> > rank 1 max error norm is: 4.10746e-06
> > rank 2 max error norm is: 0.000447094
> > rank 1 max error norm is: 1.92022e-06
> > rank 1 max error norm is: 8.95779e-07
> > RED: rank = 1 Total number of iterations performed: 11
> > 1 printing from exchange_interface_data subroutine
> > rank 2 max error norm is: 0.000213216
> > MPI process rank 1 (n0, p4387) caught a SIGSEGV in MPI_Wait.
> > Rank (1, MPI_COMM_WORLD): Call stack within LAM:
> > Rank (1, MPI_COMM_WORLD): - MPI_Wait()
> > Rank (1, MPI_COMM_WORLD): - main()
> > rank 2 max error norm is: 0.000101393
> > rank 2 max error norm is: 4.80861e-05
> > rank 2 max error norm is: 2.27456e-05
> > -----------------------------------------------------------------------------
> >
> > One of the processes started by mpirun has exited with a nonzero exit
> > code. This typically indicates that the process finished in error.
> > If your process did not finish in error, be sure to include a "return
> > 0" or "exit(0)" in your C code before exiting the application.
> >
> > PID 4386 failed on node n0 with exit status 1.
> > -----------------------------------------------------------------------------
> > rank 2 max error norm is: 1.07321e-05
> > Printing from Red SOR: rank = 3
> >
> > *****************************************************************
> >
> > another output is:
> >
> > ******************************************
> >
> > [rrkuma0_at_kfc1s1 SOR]$ mpirun -np 5 foo
> > Tue Feb 15 14:06:07 2005
> >
> > enter the value of physical time (in pico-seconds)
> > .01
> > enter the value of space step size in X dierction (nano-meter)
> > 10
> > enter the value of space step size in Y dierction(nano-meter)
> > 10
> > enter the value of space step size in Z dierction(nano-meter)
> > 2
> > enter the number of rows/planes per processor
> > 10
> > Enter the value of time step t =
> > .01
> > Nx = 51 Ny = 51 Nz = 51 delta t = 0.01 Nt = 1
> > a = 6221.61 b = -90.005 c = -90.005 d = -2250.12
> > This is rank number - 0
> > rank = 0 Temp = 300 local Nz = 10
> > rank = 0 Temp = 300 local Nz = 10
> > rank = 0 Temp = 300 local Nz = 10
> > rank = 0 Temp = 300 local Nz = 10
> > rank = 0 Temp = 300 local Nz = 10
> > rank = 0 Temp = 300 local Nz = 10
> > rank = 0 Temp = 300 local Nz = 10
> > rank = 0 Temp = 300 local Nz = 10
> > rank = 0 Temp = 300 local Nz = 10
> > rank = 0 Temp = 300 local Nz = 10
> > Printing from F - rank = 0
> > This is rank number - 1
> > rank = 1 Temp = 300 local Nz = 20
> > rank = 1 Temp = 300 local Nz = 20
> > rank = 1 Temp = 300 local Nz = 20
> > rank = 1 Temp = 300 local Nz = 20
> > rank = 1 Temp = 300 local Nz = 20
> > rank = 1 Temp = 300 local Nz = 20
> > rank = 1 Temp = 300 local Nz = 20
> > rank = 1 Temp = 300 local Nz = 20
> > rank = 1 Temp = 300 local Nz = 20
> > rank = 1 Temp = 300 local Nz = 20
> > Printing from F - rank = 1
> > This is rank number - 2
> > rank = 2 Temp = 300 local Nz = 30
> > rank = 2 Temp = 300 local Nz = 30
> > rank = 2 Temp = 300 local Nz = 30
> > rank = 2 Temp = 300 local Nz = 30
> > rank = 2 Temp = 300 local Nz = 30
> > rank = 2 Temp = 300 local Nz = 30
> > rank = 2 Temp = 300 local Nz = 30
> > rank = 2 Temp = 300 local Nz = 30
> > rank = 2 Temp = 300 local Nz = 30
> > rank = 2 Temp = 300 local Nz = 30
> > Printing from F - rank = 2
> > This is rank number - 4
> > rank = 4 Temp = 300 local Nz = 51
> > rank = 4 Temp = 300 local Nz = 51
> > rank = 4 Temp = 300 local Nz = 51
> > rank = 4 Temp = 300 local Nz = 51
> > rank = 4 Temp = 300 local Nz = 51
> > rank = 4 Temp = 300 local Nz = 51
> > rank = 4 Temp = 300 local Nz = 51
> > rank = 4 Temp = 300 local Nz = 51
> > rank = 4 Temp = 300 local Nz = 51
> > rank = 4 Temp = 300 local Nz = 51
> > rank = 4 Temp = 300 local Nz = 51
> > Printing from F - rank = 4
> > 2 printing from exchange_interface_data subroutine
> > This is rank number - 3
> > rank = 3 Temp = 300 local Nz = 40
> > rank = 3 Temp = 300 local Nz = 40
> > rank = 3 Temp = 300 local Nz = 40
> > rank = 3 Temp = 300 local Nz = 40
> > rank = 3 Temp = 300 local Nz = 40
> > rank = 3 Temp = 300 local Nz = 40
> > rank = 3 Temp = 300 local Nz = 40
> > rank = 3 Temp = 300 local Nz = 40
> > rank = 3 Temp = 300 local Nz = 40
> > rank = 3 Temp = 300 local Nz = 40
> > Printing from F - rank = 3
> > 1 printing from exchange_interface_data subroutine
> > 3 printing from exchange_interface_data subroutine
> > end of exchange_interface_data subroutine
> > Printing from Red SOR: rank = 3
> > end of exchange_interface_data subroutine
> > Printing from Red SOR: rank = 2
> > rank 2 max error norm is: 0.000159936
> > rank 2 max error norm is: 7.79543e-05
> > 0 printing from exchange_interface_data subroutine
> > end of exchange_interface_data subroutine
> > Printing from Red SOR: rank = 1
> > end of exchange_interface_data subroutine
> > rank 2 max error norm is: 3.77884e-05
> > Printing from Red SOR: rank = 0
> > rank 2 max error norm is: 1.82237e-05
> > rank 1 max error norm is: 0.000591088
> > rank 2 max error norm is: 8.74567e-06
> > rank 0 max error norm is: 0.00218453
> > rank 2 max error norm is: 4.17761e-06
> > rank 1 max error norm is: 0.000288101
> > rank 2 max error norm is: 1.9867e-06
> > rank 2 max error norm is: 9.40774e-07
> > RED: rank = 2 Total number of iterations performed: 8
> > 2 printing from exchange_interface_data subroutine
> > MPI process rank 2 (n0, p5201) caught a SIGSEGV in MPI_Wait.
> > Rank (2, MPI_COMM_WORLD): Call stack within LAM:
> > Rank (2, MPI_COMM_WORLD): - MPI_Wait()
> > Rank (2, MPI_COMM_WORLD): - main()
> > 4 printing from exchange_interface_data subroutine
> > MPI_Send: invalid rank (rank 4, MPI_COMM_WORLD)
> > Rank (4, MPI_COMM_WORLD): Call stack within LAM:
> > Rank (4, MPI_COMM_WORLD): - MPI_Send()
> > Rank (4, MPI_COMM_WORLD): - main()
> > rank 0 max error norm is: 0.00106476
> > rank 3 max error norm is: 4.32755e-05
> > rank 3 max error norm is: 2.10928e-05
> > rank 1 max error norm is: 0.000139657
> > -----------------------------------------------------------------------------
> >
> > One of the processes started by mpirun has exited with a nonzero exit
> > code. This typically indicates that the process finished in error.
> > If your process did not finish in error, be sure to include a "return
> > 0" or "exit(0)" in your C code before exiting the application.
> >
> > PID 5200 failed on node n0 with exit status 1.
> > -----------------------------------------------------------------------------
> >
> > *********************************************
> >
> > I have been trying to fix this but could not. If anyone can shed some
> > light on this, I would be obliged. Please help me out.
> >
> > Thanks!
> >
> > Ravi R. Kumar
> > Research Assistant
> > 318 RGAN, RTL
> > University of Kentucky
> > 859 257-6336 x 80697
> >
> > _______________________________________________
> > This list is archived at http://www.lam-mpi.org/MailArchives/lam/
> >
>
> --
> {+} Jeff Squyres
> {+} jsquyres_at_[hidden]
> {+} http://www.lam-mpi.org/