It's hard to debug someone else's application, particularly one as
complex as this, especially without seeing the entire application.
However, I have a few suggestions for you:

1. The output shows two errors: an invalid rank in MPI_SEND and a segv
in MPI_WAIT. The invalid rank should be pretty easy to track down. I
notice that one of your MPI_SEND calls goes to rank+1, but you don't do
any bounds checking to ensure that rank+1 is a valid rank in
MPI_COMM_WORLD (e.g., if you had an odd number of processes, the last
rank is even and rank+1 does not exist).

2. A seg fault in MPI_WAIT is typically (but not always) a symptom of
memory corruption elsewhere in the application (e.g., a buffer
overflow). I highly suggest that you run your application through a
memory-checking debugger (such as Valgrind, if you're running on
x86/Linux) to see what it can find for you. See the LAM FAQ for
details on how to do this.

As a final suggestion, unless you're simply trying to segregate your
debugging output, I'd remove the calls to MPI_BARRIER -- they don't
seem to serve any purpose.
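
To make the bounds check from point 1 concrete, here is a minimal sketch in plain C++ with no MPI calls, so it can be compiled on its own. The names neighbor_rank and kNoNeighbor are hypothetical and do not come from your code; kNoNeighbor only stands in for "no such rank". In a real MPI program you could pass MPI_PROC_NULL as the peer instead, since a send or receive with MPI_PROC_NULL completes immediately without transferring anything.

```cpp
// Hypothetical sentinel meaning "no such rank"; in real MPI code,
// MPI_PROC_NULL could play this role.
const int kNoNeighbor = -1;

// Return rank + offset if that is a valid rank among num_processes
// ranks (0 .. num_processes-1); otherwise return kNoNeighbor.
int neighbor_rank(int rank, int offset, int num_processes)
{
    int peer = rank + offset;
    if (peer < 0 || peer >= num_processes)
        return kNoNeighbor;
    return peer;
}

// Peer for the first exchange step in the posted pattern:
// even ranks send to rank+1, odd ranks receive from rank-1.
int first_exchange_peer(int rank, int num_processes)
{
    int offset = (rank % 2 == 0) ? +1 : -1;
    return neighbor_rank(rank, offset, num_processes);
}
```

With 5 processes, first_exchange_peer(4, 5) returns kNoNeighbor, which corresponds to the "invalid rank" reported for rank 4 in your second run; the caller would then skip the exchange for that rank rather than send to rank 5.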
On Feb 15, 2005, at 2:07 PM, Kumar, Ravi Ranjan wrote:
> Hello,
>
> I have been trying to fix an error that might be due to MPI_Send or
> MPI_Recv. I am trying to implement Red-Black SOR for solving a 3-D
> heat conduction problem, which requires a parallel solution of the
> system of linear equations At=F.
>
> A is a 7-banded coefficient matrix stored in an N x 7 2-D array, t
> represents the temperature field for the 3-D domain and is hence
> stored in a 3-D array, and F is a vector (single-column matrix).
>
> I divided the cuboidal piece along its thickness and assigned each
> slice to a processor. Within a slice, red and black planes are defined
> one after another. Data needs to be exchanged between adjacent slices
> to achieve a parallel solution of the above problem.
>
> Below are the code and subroutine I am using:
>
> -----------------------------------------------------------
> for(n=1; n<=Nt; n++)
> {
>
> cout<<"This is rank number - "<<rank<<endl;
>
> if(rank != num_processes-1) local_Nz = rows_per_process*(1+rank);
> else local_Nz = Nz;
>
> for(k=1+rank*rows_per_process; k<=local_Nz; k++)
> cout<<"rank = "<<rank<<" Temp = "<<T[temp1][temp2][k]<<" local Nz = "<<local_Nz<<endl;
>
>
> calculate_F(rank, local_Nz);
>
> comm_tag = n;
> exchange_interface_data(rank, local_Nz, comm_tag);
>
> Red_SOR(A, F, T, rank, local_Nz);
>
> comm_tag = n+1;
> exchange_interface_data(rank, local_Nz, comm_tag);
>
> Black_SOR(A, F, T, rank, local_Nz);
>
> for(k=1+rank*rows_per_process; k<=local_Nz; k++)
> for(j=1; j<=Ny; j++)
> for(i=1; i<=Nx; i++)
> u[i][j][k] = (1 + 2*Tq/t) * T[i][j][k] + (1 - 2*Tq/t) * old_T[i][j][k] - u[i][j][k];
>
>
> for(k=1+rank*rows_per_process; k<=local_Nz; k++)
> for(j=1; j<=Ny; j++)
> for(i=1; i<=Nx; i++)
> old_T[i][j][k] = T[i][j][k];
>
>
> cout<<rank<<" prints value of F[1] = "<<F[1]<<" m = "<<m<<endl;
>
> MPI_Barrier(MPI_COMM_WORLD);
>
>
> if(rank == 0)
> {
> outfile_SOR.precision(20);
> outfile_SOR<<setw(20)<<t*n<<" "<<setw(20)<<T[1][1][1]<<" "<<endl;
> cout<<"n = "<<n<<" Nt = "<<Nt<<endl;
> }
>
> for(k=1+rank*rows_per_process; k<=local_Nz; k++)
> cout<<"rank = "<<rank<<" Temp = "<<T[temp1][temp2][k]<<" local Nz = "<<local_Nz<<endl;
>
> MPI_Barrier(MPI_COMM_WORLD);
>
> }
>
> -------------------------------------------------------
>
> The subroutine for data exchange is as follows:
>
> -------------------------------------------------------------
>
> void exchange_interface_data(int rank, int local_Nz, int comm_tag)
> {
>
> int err;
> MPI_Status status;
> MPI_Request request;
>
> cout<<rank<<" printing from exchange_interface_data subroutine"<<endl;
> if(rank%2==0)
>   MPI_Send(&T[1][1][1+rows_per_process*rank], Nx*Ny, MPI_DOUBLE, rank+1,
>            comm_tag+rank, MPI_COMM_WORLD);
>
> if(rank%2==1)
>   MPI_Recv(&T[1][1][1+rows_per_process*rank], Nx*Ny, MPI_DOUBLE, rank-1,
>            comm_tag+rank-1, MPI_COMM_WORLD, &status);
>
>
> MPI_Wait(&request,&status);
>
> if(err==1)
> {
> cout<<"Error in MPI_Send/Recv"<<endl;
>
> }
>
> if(rank%2==1)
>   MPI_Send(&T[1][1][local_Nz], Nx*Ny, MPI_DOUBLE, rank-1,
>            comm_tag+rank+50, MPI_COMM_WORLD);
>
> if(rank%2==0)
>   MPI_Recv(&T[1][1][local_Nz], Nx*Ny, MPI_DOUBLE, rank+1,
>            comm_tag+rank+51, MPI_COMM_WORLD, &status);
>
>
> cout<<"end of exchange_interface_data subroutine"<<endl;
> }
>
> ----------------------------------------------
>
> these are the outputs for two different cases:
>
> *******************************************************
> [rrkuma0_at_kfc1s1 SOR]$ mpirun -np 5 foo
> Mon Feb 14 23:37:07 2005
>
> enter the value of physical time (in pico-seconds)
> .01
> enter the value of space step size in X dierction (nano-meter)
> 20
> enter the value of space step size in Y dierction(nano-meter)
> 20
> enter the value of space step size in Z dierction(nano-meter)
> 5
> enter the number of rows/planes per processor
> 4
> Enter the value of time step t =
> .01
> Nx = 26 Ny = 26 Nz = 21 delta t = 0.01 Nt = 1
> a = 2171.39 b = -22.5012 c = -22.5012 d = -360.02
> This is rank number - 0
> rank = 0 Temp = 300 local Nz = 4
> rank = 0 Temp = 300 local Nz = 4
> rank = 0 Temp = 300 local Nz = 4
> rank = 0 Temp = 300 local Nz = 4
> Printing from F - rank = 0
> 0 printing from exchange_interface_data subroutine
> This is rank number - 2
> This is rank number - 1
> rank = 1 Temp = 300 local Nz = 8
> rank = 1 Temp = 300 local Nz = 8
> rank = 1 Temp = 300 local Nz = 8
> rank = 1 Temp = 300 local Nz = 8
> Printing from F - rank = 1
> 1 printing from exchange_interface_data subroutine
> end of exchange_interface_data subroutine
> Printing from Red SOR: rank = 0
> rank 0 max error norm is: 0.00610674
> rank 0 max error norm is: 0.00291226
> rank 0 max error norm is: 0.00138491
> rank 0 max error norm is: 0.000656796
> rank = 2 Temp = 300 local Nz = 12
> rank = 2 Temp = 300 local Nz = 12
> rank = 2 Temp = 300 local Nz = 12
> rank = 2 Temp = 300 local Nz = 12
> Printing from F - rank = 2
> rank 0 max error norm is: 0.000310676
> rank 0 max error norm is: 0.000146586
> rank 0 max error norm is: 6.89966e-05
> rank 0 max error norm is: 3.24e-05
> rank 0 max error norm is: 1.51802e-05
> rank 0 max error norm is: 7.0967e-06
> rank 0 max error norm is: 3.31059e-06
> rank 0 max error norm is: 1.54118e-06
> 2 printing from exchange_interface_data subroutine
> rank 0 max error norm is: 7.16008e-07
> RED: rank = 0 Total number of iterations performed: 13
> 0 printing from exchange_interface_data subroutine
> MPI process rank 0 (n0, p4386) caught a SIGSEGV in MPI_Wait.
> Rank (0, MPI_COMM_WORLD): Call stack within LAM:
> Rank (0, MPI_COMM_WORLD): - MPI_Wait()
> Rank (0, MPI_COMM_WORLD): - main()
> This is rank number - 3
> rank = 3 Temp = 300 local Nz = 16
> rank = 3 Temp = 300 local Nz = 16
> rank = 3 Temp = 300 local Nz = 16
> rank = 3 Temp = 300 local Nz = 16
> Printing from F - rank = 3
> end of exchange_interface_data subroutine
> Printing from Red SOR: rank = 1
> rank 1 max error norm is: 0.00165236
> rank 1 max error norm is: 0.000787998
> rank 1 max error norm is: 0.000374727
> rank 1 max error norm is: 0.000177715
> rank 1 max error norm is: 8.40624e-05
> rank 1 max error norm is: 3.96633e-05
> 3 printing from exchange_interface_data subroutine
> rank 1 max error norm is: 1.86691e-05
> end of exchange_interface_data subroutine
> end of exchange_interface_data subroutine
> Printing from Red SOR: rank = 2
> rank 1 max error norm is: 8.76678e-06
> rank 1 max error norm is: 4.10746e-06
> rank 2 max error norm is: 0.000447094
> rank 1 max error norm is: 1.92022e-06
> rank 1 max error norm is: 8.95779e-07
> RED: rank = 1 Total number of iterations performed: 11
> 1 printing from exchange_interface_data subroutine
> rank 2 max error norm is: 0.000213216
> MPI process rank 1 (n0, p4387) caught a SIGSEGV in MPI_Wait.
> Rank (1, MPI_COMM_WORLD): Call stack within LAM:
> Rank (1, MPI_COMM_WORLD): - MPI_Wait()
> Rank (1, MPI_COMM_WORLD): - main()
> rank 2 max error norm is: 0.000101393
> rank 2 max error norm is: 4.80861e-05
> rank 2 max error norm is: 2.27456e-05
> -----------------------------------------------------------------------------
>
> One of the processes started by mpirun has exited with a nonzero exit
> code. This typically indicates that the process finished in error.
> If your process did not finish in error, be sure to include a "return
> 0" or "exit(0)" in your C code before exiting the application.
>
> PID 4386 failed on node n0 with exit status 1.
> -----------------------------------------------------------------------------
> rank 2 max error norm is: 1.07321e-05
> Printing from Red SOR: rank = 3
>
> *****************************************************************
>
> another output is:
>
> ******************************************
>
> [rrkuma0_at_kfc1s1 SOR]$ mpirun -np 5 foo
> Tue Feb 15 14:06:07 2005
>
> enter the value of physical time (in pico-seconds)
> .01
> enter the value of space step size in X dierction (nano-meter)
> 10
> enter the value of space step size in Y dierction(nano-meter)
> 10
> enter the value of space step size in Z dierction(nano-meter)
> 2
> enter the number of rows/planes per processor
> 10
> Enter the value of time step t =
> .01
> Nx = 51 Ny = 51 Nz = 51 delta t = 0.01 Nt = 1
> a = 6221.61 b = -90.005 c = -90.005 d = -2250.12
> This is rank number - 0
> rank = 0 Temp = 300 local Nz = 10
> rank = 0 Temp = 300 local Nz = 10
> rank = 0 Temp = 300 local Nz = 10
> rank = 0 Temp = 300 local Nz = 10
> rank = 0 Temp = 300 local Nz = 10
> rank = 0 Temp = 300 local Nz = 10
> rank = 0 Temp = 300 local Nz = 10
> rank = 0 Temp = 300 local Nz = 10
> rank = 0 Temp = 300 local Nz = 10
> rank = 0 Temp = 300 local Nz = 10
> Printing from F - rank = 0
> This is rank number - 1
> rank = 1 Temp = 300 local Nz = 20
> rank = 1 Temp = 300 local Nz = 20
> rank = 1 Temp = 300 local Nz = 20
> rank = 1 Temp = 300 local Nz = 20
> rank = 1 Temp = 300 local Nz = 20
> rank = 1 Temp = 300 local Nz = 20
> rank = 1 Temp = 300 local Nz = 20
> rank = 1 Temp = 300 local Nz = 20
> rank = 1 Temp = 300 local Nz = 20
> rank = 1 Temp = 300 local Nz = 20
> Printing from F - rank = 1
> This is rank number - 2
> rank = 2 Temp = 300 local Nz = 30
> rank = 2 Temp = 300 local Nz = 30
> rank = 2 Temp = 300 local Nz = 30
> rank = 2 Temp = 300 local Nz = 30
> rank = 2 Temp = 300 local Nz = 30
> rank = 2 Temp = 300 local Nz = 30
> rank = 2 Temp = 300 local Nz = 30
> rank = 2 Temp = 300 local Nz = 30
> rank = 2 Temp = 300 local Nz = 30
> rank = 2 Temp = 300 local Nz = 30
> Printing from F - rank = 2
> This is rank number - 4
> rank = 4 Temp = 300 local Nz = 51
> rank = 4 Temp = 300 local Nz = 51
> rank = 4 Temp = 300 local Nz = 51
> rank = 4 Temp = 300 local Nz = 51
> rank = 4 Temp = 300 local Nz = 51
> rank = 4 Temp = 300 local Nz = 51
> rank = 4 Temp = 300 local Nz = 51
> rank = 4 Temp = 300 local Nz = 51
> rank = 4 Temp = 300 local Nz = 51
> rank = 4 Temp = 300 local Nz = 51
> rank = 4 Temp = 300 local Nz = 51
> Printing from F - rank = 4
> 2 printing from exchange_interface_data subroutine
> This is rank number - 3
> rank = 3 Temp = 300 local Nz = 40
> rank = 3 Temp = 300 local Nz = 40
> rank = 3 Temp = 300 local Nz = 40
> rank = 3 Temp = 300 local Nz = 40
> rank = 3 Temp = 300 local Nz = 40
> rank = 3 Temp = 300 local Nz = 40
> rank = 3 Temp = 300 local Nz = 40
> rank = 3 Temp = 300 local Nz = 40
> rank = 3 Temp = 300 local Nz = 40
> rank = 3 Temp = 300 local Nz = 40
> Printing from F - rank = 3
> 1 printing from exchange_interface_data subroutine
> 3 printing from exchange_interface_data subroutine
> end of exchange_interface_data subroutine
> Printing from Red SOR: rank = 3
> end of exchange_interface_data subroutine
> Printing from Red SOR: rank = 2
> rank 2 max error norm is: 0.000159936
> rank 2 max error norm is: 7.79543e-05
> 0 printing from exchange_interface_data subroutine
> end of exchange_interface_data subroutine
> Printing from Red SOR: rank = 1
> end of exchange_interface_data subroutine
> rank 2 max error norm is: 3.77884e-05
> Printing from Red SOR: rank = 0
> rank 2 max error norm is: 1.82237e-05
> rank 1 max error norm is: 0.000591088
> rank 2 max error norm is: 8.74567e-06
> rank 0 max error norm is: 0.00218453
> rank 2 max error norm is: 4.17761e-06
> rank 1 max error norm is: 0.000288101
> rank 2 max error norm is: 1.9867e-06
> rank 2 max error norm is: 9.40774e-07
> RED: rank = 2 Total number of iterations performed: 8
> 2 printing from exchange_interface_data subroutine
> MPI process rank 2 (n0, p5201) caught a SIGSEGV in MPI_Wait.
> Rank (2, MPI_COMM_WORLD): Call stack within LAM:
> Rank (2, MPI_COMM_WORLD): - MPI_Wait()
> Rank (2, MPI_COMM_WORLD): - main()
> 4 printing from exchange_interface_data subroutine
> MPI_Send: invalid rank (rank 4, MPI_COMM_WORLD)
> Rank (4, MPI_COMM_WORLD): Call stack within LAM:
> Rank (4, MPI_COMM_WORLD): - MPI_Send()
> Rank (4, MPI_COMM_WORLD): - main()
> rank 0 max error norm is: 0.00106476
> rank 3 max error norm is: 4.32755e-05
> rank 3 max error norm is: 2.10928e-05
> rank 1 max error norm is: 0.000139657
> -----------------------------------------------------------------------------
>
> One of the processes started by mpirun has exited with a nonzero exit
> code. This typically indicates that the process finished in error.
> If your process did not finish in error, be sure to include a "return
> 0" or "exit(0)" in your C code before exiting the application.
>
> PID 5200 failed on node n0 with exit status 1.
> -----------------------------------------------------------------------------
>
> *********************************************
>
> I have been trying to fix this but could not. If anyone can shed some
> light on this, I will be obliged. Please help me out.
>
> Thanks!
>
> Ravi R. Kumar
> Research Assistant
> 318 RGAN, RTL
> University of Kentucky
> 859 257-6336 x 80697
--
{+} Jeff Squyres
{+} jsquyres_at_[hidden]
{+} http://www.lam-mpi.org/