LAM/MPI General User's Mailing List Archives

From: Kumar, Ravi Ranjan (rrkuma0_at_[hidden])
Date: 2005-02-17 12:56:41


Thanks a lot for pointing out my mistakes. Now my code is running error free.
I am a beginner, so I am likely to make this type of small mistake. Thanks for
helping me out!

Ravi R. Kumar

Quoting Josh Hursey <jjhursey_at_[hidden]>:

> When you call:
> MPI_Wait(&request,&status);
>
> 'request' needs to be initialized by a non-blocking send (such as
> MPI_Isend) or receive (such as MPI_Irecv). In the code below you have a
> basic MPI_Send, which doesn't interact with MPI_Requests, so the SIGSEGV
> is coming from the uninitialized argument to MPI_Wait.
>
> Take a look at the man pages for MPI_Send and MPI_Isend for the
> differences, but the MPI_Wait can be taken out of the code below to get
> it working again.
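>
> For example, a minimal sketch of the non-blocking pattern, reusing your
> variable names (illustration only, not a drop-in replacement):
>
>     MPI_Request request;
>     MPI_Status  status;
>
>     if (rank % 2 == 0 && rank != num_processes - 1) {
>         /* start the send; 'request' is initialized here */
>         MPI_Isend(&T[1][1][1+rows_per_process*rank], Nx*Ny, MPI_DOUBLE,
>                   rank+1, comm_tag+rank, MPI_COMM_WORLD, &request);
>         /* ...could overlap other work here... */
>         MPI_Wait(&request, &status);  /* legal: request came from MPI_Isend */
>     }
>
>     if (rank % 2 == 1 && rank != 0) {
>         MPI_Irecv(&T[1][1][1+rows_per_process*rank], Nx*Ny, MPI_DOUBLE,
>                   rank-1, comm_tag+rank-1, MPI_COMM_WORLD, &request);
>         MPI_Wait(&request, &status);  /* legal: request came from MPI_Irecv */
>     }
>
> With the plain MPI_Send/MPI_Recv you have now there is no request to wait
> on, so simply deleting the MPI_Wait is enough.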
>
> Josh
>
> On Feb 16, 2005, at 2:12 PM, Kumar, Ravi Ranjan wrote:
>
> > Thanks a lot Jeff!
> >
> > I could fix the first error (invalid rank) by putting appropriate
> > bounds.
> > However, the second error (SIGSEGV in MPI_Wait) is still appearing. Is
> > it due to wrong arguments in MPI_Wait?
> >
> > This is the subroutine for data exchange:
> >
> > void exchange_interface_data(int rank, int local_Nz, int comm_tag)
> > {
> >
> >   int err;
> >   MPI_Status status;
> >   MPI_Request request;
> >
> >   if(rank%2==0 && rank != num_processes-1)
> >     MPI_Send(&T[1][1][1+rows_per_process*rank], Nx*Ny, MPI_DOUBLE,
> >              rank+1, comm_tag+rank, MPI_COMM_WORLD);
> >
> >   if(rank%2==1 && rank != 0)
> >     MPI_Recv(&T[1][1][1+rows_per_process*rank], Nx*Ny, MPI_DOUBLE,
> >              rank-1, comm_tag+rank-1, MPI_COMM_WORLD, &status);
> >
> >   MPI_Wait(&request,&status);
> >
> >   if(err==1)
> >     cout<<"Error in MPI_Send/Recv"<<endl;
> >
> >   if(rank%2==1 && rank != 0)
> >     MPI_Send(&T[1][1][local_Nz], Nx*Ny, MPI_DOUBLE, rank-1,
> >              comm_tag+rank+50, MPI_COMM_WORLD);
> >
> >   if(rank%2==0 && rank != num_processes-1)
> >     MPI_Recv(&T[1][1][local_Nz], Nx*Ny, MPI_DOUBLE, rank+1,
> >              comm_tag+rank+51, MPI_COMM_WORLD, &status);
> >
> >
> > }
> >
> > Am I using MPI_Wait correctly or not? It would be really great if you
> > could help me.
> >
> > Thanks a lot!
> >
> > Ravi R. Kumar
> > Research Assistant
> > 318 RGAN, RTL
> > University of Kentucky
> > (859) 257-6336 x 80697
> >
> >
> >
> >
> > Quoting Jeff Squyres <jsquyres_at_[hidden]>:
> >
> >> It's hard to debug someone else's application, particularly one as
> >> complex as this, especially without seeing the entire application.
> >> However, I have a few suggestions for you:
> >>
> >> 1. It seems that the outputs contain two errors: invalid rank in
> >> MPI_SEND and segv in MPI_WAIT. The invalid rank should be pretty easy
> >> to track down. I notice that one of your MPI_SEND's goes to rank+1,
> >> but you don't do any bounds checking to ensure that rank+1 is a valid
> >> rank in MPI_COMM_WORLD (e.g., if you had an odd number of processes);
> >> a minimal guard is sketched after point 2.
> >>
> >> 2. A seg fault in MPI_WAIT is typically (but not always) a symptom of
> >> memory badness elsewhere in the application (e.g., a buffer overflow).
> >> I highly suggest that you run your application through a
> >> memory-checking debugger (such as Valgrind, if you're running on
> >> x86/Linux) to see what it can find for you. See the LAM FAQ for
> >> details on how to do this.
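> >>
> >> For point 1, the kind of guard I mean would look something like this
> >> (just a sketch, reusing the names from your code):
> >>
> >>     int num_processes;
> >>     MPI_Comm_size(MPI_COMM_WORLD, &num_processes);
> >>
> >>     /* only send "up" if rank+1 actually exists */
> >>     if (rank % 2 == 0 && rank + 1 < num_processes)
> >>         MPI_Send(&T[1][1][1+rows_per_process*rank], Nx*Ny, MPI_DOUBLE,
> >>                  rank+1, comm_tag+rank, MPI_COMM_WORLD);
> >>
> >>     /* only receive "from below" if rank-1 actually exists */
> >>     if (rank % 2 == 1 && rank - 1 >= 0)
> >>         MPI_Recv(&T[1][1][1+rows_per_process*rank], Nx*Ny, MPI_DOUBLE,
> >>                  rank-1, comm_tag+rank-1, MPI_COMM_WORLD, &status);
> >>
> >> For point 2, with LAM you can typically interpose Valgrind directly on
> >> the command line, e.g. something like "mpirun -np 5 valgrind ./foo";
> >> the FAQ has the exact recipe.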
> >>
> >> As a final suggestion, unless you're simply trying to segregate your
> >> debugging output, I'd remove the calls to MPI_BARRIER -- they don't
> >> seem to serve any purpose.
> >>
> >>
> >>
> >> On Feb 15, 2005, at 2:07 PM, Kumar, Ravi Ranjan wrote:
> >>
> >>> Hello,
> >>>
> >>> I have been trying to fix the error that might be due to MPI_Send or
> >>> MPI_Recv.
> >>> I am trying to implement Red-Black SOR for solving a 3-D heat
> >>> conduction
> >>> problem which requires parallel solution of the system of linear
> >>> equations At=F.
> >>>
> >>> A is a 7-banded coefficient matrix stored in an N x 7 2-D array; t
> >>> represents the temperature field for the 3-D domain and is therefore
> >>> stored in a 3-D array. F is a vector (single-column matrix).
> >>>
> >>> I divided the cuboidal piece along its thickness and assigned each
> >>> slice to a processor. Within a slice, red & black planes are defined
> >>> one after another. Data needs to be exchanged between adjacent slices
> >>> to achieve a parallel solution of the above problem.
> >>>
> >>> below is the code & subroutine I am using:
> >>>
> >>> -----------------------------------------------------------
> >>> for(n=1; n<=Nt; n++)
> >>> {
> >>>
> >>> cout<<"This is rank number - "<<rank<<endl;
> >>>
> >>> if(rank != num_processes-1) local_Nz = rows_per_process*(1+rank);
> >>> else local_Nz = Nz;
> >>>
> >>> for(k=1+rank*rows_per_process; k<=local_Nz; k++)
> >>>   cout<<"rank = "<<rank<<" Temp = "<<T[temp1][temp2][k]
> >>>       <<" local Nz = "<<local_Nz<<endl;
> >>>
> >>>
> >>> calculate_F(rank, local_Nz);
> >>>
> >>> comm_tag = n;
> >>> exchange_interface_data(rank, local_Nz, comm_tag);
> >>>
> >>> Red_SOR(A, F, T, rank, local_Nz);
> >>>
> >>> comm_tag = n+1;
> >>> exchange_interface_data(rank, local_Nz, comm_tag);
> >>>
> >>> Black_SOR(A, F, T, rank, local_Nz);
> >>>
> >>> for(k=1+rank*rows_per_process; k<=local_Nz; k++)
> >>>   for(j=1; j<=Ny; j++)
> >>>     for(i=1; i<=Nx; i++)
> >>>       u[i][j][k] = (1 + 2*Tq/t) * T[i][j][k]
> >>>                  + (1 - 2*Tq/t) * old_T[i][j][k] - u[i][j][k];
> >>>
> >>>
> >>> for(k=1+rank*rows_per_process; k<=local_Nz; k++)
> >>>   for(j=1; j<=Ny; j++)
> >>>     for(i=1; i<=Nx; i++)
> >>>       old_T[i][j][k] = T[i][j][k];
> >>>
> >>>
> >>> cout<<rank<<" prints value of F[1] = "<<F[1]<<" m = "<<m<<endl;
> >>>
> >>> MPI_Barrier(MPI_COMM_WORLD);
> >>>
> >>>
> >>> if(rank == 0)
> >>> {
> >>> outfile_SOR.precision(20);
> >>> outfile_SOR<<setw(20)<<t*n<<" "<<setw(20)<<T[1][1][1]<<" "<<endl;
> >>> cout<<"n = "<<n<<" Nt = "<<Nt<<endl;
> >>> }
> >>>
> >>> for(k=1+rank*rows_per_process; k<=local_Nz; k++)
> >>>   cout<<"rank = "<<rank<<" Temp = "<<T[temp1][temp2][k]
> >>>       <<" local Nz = "<<local_Nz<<endl;
> >>>
> >>> MPI_Barrier(MPI_COMM_WORLD);
> >>>
> >>> }
> >>>
> >>> -------------------------------------------------------
> >>>
> >>> and the subroutine for data exchange is as follows:
> >>>
> >>> -------------------------------------------------------------
> >>>
> >>> void exchange_interface_data(int rank, int local_Nz, int comm_tag)
> >>> {
> >>>
> >>>   int err;
> >>>   MPI_Status status;
> >>>   MPI_Request request;
> >>>
> >>>   cout<<rank<<" printing from exchange_interface_data subroutine"<<endl;
> >>>
> >>>   if(rank%2==0)
> >>>     MPI_Send(&T[1][1][1+rows_per_process*rank], Nx*Ny, MPI_DOUBLE,
> >>>              rank+1, comm_tag+rank, MPI_COMM_WORLD);
> >>>
> >>>   if(rank%2==1)
> >>>     MPI_Recv(&T[1][1][1+rows_per_process*rank], Nx*Ny, MPI_DOUBLE,
> >>>              rank-1, comm_tag+rank-1, MPI_COMM_WORLD, &status);
> >>>
> >>>   MPI_Wait(&request,&status);
> >>>
> >>>   if(err==1)
> >>>   {
> >>>     cout<<"Error in MPI_Send/Recv"<<endl;
> >>>   }
> >>>
> >>>   if(rank%2==1)
> >>>     MPI_Send(&T[1][1][local_Nz], Nx*Ny, MPI_DOUBLE, rank-1,
> >>>              comm_tag+rank+50, MPI_COMM_WORLD);
> >>>
> >>>   if(rank%2==0)
> >>>     MPI_Recv(&T[1][1][local_Nz], Nx*Ny, MPI_DOUBLE, rank+1,
> >>>              comm_tag+rank+51, MPI_COMM_WORLD, &status);
> >>>
> >>>   cout<<"end of exchange_interface_data subroutine"<<endl;
> >>> }
> >>>
> >>> ----------------------------------------------
> >>>
> >>> these are the outputs for two different cases:
> >>>
> >>> *******************************************************
> >>> [rrkuma0_at_kfc1s1 SOR]$ mpirun -np 5 foo
> >>> Mon Feb 14 23:37:07 2005
> >>>
> >>> enter the value of physical time (in pico-seconds)
> >>> .01
> >>> enter the value of space step size in X dierction (nano-meter)
> >>> 20
> >>> enter the value of space step size in Y dierction(nano-meter)
> >>> 20
> >>> enter the value of space step size in Z dierction(nano-meter)
> >>> 5
> >>> enter the number of rows/planes per processor
> >>> 4
> >>> Enter the value of time step t =
> >>> .01
> >>> Nx = 26 Ny = 26 Nz = 21 delta t = 0.01 Nt = 1
> >>> a = 2171.39 b = -22.5012 c = -22.5012 d = -360.02
> >>> This is rank number - 0
> >>> rank = 0 Temp = 300 local Nz = 4
> >>> rank = 0 Temp = 300 local Nz = 4
> >>> rank = 0 Temp = 300 local Nz = 4
> >>> rank = 0 Temp = 300 local Nz = 4
> >>> Printing from F - rank = 0
> >>> 0 printing from exchange_interface_data subroutine
> >>> This is rank number - 2
> >>> This is rank number - 1
> >>> rank = 1 Temp = 300 local Nz = 8
> >>> rank = 1 Temp = 300 local Nz = 8
> >>> rank = 1 Temp = 300 local Nz = 8
> >>> rank = 1 Temp = 300 local Nz = 8
> >>> Printing from F - rank = 1
> >>> 1 printing from exchange_interface_data subroutine
> >>> end of exchange_interface_data subroutine
> >>> Printing from Red SOR: rank = 0
> >>> rank 0 max error norm is: 0.00610674
> >>> rank 0 max error norm is: 0.00291226
> >>> rank 0 max error norm is: 0.00138491
> >>> rank 0 max error norm is: 0.000656796
> >>> rank = 2 Temp = 300 local Nz = 12
> >>> rank = 2 Temp = 300 local Nz = 12
> >>> rank = 2 Temp = 300 local Nz = 12
> >>> rank = 2 Temp = 300 local Nz = 12
> >>> Printing from F - rank = 2
> >>> rank 0 max error norm is: 0.000310676
> >>> rank 0 max error norm is: 0.000146586
> >>> rank 0 max error norm is: 6.89966e-05
> >>> rank 0 max error norm is: 3.24e-05
> >>> rank 0 max error norm is: 1.51802e-05
> >>> rank 0 max error norm is: 7.0967e-06
> >>> rank 0 max error norm is: 3.31059e-06
> >>> rank 0 max error norm is: 1.54118e-06
> >>> 2 printing from exchange_interface_data subroutine
> >>> rank 0 max error norm is: 7.16008e-07
> >>> RED: rank = 0 Total number of iterations performed: 13
> >>> 0 printing from exchange_interface_data subroutine
> >>> MPI process rank 0 (n0, p4386) caught a SIGSEGV in MPI_Wait.
> >>> Rank (0, MPI_COMM_WORLD): Call stack within LAM:
> >>> Rank (0, MPI_COMM_WORLD): - MPI_Wait()
> >>> Rank (0, MPI_COMM_WORLD): - main()
> >>> This is rank number - 3
> >>> rank = 3 Temp = 300 local Nz = 16
> >>> rank = 3 Temp = 300 local Nz = 16
> >>> rank = 3 Temp = 300 local Nz = 16
> >>> rank = 3 Temp = 300 local Nz = 16
> >>> Printing from F - rank = 3
> >>> end of exchange_interface_data subroutine
> >>> Printing from Red SOR: rank = 1
> >>> rank 1 max error norm is: 0.00165236
> >>> rank 1 max error norm is: 0.000787998
> >>> rank 1 max error norm is: 0.000374727
> >>> rank 1 max error norm is: 0.000177715
> >>> rank 1 max error norm is: 8.40624e-05
> >>> rank 1 max error norm is: 3.96633e-05
> >>> 3 printing from exchange_interface_data subroutine
> >>> rank 1 max error norm is: 1.86691e-05
> >>> end of exchange_interface_data subroutine
> >>> end of exchange_interface_data subroutine
> >>> Printing from Red SOR: rank = 2
> >>> rank 1 max error norm is: 8.76678e-06
> >>> rank 1 max error norm is: 4.10746e-06
> >>> rank 2 max error norm is: 0.000447094
> >>> rank 1 max error norm is: 1.92022e-06
> >>> rank 1 max error norm is: 8.95779e-07
> >>> RED: rank = 1 Total number of iterations performed: 11
> >>> 1 printing from exchange_interface_data subroutine
> >>> rank 2 max error norm is: 0.000213216
> >>> MPI process rank 1 (n0, p4387) caught a SIGSEGV in MPI_Wait.
> >>> Rank (1, MPI_COMM_WORLD): Call stack within LAM:
> >>> Rank (1, MPI_COMM_WORLD): - MPI_Wait()
> >>> Rank (1, MPI_COMM_WORLD): - main()
> >>> rank 2 max error norm is: 0.000101393
> >>> rank 2 max error norm is: 4.80861e-05
> >>> rank 2 max error norm is: 2.27456e-05
> >>> -----------------------------------------------------------------------------
> >>>
> >>> One of the processes started by mpirun has exited with a nonzero exit
> >>> code. This typically indicates that the process finished in error.
> >>> If your process did not finish in error, be sure to include a "return
> >>> 0" or "exit(0)" in your C code before exiting the application.
> >>>
> >>> PID 4386 failed on node n0 with exit status 1.
> >>> -----------------------------------------------------------------------------
> >>> rank 2 max error norm is: 1.07321e-05
> >>> Printing from Red SOR: rank = 3
> >>>
> >>> *****************************************************************
> >>>
> >>> another output is:
> >>>
> >>> ******************************************
> >>>
> >>> [rrkuma0_at_kfc1s1 SOR]$ mpirun -np 5 foo
> >>> Tue Feb 15 14:06:07 2005
> >>>
> >>> enter the value of physical time (in pico-seconds)
> >>> .01
> >>> enter the value of space step size in X dierction (nano-meter)
> >>> 10
> >>> enter the value of space step size in Y dierction(nano-meter)
> >>> 10
> >>> enter the value of space step size in Z dierction(nano-meter)
> >>> 2
> >>> enter the number of rows/planes per processor
> >>> 10
> >>> Enter the value of time step t =
> >>> .01
> >>> Nx = 51 Ny = 51 Nz = 51 delta t = 0.01 Nt = 1
> >>> a = 6221.61 b = -90.005 c = -90.005 d = -2250.12
> >>> This is rank number - 0
> >>> rank = 0 Temp = 300 local Nz = 10
> >>> rank = 0 Temp = 300 local Nz = 10
> >>> rank = 0 Temp = 300 local Nz = 10
> >>> rank = 0 Temp = 300 local Nz = 10
> >>> rank = 0 Temp = 300 local Nz = 10
> >>> rank = 0 Temp = 300 local Nz = 10
> >>> rank = 0 Temp = 300 local Nz = 10
> >>> rank = 0 Temp = 300 local Nz = 10
> >>> rank = 0 Temp = 300 local Nz = 10
> >>> rank = 0 Temp = 300 local Nz = 10
> >>> Printing from F - rank = 0
> >>> This is rank number - 1
> >>> rank = 1 Temp = 300 local Nz = 20
> >>> rank = 1 Temp = 300 local Nz = 20
> >>> rank = 1 Temp = 300 local Nz = 20
> >>> rank = 1 Temp = 300 local Nz = 20
> >>> rank = 1 Temp = 300 local Nz = 20
> >>> rank = 1 Temp = 300 local Nz = 20
> >>> rank = 1 Temp = 300 local Nz = 20
> >>> rank = 1 Temp = 300 local Nz = 20
> >>> rank = 1 Temp = 300 local Nz = 20
> >>> rank = 1 Temp = 300 local Nz = 20
> >>> Printing from F - rank = 1
> >>> This is rank number - 2
> >>> rank = 2 Temp = 300 local Nz = 30
> >>> rank = 2 Temp = 300 local Nz = 30
> >>> rank = 2 Temp = 300 local Nz = 30
> >>> rank = 2 Temp = 300 local Nz = 30
> >>> rank = 2 Temp = 300 local Nz = 30
> >>> rank = 2 Temp = 300 local Nz = 30
> >>> rank = 2 Temp = 300 local Nz = 30
> >>> rank = 2 Temp = 300 local Nz = 30
> >>> rank = 2 Temp = 300 local Nz = 30
> >>> rank = 2 Temp = 300 local Nz = 30
> >>> Printing from F - rank = 2
> >>> This is rank number - 4
> >>> rank = 4 Temp = 300 local Nz = 51
> >>> rank = 4 Temp = 300 local Nz = 51
> >>> rank = 4 Temp = 300 local Nz = 51
> >>> rank = 4 Temp = 300 local Nz = 51
> >>> rank = 4 Temp = 300 local Nz = 51
> >>> rank = 4 Temp = 300 local Nz = 51
> >>> rank = 4 Temp = 300 local Nz = 51
> >>> rank = 4 Temp = 300 local Nz = 51
> >>> rank = 4 Temp = 300 local Nz = 51
> >>> rank = 4 Temp = 300 local Nz = 51
> >>> rank = 4 Temp = 300 local Nz = 51
> >>> Printing from F - rank = 4
> >>> 2 printing from exchange_interface_data subroutine
> >>> This is rank number - 3
> >>> rank = 3 Temp = 300 local Nz = 40
> >>> rank = 3 Temp = 300 local Nz = 40
> >>> rank = 3 Temp = 300 local Nz = 40
> >>> rank = 3 Temp = 300 local Nz = 40
> >>> rank = 3 Temp = 300 local Nz = 40
> >>> rank = 3 Temp = 300 local Nz = 40
> >>> rank = 3 Temp = 300 local Nz = 40
> >>> rank = 3 Temp = 300 local Nz = 40
> >>> rank = 3 Temp = 300 local Nz = 40
> >>> rank = 3 Temp = 300 local Nz = 40
> >>> Printing from F - rank = 3
> >>> 1 printing from exchange_interface_data subroutine
> >>> 3 printing from exchange_interface_data subroutine
> >>> end of exchange_interface_data subroutine
> >>> Printing from Red SOR: rank = 3
> >>> end of exchange_interface_data subroutine
> >>> Printing from Red SOR: rank = 2
> >>> rank 2 max error norm is: 0.000159936
> >>> rank 2 max error norm is: 7.79543e-05
> >>> 0 printing from exchange_interface_data subroutine
> >>> end of exchange_interface_data subroutine
> >>> Printing from Red SOR: rank = 1
> >>> end of exchange_interface_data subroutine
> >>> rank 2 max error norm is: 3.77884e-05
> >>> Printing from Red SOR: rank = 0
> >>> rank 2 max error norm is: 1.82237e-05
> >>> rank 1 max error norm is: 0.000591088
> >>> rank 2 max error norm is: 8.74567e-06
> >>> rank 0 max error norm is: 0.00218453
> >>> rank 2 max error norm is: 4.17761e-06
> >>> rank 1 max error norm is: 0.000288101
> >>> rank 2 max error norm is: 1.9867e-06
> >>> rank 2 max error norm is: 9.40774e-07
> >>> RED: rank = 2 Total number of iterations performed: 8
> >>> 2 printing from exchange_interface_data subroutine
> >>> MPI process rank 2 (n0, p5201) caught a SIGSEGV in MPI_Wait.
> >>> Rank (2, MPI_COMM_WORLD): Call stack within LAM:
> >>> Rank (2, MPI_COMM_WORLD): - MPI_Wait()
> >>> Rank (2, MPI_COMM_WORLD): - main()
> >>> 4 printing from exchange_interface_data subroutine
> >>> MPI_Send: invalid rank (rank 4, MPI_COMM_WORLD)
> >>> Rank (4, MPI_COMM_WORLD): Call stack within LAM:
> >>> Rank (4, MPI_COMM_WORLD): - MPI_Send()
> >>> Rank (4, MPI_COMM_WORLD): - main()
> >>> rank 0 max error norm is: 0.00106476
> >>> rank 3 max error norm is: 4.32755e-05
> >>> rank 3 max error norm is: 2.10928e-05
> >>> rank 1 max error norm is: 0.000139657
> >>> -----------------------------------------------------------------------------
> >>>
> >>> One of the processes started by mpirun has exited with a nonzero exit
> >>> code. This typically indicates that the process finished in error.
> >>> If your process did not finish in error, be sure to include a "return
> >>> 0" or "exit(0)" in your C code before exiting the application.
> >>>
> >>> PID 5200 failed on node n0 with exit status 1.
> >>> -----------------------------------------------------------------------------
> >>>
> >>> *********************************************
> >>>
> >>> I have been trying to fix this but could not. If anyone can shed some
> >>> light on this, I will be obliged. Please help me out.
> >>>
> >>> Thanks!
> >>>
> >>> Ravi R. Kumar
> >>> Research Assistant
> >>> 318 RGAN, RTL
> >>> University of Kentucky
> >>> 859 257-6336 x 80697
> >>>
> >>>
> >>>
> >>>
> >>>
> >>>
> >>> _______________________________________________
> >>> This list is archived at http://www.lam-mpi.org/MailArchives/lam/
> >>>
> >>
> >> --
> >> {+} Jeff Squyres
> >> {+} jsquyres_at_[hidden]
> >> {+} http://www.lam-mpi.org/
> >>
> >> _______________________________________________
> >> This list is archived at http://www.lam-mpi.org/MailArchives/lam/
> >>
> >
> > _______________________________________________
> > This list is archived at http://www.lam-mpi.org/MailArchives/lam/
> >
> ----
> Josh Hursey
> jjhursey_at_[hidden]
> http://www.lam-mpi.org/
>
> _______________________________________________
> This list is archived at http://www.lam-mpi.org/MailArchives/lam/
>