Hello,
I have been trying to fix an error that might be due to MPI_Send or MPI_Recv.
I am trying to implement Red-Black SOR for solving a 3-D heat conduction
problem, which requires a parallel solution of the system of linear equations
At=F. A is a 7-banded coefficient matrix stored in an N x 7 2-D array; t
represents the temperature field for the 3-D domain and is hence stored in a
3-D array. F is a vector (single-column matrix).
I divided the cuboidal piece along its thickness and assigned each slice to a
processor. Within a slice, red and black planes are defined one after another.
Data needs to be exchanged between adjacent slices to achieve a parallel
solution of this problem.
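To make the decomposition concrete, here is a minimal sketch of how the range of planes in each slice follows from the rank. The struct `Slab` and the helper `slab_range` are hypothetical names used only for this illustration; the arithmetic mirrors the `rows_per_process` and `local_Nz` logic in my time loop.

```cpp
// Hypothetical helper type: first and last z-plane (1-based, inclusive)
// owned by one processor.
struct Slab {
    int first_k;
    int last_k;
};

// Every rank owns rows_per_process planes, except the last rank,
// which takes all remaining planes up to Nz.
Slab slab_range(int rank, int num_processes, int rows_per_process, int Nz) {
    Slab s;
    s.first_k = 1 + rank * rows_per_process;
    s.last_k = (rank != num_processes - 1) ? rows_per_process * (1 + rank) : Nz;
    return s;
}
```

With the values of the first run shown further down (5 processes, 4 planes per processor, Nz = 21), this gives planes 1 to 4 for rank 0 and planes 17 to 21 for rank 4.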
Below is the code and the subroutine I am using:
-----------------------------------------------------------
for(n=1; n<=Nt; n++)
{
    cout<<"This is rank number - "<<rank<<endl;
    if(rank != num_processes-1) local_Nz = rows_per_process*(1+rank);
    else local_Nz = Nz;
    for(k=1+rank*rows_per_process; k<=local_Nz; k++)
        cout<<"rank = "<<rank<<" Temp = "<<T[temp1][temp2][k]<<" local Nz = "<<local_Nz<<endl;
    calculate_F(rank, local_Nz);
    comm_tag = n;
    exchange_interface_data(rank, local_Nz, comm_tag);
    Red_SOR(A, F, T, rank, local_Nz);
    comm_tag = n+1;
    exchange_interface_data(rank, local_Nz, comm_tag);
    Black_SOR(A, F, T, rank, local_Nz);
    for(k=1+rank*rows_per_process; k<=local_Nz; k++)
        for(j=1; j<=Ny; j++)
            for(i=1; i<=Nx; i++)
                u[i][j][k] = (1 + 2*Tq/t) * T[i][j][k] + (1 - 2*Tq/t) * old_T[i][j][k] - u[i][j][k];
    for(k=1+rank*rows_per_process; k<=local_Nz; k++)
        for(j=1; j<=Ny; j++)
            for(i=1; i<=Nx; i++)
                old_T[i][j][k] = T[i][j][k];
    cout<<rank<<" prints value of F[1] = "<<F[1]<<" m = "<<m<<endl;
    MPI_Barrier(MPI_COMM_WORLD);
    if(rank == 0)
    {
        outfile_SOR.precision(20);
        outfile_SOR<<setw(20)<<t*n<<" "<<setw(20)<<T[1][1][1]<<" "<<endl;
        cout<<"n = "<<n<<" Nt = "<<Nt<<endl;
    }
    for(k=1+rank*rows_per_process; k<=local_Nz; k++)
        cout<<"rank = "<<rank<<" Temp = "<<T[temp1][temp2][k]<<" local Nz = "<<local_Nz<<endl;
    MPI_Barrier(MPI_COMM_WORLD);
}
-------------------------------------------------------
and the subroutine for data exchange is as follows:
-------------------------------------------------------------
void exchange_interface_data(int rank, int local_Nz, int comm_tag)
{
    int err;
    MPI_Status status;
    MPI_Request request;
    cout<<rank<<" printing from exchange_interface_data subroutine"<<endl;
    if(rank%2==0)
        MPI_Send(&T[1][1][1+rows_per_process*rank], Nx*Ny, MPI_DOUBLE, rank+1,
                 comm_tag+rank, MPI_COMM_WORLD);
    if(rank%2==1)
        MPI_Recv(&T[1][1][1+rows_per_process*rank], Nx*Ny, MPI_DOUBLE, rank-1,
                 comm_tag+rank-1, MPI_COMM_WORLD, &status);
    MPI_Wait(&request,&status);
    if(err==1)
    {
        cout<<"Error in MPI_Send/Recv"<<endl;
    }
    if(rank%2==1)
        MPI_Send(&T[1][1][local_Nz], Nx*Ny, MPI_DOUBLE, rank-1,
                 comm_tag+rank+50, MPI_COMM_WORLD);
    if(rank%2==0)
        MPI_Recv(&T[1][1][local_Nz], Nx*Ny, MPI_DOUBLE, rank+1,
                 comm_tag+rank+51, MPI_COMM_WORLD, &status);
    cout<<"end of exchange_interface_data subroutine"<<endl;
}
----------------------------------------------
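The rank pairing used by the subroutine above can be checked without running MPI. The sketch below uses two hypothetical helpers, `first_send_destination` and `is_valid_rank`, that are not part of my program. It only reproduces the destination arithmetic of the first MPI_Send and is not meant as a fix.

```cpp
// Hypothetical helper: destination rank of the first MPI_Send in
// exchange_interface_data, or -1 if this rank does not send in that phase.
// Even ranks send to rank+1; odd ranks only receive.
int first_send_destination(int rank) {
    return (rank % 2 == 0) ? rank + 1 : -1;
}

// A destination is a valid rank only if it lies in [0, num_processes).
bool is_valid_rank(int r, int num_processes) {
    return r >= 0 && r < num_processes;
}
```

For `mpirun -np 5`, rank 4 is even, so its destination would be rank 5, which lies outside 0 to 4. That seems consistent with the "MPI_Send: invalid rank" message in the second output below, though I have not confirmed that this is the only problem.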
These are the outputs for two different cases:
*******************************************************
[rrkuma0@kfc1s1 SOR]$ mpirun -np 5 foo
Mon Feb 14 23:37:07 2005
enter the value of physical time (in pico-seconds)
.01
enter the value of space step size in X dierction (nano-meter)
20
enter the value of space step size in Y dierction(nano-meter)
20
enter the value of space step size in Z dierction(nano-meter)
5
enter the number of rows/planes per processor
4
Enter the value of time step t =
.01
Nx = 26 Ny = 26 Nz = 21 delta t = 0.01 Nt = 1
a = 2171.39 b = -22.5012 c = -22.5012 d = -360.02
This is rank number - 0
rank = 0 Temp = 300 local Nz = 4
rank = 0 Temp = 300 local Nz = 4
rank = 0 Temp = 300 local Nz = 4
rank = 0 Temp = 300 local Nz = 4
Printing from F - rank = 0
0 printing from exchange_interface_data subroutine
This is rank number - 2
This is rank number - 1
rank = 1 Temp = 300 local Nz = 8
rank = 1 Temp = 300 local Nz = 8
rank = 1 Temp = 300 local Nz = 8
rank = 1 Temp = 300 local Nz = 8
Printing from F - rank = 1
1 printing from exchange_interface_data subroutine
end of exchange_interface_data subroutine
Printing from Red SOR: rank = 0
rank 0 max error norm is: 0.00610674
rank 0 max error norm is: 0.00291226
rank 0 max error norm is: 0.00138491
rank 0 max error norm is: 0.000656796
rank = 2 Temp = 300 local Nz = 12
rank = 2 Temp = 300 local Nz = 12
rank = 2 Temp = 300 local Nz = 12
rank = 2 Temp = 300 local Nz = 12
Printing from F - rank = 2
rank 0 max error norm is: 0.000310676
rank 0 max error norm is: 0.000146586
rank 0 max error norm is: 6.89966e-05
rank 0 max error norm is: 3.24e-05
rank 0 max error norm is: 1.51802e-05
rank 0 max error norm is: 7.0967e-06
rank 0 max error norm is: 3.31059e-06
rank 0 max error norm is: 1.54118e-06
2 printing from exchange_interface_data subroutine
rank 0 max error norm is: 7.16008e-07
RED: rank = 0 Total number of iterations performed: 13
0 printing from exchange_interface_data subroutine
MPI process rank 0 (n0, p4386) caught a SIGSEGV in MPI_Wait.
Rank (0, MPI_COMM_WORLD): Call stack within LAM:
Rank (0, MPI_COMM_WORLD): - MPI_Wait()
Rank (0, MPI_COMM_WORLD): - main()
This is rank number - 3
rank = 3 Temp = 300 local Nz = 16
rank = 3 Temp = 300 local Nz = 16
rank = 3 Temp = 300 local Nz = 16
rank = 3 Temp = 300 local Nz = 16
Printing from F - rank = 3
end of exchange_interface_data subroutine
Printing from Red SOR: rank = 1
rank 1 max error norm is: 0.00165236
rank 1 max error norm is: 0.000787998
rank 1 max error norm is: 0.000374727
rank 1 max error norm is: 0.000177715
rank 1 max error norm is: 8.40624e-05
rank 1 max error norm is: 3.96633e-05
3 printing from exchange_interface_data subroutine
rank 1 max error norm is: 1.86691e-05
end of exchange_interface_data subroutine
end of exchange_interface_data subroutine
Printing from Red SOR: rank = 2
rank 1 max error norm is: 8.76678e-06
rank 1 max error norm is: 4.10746e-06
rank 2 max error norm is: 0.000447094
rank 1 max error norm is: 1.92022e-06
rank 1 max error norm is: 8.95779e-07
RED: rank = 1 Total number of iterations performed: 11
1 printing from exchange_interface_data subroutine
rank 2 max error norm is: 0.000213216
MPI process rank 1 (n0, p4387) caught a SIGSEGV in MPI_Wait.
Rank (1, MPI_COMM_WORLD): Call stack within LAM:
Rank (1, MPI_COMM_WORLD): - MPI_Wait()
Rank (1, MPI_COMM_WORLD): - main()
rank 2 max error norm is: 0.000101393
rank 2 max error norm is: 4.80861e-05
rank 2 max error norm is: 2.27456e-05
-----------------------------------------------------------------------------
One of the processes started by mpirun has exited with a nonzero exit
code. This typically indicates that the process finished in error.
If your process did not finish in error, be sure to include a "return
0" or "exit(0)" in your C code before exiting the application.
PID 4386 failed on node n0 with exit status 1.
-----------------------------------------------------------------------------
rank 2 max error norm is: 1.07321e-05
Printing from Red SOR: rank = 3
*****************************************************************
The output of the second run is:
******************************************
[rrkuma0@kfc1s1 SOR]$ mpirun -np 5 foo
Tue Feb 15 14:06:07 2005
enter the value of physical time (in pico-seconds)
.01
enter the value of space step size in X dierction (nano-meter)
10
enter the value of space step size in Y dierction(nano-meter)
10
enter the value of space step size in Z dierction(nano-meter)
2
enter the number of rows/planes per processor
10
Enter the value of time step t =
.01
Nx = 51 Ny = 51 Nz = 51 delta t = 0.01 Nt = 1
a = 6221.61 b = -90.005 c = -90.005 d = -2250.12
This is rank number - 0
rank = 0 Temp = 300 local Nz = 10
rank = 0 Temp = 300 local Nz = 10
rank = 0 Temp = 300 local Nz = 10
rank = 0 Temp = 300 local Nz = 10
rank = 0 Temp = 300 local Nz = 10
rank = 0 Temp = 300 local Nz = 10
rank = 0 Temp = 300 local Nz = 10
rank = 0 Temp = 300 local Nz = 10
rank = 0 Temp = 300 local Nz = 10
rank = 0 Temp = 300 local Nz = 10
Printing from F - rank = 0
This is rank number - 1
rank = 1 Temp = 300 local Nz = 20
rank = 1 Temp = 300 local Nz = 20
rank = 1 Temp = 300 local Nz = 20
rank = 1 Temp = 300 local Nz = 20
rank = 1 Temp = 300 local Nz = 20
rank = 1 Temp = 300 local Nz = 20
rank = 1 Temp = 300 local Nz = 20
rank = 1 Temp = 300 local Nz = 20
rank = 1 Temp = 300 local Nz = 20
rank = 1 Temp = 300 local Nz = 20
Printing from F - rank = 1
This is rank number - 2
rank = 2 Temp = 300 local Nz = 30
rank = 2 Temp = 300 local Nz = 30
rank = 2 Temp = 300 local Nz = 30
rank = 2 Temp = 300 local Nz = 30
rank = 2 Temp = 300 local Nz = 30
rank = 2 Temp = 300 local Nz = 30
rank = 2 Temp = 300 local Nz = 30
rank = 2 Temp = 300 local Nz = 30
rank = 2 Temp = 300 local Nz = 30
rank = 2 Temp = 300 local Nz = 30
Printing from F - rank = 2
This is rank number - 4
rank = 4 Temp = 300 local Nz = 51
rank = 4 Temp = 300 local Nz = 51
rank = 4 Temp = 300 local Nz = 51
rank = 4 Temp = 300 local Nz = 51
rank = 4 Temp = 300 local Nz = 51
rank = 4 Temp = 300 local Nz = 51
rank = 4 Temp = 300 local Nz = 51
rank = 4 Temp = 300 local Nz = 51
rank = 4 Temp = 300 local Nz = 51
rank = 4 Temp = 300 local Nz = 51
rank = 4 Temp = 300 local Nz = 51
Printing from F - rank = 4
2 printing from exchange_interface_data subroutine
This is rank number - 3
rank = 3 Temp = 300 local Nz = 40
rank = 3 Temp = 300 local Nz = 40
rank = 3 Temp = 300 local Nz = 40
rank = 3 Temp = 300 local Nz = 40
rank = 3 Temp = 300 local Nz = 40
rank = 3 Temp = 300 local Nz = 40
rank = 3 Temp = 300 local Nz = 40
rank = 3 Temp = 300 local Nz = 40
rank = 3 Temp = 300 local Nz = 40
rank = 3 Temp = 300 local Nz = 40
Printing from F - rank = 3
1 printing from exchange_interface_data subroutine
3 printing from exchange_interface_data subroutine
end of exchange_interface_data subroutine
Printing from Red SOR: rank = 3
end of exchange_interface_data subroutine
Printing from Red SOR: rank = 2
rank 2 max error norm is: 0.000159936
rank 2 max error norm is: 7.79543e-05
0 printing from exchange_interface_data subroutine
end of exchange_interface_data subroutine
Printing from Red SOR: rank = 1
end of exchange_interface_data subroutine
rank 2 max error norm is: 3.77884e-05
Printing from Red SOR: rank = 0
rank 2 max error norm is: 1.82237e-05
rank 1 max error norm is: 0.000591088
rank 2 max error norm is: 8.74567e-06
rank 0 max error norm is: 0.00218453
rank 2 max error norm is: 4.17761e-06
rank 1 max error norm is: 0.000288101
rank 2 max error norm is: 1.9867e-06
rank 2 max error norm is: 9.40774e-07
RED: rank = 2 Total number of iterations performed: 8
2 printing from exchange_interface_data subroutine
MPI process rank 2 (n0, p5201) caught a SIGSEGV in MPI_Wait.
Rank (2, MPI_COMM_WORLD): Call stack within LAM:
Rank (2, MPI_COMM_WORLD): - MPI_Wait()
Rank (2, MPI_COMM_WORLD): - main()
4 printing from exchange_interface_data subroutine
MPI_Send: invalid rank (rank 4, MPI_COMM_WORLD)
Rank (4, MPI_COMM_WORLD): Call stack within LAM:
Rank (4, MPI_COMM_WORLD): - MPI_Send()
Rank (4, MPI_COMM_WORLD): - main()
rank 0 max error norm is: 0.00106476
rank 3 max error norm is: 4.32755e-05
rank 3 max error norm is: 2.10928e-05
rank 1 max error norm is: 0.000139657
-----------------------------------------------------------------------------
One of the processes started by mpirun has exited with a nonzero exit
code. This typically indicates that the process finished in error.
If your process did not finish in error, be sure to include a "return
0" or "exit(0)" in your C code before exiting the application.
PID 5200 failed on node n0 with exit status 1.
-----------------------------------------------------------------------------
*********************************************
I have been trying to fix this but could not. If anyone can shed some light on
this, I would be much obliged. Please help me out.
Thanks!
Ravi R. Kumar
Research Assistant
318 RGAN, RTL
University of Kentucky
859 257-6336 x 80697