Hello,
I am using LAM/MPI on an old cluster and wonder if I can get
some help from this mailing list. Here is the problem. I am using an
18-node cluster; each node has 2 CPUs and each CPU supports up to 2
threads, so I assume I can use up to 18*4 = 72 processors. When I run
the following code, an error message always pops up for np=30 or
np=60, but it works fine for np=12 and np=1. The error message is
always the same:
------------------------------------------------------------------------------------------------
One of the processes started by mpirun has exited with a nonzero exit
code. This typically indicates that the process finished in error.
If your process did not finish in error, be sure to include a "return
0" or "exit(0)" in your C code before exiting the application.
PID 22414 failed on node n12 (192.168.14.16) due to signal 11.
------------------------------------------------------------------------------------------------
Here is the part of the code where the node exits. All other PEs
finish writing their files; only one processor fails. I would
appreciate it if anyone could share experience in debugging
errors like this.
code:
....
// my_rank is the processor ID; each PE opens a different file
sprintf(p_obsfile, "%s%d", obsfile, my_rank);
if ((fp = fopen(p_obsfile, "w")) == NULL)
    printf("PE_%d: The file %s cannot be opened\n", my_rank, p_obsfile);
// loc = TotalNum/NumofPE
for (int id = loc*my_rank; id < loc*(my_rank+1); id++) {
    // call a function to calculate U; the function will return the
    // finishing message
    // no communication is needed among processors
    for (int j = 0; j < NUM; j++)
        fprintf(fp, "%f\n", U[j]);  // output updated U
}
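
One thing worth ruling out in the excerpt above: when fopen fails, the code only prints a message and then still calls fprintf on a NULL file pointer, which can by itself raise signal 11. This may or may not be the cause here. Below is a minimal sketch of a stricter variant that stops at the first I/O error. The helper name write_rank_file, the 256-byte path buffer, and the assumption that U holds doubles are all hypothetical and not taken from the original program.

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper (not from the original code): writes n values of U
 * to the file "<prefix><rank>", one value per line.
 * Returns 0 on success and -1 on any failure. */
static int write_rank_file(const char *prefix, int rank,
                           const double *U, int n)
{
    char path[256];  /* assumed size; snprintf lets us detect truncation */
    int len = snprintf(path, sizeof path, "%s%d", prefix, rank);
    if (len < 0 || (size_t)len >= sizeof path) {
        fprintf(stderr, "PE_%d: file name too long\n", rank);
        return -1;
    }

    FILE *fp = fopen(path, "w");
    if (fp == NULL) {
        /* Stop here instead of falling through to fprintf(NULL, ...). */
        fprintf(stderr, "PE_%d: cannot open %s: %s\n",
                rank, path, strerror(errno));
        return -1;
    }

    for (int j = 0; j < n; j++) {
        if (fprintf(fp, "%f\n", U[j]) < 0) {
            fprintf(stderr, "PE_%d: write to %s failed\n", rank, path);
            fclose(fp);
            return -1;
        }
    }

    /* fclose can also report a delayed write error (e.g. a full disk). */
    if (fclose(fp) != 0) {
        fprintf(stderr, "PE_%d: closing %s failed: %s\n",
                rank, path, strerror(errno));
        return -1;
    }
    return 0;
}
```

In the MPI program, a nonzero return could be followed by MPI_Abort(MPI_COMM_WORLD, 1), so that the failing rank reports which file it could not open before the job stops.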