I think the problem is caused by a faulty installation on node No. 5 (the node that hosts CPU No. 13). Please make sure that LAM/MPI is installed correctly on that node and that password-free SSH login to it is working.
-------------------------------------------------------------------------------------------------------
Fang Huazhi
Tel: 13141399478
State Key Laboratory for Advanced Metals and Materials
University of Science and Technology Beijing, Beijing10083, P.R. China
-------------------------------------------------------------------------------------------------------
--- On Wed, 24 Mar 2010, Junwei Huang <jwhuang1982_at_[hidden]> wrote:
From: Junwei Huang <jwhuang1982_at_[hidden]>
Subject: LAM: why the occurence of error depends on the number of processors
To: lam_at_[hidden]
Date: Wednesday, 24 March 2010, 4:27
Hello,
I am using LAM/MPI on an old cluster and wonder if I can get
some help from this mailing list. Here is the problem. I am using an
18-node cluster; each node has 2 CPUs and each CPU supports up to 2
threads, so I assume I can use up to 18*4 = 72 processes. When I run
the following code, an error message always pops up for np=30 or
np=60, but it works fine for np=12 and np=1. The error message is
always the same:
------------------------------------------------------------------------------------------------
One of the processes started by mpirun has exited with a nonzero exit
code. This typically indicates that the process finished in error.
If your process did not finish in error, be sure to include a "return
0" or "exit(0)" in your C code before exiting the application.
PID 22414 failed on node n12 (192.168.14.16) due to signal 11.
------------------------------------------------------------------------------------------------
Here is the part of the code where the process exits. All other PEs
finish writing their files, except one. I would appreciate it if
anyone could share their experience in debugging errors like this.
code:
....
sprintf(p_obsfile,"%s%d",obsfile,my_rank); //my_rank is processor ID, each PE opens a different file
     if ((fp=fopen(p_obsfile,"w"))==NULL)
         printf("PE_%d: The file %s cannot be opened\n",my_rank,p_obsfile);
     for (int id=loc*my_rank;id<loc*(my_rank+1);id++){ // loc=TotalNum/NumofPE
         //call a function to calculate U, the function will return the finishing message
         // no communication is needed among processors
         for (int j=0;j<NUM;j++)
             fprintf (fp, "%f\n",U[j]); //output updated U
     }
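
One thing worth ruling out in the fragment above: when fopen() fails, the code prints a message but still goes on to call fprintf() with a NULL file pointer, and that alone can end in signal 11. This would fit a single node that cannot create the output file, although it is only one possible cause. Below is a minimal sketch of a more defensive version. The helper names make_rank_filename and write_rank_output are hypothetical, and U is assumed to be an array of double (the original does not show its type).

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: build "<base><rank>" into buf.
 * Returns 0 on success, -1 if the name would not fit in buflen bytes. */
int make_rank_filename(char *buf, size_t buflen, const char *base, int rank)
{
    int n = snprintf(buf, buflen, "%s%d", base, rank);
    return (n < 0 || (size_t)n >= buflen) ? -1 : 0;
}

/* Hypothetical helper: write num values of u to path, one per line.
 * Assumes u holds doubles. Returns 0 on success, -1 on any failure. */
int write_rank_output(const char *path, int rank, const double *u, int num)
{
    FILE *fp = fopen(path, "w");
    if (fp == NULL) {
        /* Report the reason and stop here instead of writing through NULL. */
        fprintf(stderr, "PE_%d: cannot open %s: %s\n",
                rank, path, strerror(errno));
        return -1;
    }
    for (int j = 0; j < num; j++) {
        if (fprintf(fp, "%f\n", u[j]) < 0) {
            fclose(fp);
            return -1;
        }
    }
    /* fclose can also fail, e.g. when buffered data cannot be flushed. */
    return fclose(fp) == 0 ? 0 : -1;
}
```

In an MPI program, the caller could react to a -1 return by printing the rank and calling MPI_Abort(MPI_COMM_WORLD, 1), so that the failing node is identified before the segmentation fault occurs.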
_______________________________________________
This list is archived at http://www.lam-mpi.org/MailArchives/lam/