Thanks for the message. I can ssh to all nodes without a password. I
suspect the problem is in the parallel I/O, i.e., simply using
fprintf() may cause a race condition when more than one processor
tries to write strided data to the hard disk. I do not know how
processors write to different files on the same hard disk. Are there
any references on this issue, or other suggestions on the problem?
Thank you.
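
A minimal sketch of per-rank output with the fopen() result checked, in
case it helps narrow things down. The helper names make_rank_filename
and write_values are hypothetical and not part of the original program.
Each process has its own FILE stream, so fprintf() to distinct files
does not share a buffer across processes. In the quoted code, however, a
failed fopen() only prints a message, and the loop then calls fprintf()
on a NULL pointer, which is undefined behavior and may be one possible
source of signal 11.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: build "<base><rank>" into buf.
 * Returns 0 on success, -1 if the name does not fit in buflen bytes. */
int make_rank_filename(char *buf, size_t buflen, const char *base, int rank)
{
    int n = snprintf(buf, buflen, "%s%d", base, rank);
    return (n < 0 || (size_t)n >= buflen) ? -1 : 0;
}

/* Hypothetical helper: write count values, one per line, to path.
 * Returns 0 on success, -1 if the file cannot be opened or a write fails,
 * so the caller never touches a NULL FILE pointer. */
int write_values(const char *path, const double *values, size_t count)
{
    assert(values != NULL || count == 0);

    FILE *fp = fopen(path, "w");
    if (fp == NULL) {
        perror(path);   /* report the reason (permissions, missing dir, ...) */
        return -1;
    }

    int status = 0;
    for (size_t j = 0; j < count; j++) {
        if (fprintf(fp, "%f\n", values[j]) < 0) {
            status = -1;    /* e.g. disk full or I/O error */
            break;
        }
    }

    /* fclose() flushes buffered data, so its result matters too. */
    if (fclose(fp) != 0)
        status = -1;
    return status;
}
```

In an MPI program, each rank could call make_rank_filename() with its
own my_rank and call MPI_Abort() if either helper returns -1, so that a
failure is reported with the rank and file name rather than as a bare
segmentation fault.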
On 3/23/10, hz fang <ustbmars_at_[hidden]> wrote:
> I think the problem is due to a faulty installation on node No. 5
> (where CPU No. 13 resides). Please check that lam-mpi is correctly
> installed and that password-free SSH communication works on that
> node.
>
> -------------------------------------------------------------------------------------------------------
> Fang Huazhi
> Tel: 13141399478
> State Key Laboratory for Advanced Metals and Materials
> University of Science and Technology Beijing, Beijing10083, P.R. China
> -------------------------------------------------------------------------------------------------------
>
> --- On Wed, 24 Mar 2010, Junwei Huang <jwhuang1982_at_[hidden]> wrote:
>
>
> From: Junwei Huang <jwhuang1982_at_[hidden]>
> Subject: LAM: why the occurence of error depends on the number of processors
> To: lam_at_[hidden]
> Date: Wednesday, 24 March 2010, 4:27 AM
>
>
> Hello,
> I am using LAM/MPI on an old cluster and wonder if I can get
> some help from this mailing list. Here is the problem. I am using an
> 18-node cluster; each node has 2 CPUs and each CPU supports up to 2
> threads, so I assume I can use up to 18*4 = 72 processes. When running
> the following code, an error message always appears for np=30 or
> np=60, but the code works fine for np=12 and np=1. The error message
> is always the same:
> ------------------------------------------------------------------------------------------------
> One of the processes started by mpirun has exited with a nonzero exit
> code. This typically indicates that the process finished in error.
> If your process did not finish in error, be sure to include a "return
> 0" or "exit(0)" in your C code before exiting the application.
>
> PID 22414 failed on node n12 (192.168.14.16) due to signal 11.
> ------------------------------------------------------------------------------------------------
>
> Here is the part of the code where the process exits. All other PEs
> finish writing their files, except one. I would appreciate it if
> anyone could share experience in debugging errors like this.
>
> code:
> ....
> // my_rank is the processor ID; each PE opens a different file
> sprintf(p_obsfile,"%s%d",obsfile,my_rank);
> if ((fp=fopen(p_obsfile,"w"))==NULL)
>     printf("PE_%d: The file %s cannot be opened\n",my_rank,p_obsfile);
>
> // loc=TotalNum/NumofPE
> for (int id=loc*my_rank;id<loc*(my_rank+1);id++){
>     // call a function to calculate U; the function will return the
>     // finishing message
>     // no communication is needed among processors
>     for (int j=0;j<NUM;j++)
>         fprintf (fp, "%f\n",U[j]); //output updated U
> }
> _______________________________________________
> This list is archived at http://www.lam-mpi.org/MailArchives/lam/