All right, problem resolved - there was a tiny, inconspicuous variable that
is read from a file in process 0 but was not Bcast to the other processes;
it tells them how often to save the results. So while process 0 expected
data to save, the other processes were sending boundary conditions, and the
program therefore behaved differently each time it was executed. Ufff, the
problem is always where you least look for it
:-)
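
For the record, the fix amounts to broadcasting that parameter right after
it is read. Below is a minimal sketch of the idea only - the file name and
the variable save_interval are illustrative, not the actual code:

/* Sketch: rank 0 reads the save interval from a file and broadcasts it,
 * so every rank agrees on when results are collected (names are made up). */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, save_interval = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        FILE *f = fopen("params.txt", "r");   /* hypothetical input file */
        if (f == NULL || fscanf(f, "%d", &save_interval) != 1) {
            fprintf(stderr, "cannot read save interval\n");
            MPI_Abort(MPI_COMM_WORLD, 1);
        }
        fclose(f);
    }

    /* Without this broadcast, only rank 0 knows save_interval, so the
     * ranks disagree on whether the next message is a "save" transfer
     * or a boundary exchange. */
    MPI_Bcast(&save_interval, 1, MPI_INT, 0, MPI_COMM_WORLD);

    /* ... time-stepping loop: every save_interval steps, all ranks send
     * their data to rank 0 instead of exchanging boundaries ... */

    MPI_Finalize();
    return 0;
}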
Vladimir
On Wed, 14 Jan 2004, Vladimir Chalupecky wrote:
> Hi,
>
> I'm running a fairly straightforward program for computing solutions of
> parabolic PDEs using finite differences. From LAM-MPI I use only the Send,
> Recv and Sendrecv functions. There is one problem and two questions:
>
> Problem: running this program in parallel with mpirun (without -nsigs)
> causes a SIGFPE on all nodes except n0, producing the following output
> (or similar, depending on the number of nodes):
>
> MPI process rank 3 (n3, p8344) caught a SIGFPE.
> MPI process rank 1 (n1, p8820) caught a SIGFPE.
> MPI process rank 2 (n2, p9553) caught a SIGFPE.
>
> or it prints one of the following messages when run with
> mpirun -nsigs:
> ---
> MPI_Wait: process in local group is dead (rank 0, MPI_COMM_WORLD)
> Rank (0, MPI_COMM_WORLD): Call stack within LAM:
> Rank (0, MPI_COMM_WORLD): - MPI_Wait()
> Rank (0, MPI_COMM_WORLD): - MPI_Sendrecv()
> Rank (0, MPI_COMM_WORLD): - main()
>
> One of the processes ...
> ---
> MPI_Wait: process in local group is dead (rank 0, MPI_COMM_WORLD)
>
> One of the processes ...
> --- (or only: )
> One of the processes started by mpirun has exited with a nonzero exit
> code. This typically indicates that the process finished in error.
> If your process did not finish in error, be sure to include a "return
> 0" or "exit(0)" in your C code before exiting the application.
>
> PID 9543 failed on node n2 with exit status 1.
> ---
>
> Running this program serially works perfectly, with the expected results
> and without any FPE. I encounter the same problem both on an 8-node
> Pentium 4 Linux cluster with LAM 6.5.6 and on a single-CPU P4 Linux
> workstation with LAM 6.5.8 (putting, for example, cpu=2 in the hostfile).
>
> Question 1: Using this information, can anybody guess where the problem
> might be? Since the computation runs fine serially, I suspect the problem
> is not in my program but in the way I use LAM. But on the other hand, why
> would sending data between computers cause an FPE?
>
> Question 2: I'm using Debian 3.0 with the default compiler, gcc 3.3.2.
> However, with this compiler I cannot link LAM programs, while with 2.95 I
> can (am I right that there is binary incompatibility between these
> versions of gcc?). mpiCC uses g++, so I have to compile my programs with
> g++-2.95 and add the include directories and libraries manually. It works,
> but it's a bit inconvenient. Is there a way to tell mpiCC which compiler
> to use? As far as I can tell from the man page, the only parameter mpiCC
> accepts is -showme.
>
> Thanks for any advice and hints
>
> Vladimir
>