LAM/MPI General User's Mailing List Archives

From: Vladimir Chalupecky (chalupec_at_[hidden])
Date: 2004-01-14 11:33:30


Hi,

I'm running a fairly straightforward program that computes solutions of
parabolic PDEs using finite differences. From LAM/MPI I use only the Send,
Recv and Sendrecv functions. There is one problem and two questions:
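
To be concrete, the communication pattern is roughly the following (a
simplified, self-contained sketch of a 1-D block decomposition with one
layer of ghost cells on each side; the array u, the size N_LOCAL and the
tags are placeholders, not my real code):

#include <mpi.h>

#define N_LOCAL 100                 /* interior points per process (placeholder) */

int main(int argc, char **argv)
{
    double u[N_LOCAL + 2];          /* u[0] and u[N_LOCAL+1] are ghost cells */
    int rank, size, i, left, right;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    for (i = 0; i <= N_LOCAL + 1; i++)
        u[i] = 0.0;                 /* some initial condition */

    /* neighbours; MPI_PROC_NULL at the physical boundaries */
    left  = (rank > 0)        ? rank - 1 : MPI_PROC_NULL;
    right = (rank < size - 1) ? rank + 1 : MPI_PROC_NULL;

    /* send first interior value left, receive right neighbour's value
       into the right ghost cell */
    MPI_Sendrecv(&u[1], 1, MPI_DOUBLE, left, 0,
                 &u[N_LOCAL + 1], 1, MPI_DOUBLE, right, 0,
                 MPI_COMM_WORLD, &status);

    /* send last interior value right, receive left neighbour's value
       into the left ghost cell */
    MPI_Sendrecv(&u[N_LOCAL], 1, MPI_DOUBLE, right, 1,
                 &u[0], 1, MPI_DOUBLE, left, 1,
                 MPI_COMM_WORLD, &status);

    MPI_Finalize();
    return 0;
}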

Problem: running this program in parallel with mpirun (without -nsigs)
causes a SIGFPE on all nodes except n0, outputting the following messages
(or similar, depending on the number of nodes):

MPI process rank 3 (n3, p8344) caught a SIGFPE.
MPI process rank 1 (n1, p8820) caught a SIGFPE.
MPI process rank 2 (n2, p9553) caught a SIGFPE.

Or it prints one of the following messages when running the program with
mpirun -nsigs:

---
MPI_Wait: process in local group is dead (rank 0, MPI_COMM_WORLD)
Rank (0, MPI_COMM_WORLD): Call stack within LAM:
Rank (0, MPI_COMM_WORLD):  - MPI_Wait()
Rank (0, MPI_COMM_WORLD):  - MPI_Sendrecv()
Rank (0, MPI_COMM_WORLD):  - main()
One of the processes ...
---
MPI_Wait: process in local group is dead (rank 0, MPI_COMM_WORLD)
One of the processes ...
--- (or only: )
One of the processes started by mpirun has exited with a nonzero exit
code.  This typically indicates that the process finished in error.
If your process did not finish in error, be sure to include a "return
0" or "exit(0)" in your C code before exiting the application.
PID 9543 failed on node n2 with exit status 1.
---
Running this program serially works perfectly, with the expected results
and without any FPE. I encounter the same problem both on an 8-node
Pentium 4 Linux cluster with LAM 6.5.6 and on a single-CPU Pentium 4 Linux
workstation with LAM 6.5.8 (putting, for example, cpu=2 in the hostfile).
Question 1: Using this information, can anybody guess where the problem
might be? Since the computation runs fine serially, I would guess the
problem is not in my program but in the way I use LAM. But on the other
hand, can sending data between computers really cause an FPE?
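
Just to illustrate what I mean: the only place where data from the other
processes enters the arithmetic at all is through the ghost cells in the
finite-difference update, roughly like this (again only a sketch with
placeholder names, not my actual update):

/* Explicit finite-difference step for the parabolic equation.
   u[0] and u[n+1] hold the ghost values received from the neighbours,
   u[1..n] are the locally owned points; dt and dx are the time step
   and grid spacing. */
void update(double *u_new, const double *u, int n, double dt, double dx)
{
    int i;
    double r = dt / (dx * dx);

    for (i = 1; i <= n; i++)
        u_new[i] = u[i] + r * (u[i - 1] - 2.0 * u[i] + u[i + 1]);
}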
Question 2: I'm using Debian 3.0 with gcc 3.3.2 as the default compiler.
However, with this compiler I cannot link LAM programs, while with gcc 2.95
I can (am I right that there is binary incompatibility between these
versions of gcc?). mpiCC uses g++, so I have to compile my programs with
g++-2.95 and add the include directories and libraries manually. It works,
but it's a bit inconvenient. Is there a way to tell mpiCC which compiler to
use? As far as I can tell from the man page, the only parameter mpiCC
accepts is -showme.
Thanks for any advice and hints,
Vladimir