Dear All,
I am enjoying my parallel environment after installing a beta version of LAM/MPI (7.01b4), which works well with Absoft (f90). But now I am facing another problem:
We have a Linux cluster with 16 nodes (32 CPUs). Each node has 2 CPUs with shared memory, and memory is distributed between the nodes.
The problem is that our parallelized code works well with shared memory (1 node with 2 CPUs), but fails with distributed
memory (2 nodes with 4 CPUs).
I also tested a very simple "hello" program on the cluster, which works well across all 16 nodes (32 CPUs).
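For reference, the hello test is essentially the usual MPI hello-world in f90; a rough sketch (not the exact source I ran) looks like this:

  program hello
    implicit none
    include 'mpif.h'
    integer :: ierr, rank, nprocs, namelen
    character(len=MPI_MAX_PROCESSOR_NAME) :: hostname

    call MPI_INIT(ierr)
    call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
    call MPI_COMM_SIZE(MPI_COMM_WORLD, nprocs, ierr)
    call MPI_GET_PROCESSOR_NAME(hostname, namelen, ierr)

    ! report which host each rank is running on
    print *, 'Process ', rank+1, ' of ', nprocs, ' is alive on ', trim(hostname)

    call MPI_FINALIZE(ierr)
  end program hello
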
I am not sure whether the problem is in LAM or in the parallelized code (I suspect it is not in LAM; I just want to get some ideas from you). The error message is below. Could you please give me some suggestions?
Thanks a lot,
Wei Zhang
CSE Inc.
------------------------------------------
cfd:master % mpirun -np 4 fds4_mpi.exe
Process 1 of 4 is alive on master.xx.xxx.com
Process 2 of 4 is alive on master.xx.xxx.com
Process 3 of 4 is alive on node2.xx.xxx.com
Process 4 of 4 is alive on node2.xx.xxx.com
MPI_Recv: process in local group is dead (rank 0, SSI:coll:smp:coord comm
for CID 0)
Rank (2, MPI_COMM_WORLD): Call stack within LAM:
Rank (2, MPI_COMM_WORLD): - MPI_Recv()
Rank (2, MPI_COMM_WORLD): - MPI_Gather()
Rank (2, MPI_COMM_WORLD): - MPI_Barrier()
Rank (2, MPI_COMM_WORLD): - main()
-----------------------------------------------------------------------------
One of the processes started by mpirun has exited with a nonzero exit
code. This typically indicates that the process finished in error.
If your process did not finish in error, be sure to include a "return
0" or "exit(0)" in your C code before exiting the application.
PID 14103 failed on node n1 (10.0.0.2) due to signal 9.
-----------------------------------------------------------------------------