LAM/MPI General User's Mailing List Archives

From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2003-09-18 14:53:48


On Tue, 16 Sep 2003, wei zhang wrote:

> I am enjoying my parallel environment after installing a beta version
> of LAM/MPI (7.01b4), which works well with Absoft (f90). But now I face
> another problem: we have a Linux system with 16 nodes (32 CPUs); each
> node has 2 CPUs with shared memory, and there is distributed memory
> between the nodes.

Do you have software that effects this distributed shared memory, or are
you referring to the fact that LAM is being used to effect parallelism,
and that LAM is using shared memory between the processes on one node and
other mechanisms (TCP?) between processes on different nodes?

> The problem is that our parallel code works well with shared memory (1
> node with 2 CPUs), but fails with distributed memory (2 nodes with 4
> CPUs). I also tested a very simple code like "hello" on the cluster,
> which works well with 16 nodes (32 CPUs). I am not sure whether the
> problem is in LAM or in the parallel code (I guess the problem may not
> be in LAM; I just want to get some ideas from you). The following is
> the error message; would you please give me some suggestions?

I'm going to assume that you mean that LAM is effecting all the
parallelism, and that you are not using a distributed shared memory
package. If you're using DSM, that might complicate things...

> cfd:master % mpirun -np 4 fds4_mpi.exe
> Process 1 of 4 is alive on master.xx.xxx.com
> Process 2 of 4 is alive on master.xx.xxx.com
> Process 3 of 4 is alive on node2.xx.xxx.com
> Process 4 of 4 is alive on node2.xx.xxx.com
> MPI_Recv: process in local group is dead (rank 0, SSI:coll:smp:coord comm
> for CID 0)
> Rank (2, MPI_COMM_WORLD): Call stack within LAM:
> Rank (2, MPI_COMM_WORLD): - MPI_Recv()
> Rank (2, MPI_COMM_WORLD): - MPI_Gather()
> Rank (2, MPI_COMM_WORLD): - MPI_Barrier()
> Rank (2, MPI_COMM_WORLD): - main()

This means that one of your MPI processes has died during a collective
operation (MPI_Barrier). The obscure communicator name
(SSI:coll:smp:coord) simply means that it was running through the smp
collective module.
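
As a purely hypothetical illustration (this is not your code, just a
minimal C sketch of the failure mode): if one rank dies before or inside
a collective, the surviving ranks sit in the collective's internal
receives and eventually get exactly this kind of error.

  #include <stdlib.h>
  #include <mpi.h>

  /* Minimal sketch: rank 1 "crashes" before the barrier, so the other
     ranks block inside MPI_Barrier() and LAM reports that a process in
     the local group is dead.  Compile with mpicc, run with mpirun. */
  int main(int argc, char *argv[])
  {
      int rank;

      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);

      if (rank == 1) {
          /* stand-in for a segfault or other fatal error */
          abort();
      }

      MPI_Barrier(MPI_COMM_WORLD);
      MPI_Finalize();
      return 0;
  }

The real question is which of your ranks died, and why.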

Can you run your code through a memory checking debugger such as valgrind?
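
An untested sketch (I'm assuming the executable name from your output
above): the usual approach is to put valgrind on the mpirun command line
in front of your application, so that each rank runs under its own copy
of valgrind and memory errors are reported per process:

  mpirun -np 4 valgrind ./fds4_mpi.exe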

-- 
{+} Jeff Squyres
{+} jsquyres_at_[hidden]
{+} http://www.lam-mpi.org/