Hi Jeff,
It has been a while since I last worked on LAM.
I recently reinstalled LAM 7.0.2, and both the shared-memory
and distributed-memory runs work nicely.
We also got a pretty good speedup.
Thank you for your help.
Wei
----- Original Message -----
From: "Jeff Squyres" <jsquyres_at_[hidden]>
To: "General LAM/MPI mailing list" <lam_at_[hidden]>
Sent: Thursday, September 18, 2003 3:53 PM
Subject: Re: LAM: failure for distributed memory
> On Tue, 16 Sep 2003, wei zhang wrote:
>
> > I am enjoying my parallel environment after installing a beta version
> > of LAM/MPI (7.0.1b4), which works well with Absoft (f90). But now I face
> > another problem: we have a Linux system with 16 nodes (32 CPUs); each
> > node has 2 CPUs with shared memory, and memory is distributed between
> > nodes.
>
> Do you have software that effects this distributed shared memory, or are
> you referring to the fact that LAM is being used to effect parallelism,
> and that LAM is using shared memory between the processes on one node and
> other mechanisms (TCP?) between processes on different nodes?
>
> > The problem is that our parallelized code works well with shared memory
> > (1 node with 2 CPUs) but fails with distributed memory (2 nodes with 4
> > CPUs). I also tested a very simple code like "hello" on the cluster,
> > which works well with 16 nodes (32 CPUs). I am not sure whether the
> > problem is in LAM or in the parallelized code (I guess the problem may
> > not be in LAM; I just want to get some ideas from you guys). The
> > following is the error message; would you please give me some suggestions?
>
> I'm going to assume that you mean that LAM is effecting all the
> parallelism, and that you are not using a distributed shared memory
> package. If you're using DSM, that might complicate things...
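[A sketch of how the two transports can be separated in LAM 7, in case it helps isolate the problem; the boot schema contents and the executable name are just placeholders from this thread:]

```
# Hypothetical LAM boot schema file "bhost": two nodes, two CPUs each.
#   master cpu=2
#   node2  cpu=2
lamboot -v bhost            # start the LAM run-time on both nodes

# Force the TCP RPI for every process, even the on-node pairs,
# to see whether the failure follows the inter-node transport:
mpirun -np 4 -ssi rpi tcp ./fds4_mpi.exe

# Compare with the shared-memory RPI on a single node:
mpirun -np 2 -ssi rpi usysv ./fds4_mpi.exe
```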
>
> > cfd:master % mpirun -np 4 fds4_mpi.exe
> > Process 1 of 4 is alive on master.xx.xxx.com
> > Process 2 of 4 is alive on master.xx.xxx.com
> > Process 3 of 4 is alive on node2.xx.xxx.com
> > Process 4 of 4 is alive on node2.xx.xxx.com
> > MPI_Recv: process in local group is dead (rank 0, SSI:coll:smp:coord comm
> > for CID 0)
> > Rank (2, MPI_COMM_WORLD): Call stack within LAM:
> > Rank (2, MPI_COMM_WORLD): - MPI_Recv()
> > Rank (2, MPI_COMM_WORLD): - MPI_Gather()
> > Rank (2, MPI_COMM_WORLD): - MPI_Barrier()
> > Rank (2, MPI_COMM_WORLD): - main()
>
> This means that one of your MPI processes has died during a collective
> operation (MPI_Barrier). The obscure communicator name
> (SSI:coll:smp:coord) simply means that it was running through the smp
> collective module.
>
> Can you run your code through a memory checking debugger such as valgrind?
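[A sketch of that suggestion; valgrind's logging option names vary by version, so the flag shown is an assumption, and valgrind must be installed on every node:]

```
# LAM's mpirun launches whatever command line follows its options,
# so the debugger can wrap the application on each rank:
mpirun -np 4 valgrind --tool=memcheck --logfile=vg-out ./fds4_mpi.exe
# One vg-out.<pid> log is written per process; check the rank that
# dies for invalid reads/writes before it reaches the MPI_Barrier.
```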
>
> --
> {+} Jeff Squyres
> {+} jsquyres_at_[hidden]
> {+} http://www.lam-mpi.org/
> _______________________________________________
> This list is archived at http://www.lam-mpi.org/MailArchives/lam/