
LAM/MPI General User's Mailing List Archives


From: Pravin R Joshi (pravenj_at_[hidden])
Date: 2004-09-09 13:58:19


This is my output from tping:
  1 byte from 1 remote node and 1 local node: 0.001 secs
  1 byte from 1 remote node and 1 local node: 0.001 secs
2 messages, 2 bytes (0.002K), 0.002 secs (2.009K/sec)
roundtrip min/avg/max: 0.001/0.001/0.001

and from lamnodes:

lamnodes
n0 medusa.lab.ac.uab.edu:1:origin,this_node
n1 node1:1:

So I guess both machines are visible to each other. The program reduc is
on an NFS-mounted disk, so the same copy of it is seen by both
machines.
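
For anyone who wants to script the CPU-count check instead of reading the
lamnodes output by eye, here is a minimal Python sketch. It assumes each
node line has the form "nID host:cpus:flags", as in the output above; the
names LamNode and parse_lamnodes are made up for illustration and are not
part of LAM/MPI.

```python
from typing import List, NamedTuple, Tuple


class LamNode(NamedTuple):
    """One line of lamnodes output (hypothetical helper type)."""
    node_id: str
    host: str
    cpus: int
    flags: Tuple[str, ...]


def parse_lamnodes(text: str) -> List[LamNode]:
    """Parse lines shaped like 'n0 host:1:origin,this_node'.

    Assumes the 'nID host:cpus:flags' layout seen in this thread;
    lines that do not match that layout are skipped.
    """
    nodes = []
    for line in text.splitlines():
        parts = line.strip().split(None, 1)
        if len(parts) != 2 or not parts[0].startswith("n"):
            continue
        node_id, rest = parts
        # Split from the right so the host field is kept whole.
        fields = rest.rsplit(":", 2)
        if len(fields) != 3 or not fields[1].isdigit():
            continue
        host, cpus, flags = fields
        flag_list = tuple(f for f in flags.split(",") if f)
        nodes.append(LamNode(node_id, host, int(cpus), flag_list))
    return nodes


# Sample text copied from the lamnodes output in this message.
SAMPLE = """lamnodes
n0 medusa.lab.ac.uab.edu:1:origin,this_node
n1 node1:1:
"""

if __name__ == "__main__":
    for node in parse_lamnodes(SAMPLE):
        print(node.node_id, node.host, node.cpus, ",".join(node.flags))
```

With the sample above, both nodes parse with a CPU count of 1, which
matches what I read from the raw output.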

Pravin

On Thursday 09 September 2004 12:31, Jeff Squyres wrote:
> So if LAM is booted on both nodes, double check this with the "tping"
> command, for example:
>
> tping -c 2 N
>
> And ensure that both nodes can be "seen". Also run lamnodes and verify
> that LAM thinks that there is only 1 CPU on each machine (i.e., mpirun
> is not trying to run 2 copies of reduc on one machine).
>
> It sounds like mpirun tried to launch 2 copies of reduc -- and as far
> as it knows, it *did* launch 2 copies of reduc (probably one on each
> node), but the reduc that it found on one machine was not an MPI
> process. Specifically, you *should* get a "file not found" error if it
> can't find reduc on one machine. So it must be finding it, but perhaps
> it's finding the "wrong" reduc (i.e., one that is not an MPI process
> and does not call MPI_INIT)...?
>
> On Sep 9, 2004, at 12:30 PM, Pravin R Joshi wrote:
> > Hi,
> > I am trying to get LAM/MPI 7.0.6 working on two nodes of a cluster
> > using RedHat Linux 9. I installed an RPM copy on one of the nodes and
> > built from source on the other node. Now when I do a lamboot -v
> > hostfile (hostfile has the names of the two machines), LAM is booted
> > on both nodes, but when I run an MPI program (e.g., mpirun -np 2
> > reduc), only one instance is started. This one is on the node where I
> > did the source install. The other node does not start one.
> > At the end of the mpirun I get the following error:
> > -----------------------------------------------------------------------
> > It seems that [at least] one of the processes that was started with
> > mpirun did not invoke MPI_INIT before quitting (it is possible that
> > more than one process did not invoke MPI_INIT -- mpirun was only
> > notified of the first one, which was on node n0).
> >
> > mpirun can *only* be used with MPI programs (i.e., programs that
> > invoke MPI_INIT and MPI_FINALIZE). You can use the "lamexec" program
> > to run non-MPI programs over the lambooted nodes.
> > -----------------------------------------------------------------------
> > Can someone help with this please.
> > Pravin
> >
> > _______________________________________________
> > This list is archived at http://www.lam-mpi.org/MailArchives/lam/