On 8/2/06, Jeff Squyres <jsquyres_at_[hidden]> wrote:
>
> Is /usr/bin/lamboot your 6.5.9 installation?
Yes, it is my installation.
> The next thing to check is to see if the a.out that is found on all nodes
> was compiled by the same version of LAM/MPI. Do you have a networked
> filesystem? If so, this is probably a moot point, but if not, ensure that
> a.out matches across all nodes.
I have a networked filesystem and a.out was compiled using the same version
of LAM as that on all nodes.
> When you mpirun across multiple nodes (including the problematic node), do
> you see a.out in the process table on node 22? Can you verify that LAM
> thinks that it launched on node 22 by using "mpirun -v"? Can you lamexec
> non-MPI applications across multiple nodes (including 22), such as
> "lamexec N hostname"?
Here is the output from lamboot and mpirun.
node08.local.net 31: lamboot -v machinefile
LAM 6.5.9/MPI 2 C++/ROMIO - Indiana University
Executing hboot on n0 (node08 - 1 CPU)...
Executing hboot on n1 (node19 - 1 CPU)...
Executing hboot on n2 (node22 - 1 CPU)...
topology done
node08.local.net 32: mpirun -v -np 3 a.out
29973 a.out running on n0 (o)
27169 a.out running on n1
Suspended
I do not see a.out in the process table on node22, but I do see lamd on
node22.
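To make the "which node never started" check repeatable, here is a minimal sketch that scans saved "mpirun -v" output for the "<pid> <program> running on n<k>" lines shown above and reports the node numbers that never appear. The function name missing_nodes and the assumption that -np 3 places one process on each of n0, n1 and n2 are mine, not something LAM guarantees.

```python
import re

# Matches lines such as "27169 a.out running on n1" or
# "29973 a.out running on n0 (o)" from "mpirun -v".
RUNNING = re.compile(r"^\s*(\d+)\s+(\S+)\s+running on n(\d+)")


def missing_nodes(mpirun_output, expected_nodes):
    """Return the expected node numbers that never reported a started process."""
    seen = set()
    for line in mpirun_output.splitlines():
        match = RUNNING.match(line)
        if match:
            seen.add(int(match.group(3)))
    return sorted(set(expected_nodes) - seen)


# Output copied from the run above; n2 is node22 in this boot schema.
sample = """\
29973 a.out running on n0 (o)
27169 a.out running on n1
Suspended
"""

print(missing_nodes(sample, [0, 1, 2]))  # prints [2]
```

For the run above this singles out n2, which matches what the process table on node22 shows.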
I am also unable to lamexec a simple non-MPI hello world program across
multiple nodes if the application schema contains node22.
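Since the symptom is a hang rather than an error message, one way to narrow it down is to probe each node separately with a timeout. The sketch below is only an illustration: probe_node and failing_nodes are hypothetical helper names, and it assumes that "lamexec n<k> hostname" is an acceptable way to address a single node in the booted LAM universe.

```python
import subprocess


def probe_node(node, timeout=10.0, run=subprocess.run):
    """Return True if "lamexec n<node> hostname" exits cleanly within the timeout.

    Assumption: a single LAM node can be addressed as "n<node>".
    """
    cmd = ["lamexec", f"n{node}", "hostname"]
    try:
        result = run(cmd, capture_output=True, text=True, timeout=timeout)
    except subprocess.TimeoutExpired:
        return False  # no answer in time, treated as a hang
    except OSError:
        return False  # lamexec missing or not executable on this machine
    return result.returncode == 0


def failing_nodes(nodes, probe=probe_node):
    """Return the node numbers for which the probe did not succeed."""
    return [node for node in nodes if not probe(node)]
```

Calling failing_nodes([0, 1, 2]) after lamboot would list the nodes that did not answer. The timeout only stops the local lamexec, so anything left behind on the remote side may still need lamclean or lamhalt.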
>
>
> On 8/2/06 12:37 AM, "Zubair Anwar" <zubair.anwar_at_[hidden]> wrote:
>
> > Thanks for the reply. I get the following output from lamboot -V
> >
> > node08.local.net 29: lamboot -V
> >
> > LAM 6.5.9/MPI 2 C++/ROMIO - Indiana University
> >
> > Arch: i386-redhat-linux-gnu
> > RPI: usysv
> > node08.local.net 30: exit
> > rlogin: connection closed.
> > node22.local.net 28: lamboot -V
> >
> > LAM 6.5.9/MPI 2 C++/ROMIO - Indiana University
> >
> > Arch: i386-redhat-linux-gnu
> > RPI: usysv
> >
> > When I do "rsh node22.local.net which lamboot" I get
> > /usr/bin/lamboot
> >
> >
> >
> > On 8/1/06, Jeff Squyres <jsquyres_at_[hidden]> wrote:
> >>
> >> The first thing that I would check for is a version mismatch of LAM between
> >> your nodes. It looks like when you have interactive shells, you're using
> >> 6.5.9. Double check that non-interactive shells also get version 6.5.9
> >> (e.g., "rsh node22.local.net which lamboot").
> >>
> >>
> >> On 7/30/06 12:40 AM, "Zubair Anwar" <zubair.anwar_at_[hidden]> wrote:
> >>
> >>> I am having a strange problem to which I could not find an answer on the
> >>> list or the web.
> >>>
> >>> I just replaced a compute node in a cluster with a new machine. The cluster
> >>> is behind a head node (that does not compute). Jobs are run by logging into
> >>> one of the compute nodes, then changing directories to the executable
> >>> directory on the head node, followed by lambooting a machinefile and then
> >>> mpirun.
> >>>
> >>> The problem is that while I can do mpirun from the machine in question, any
> >>> boot schema that also contains other nodes hangs. I must mention that a boot
> >>> schema with any combination of the other machines works fine. It is only the
> >>> new node (node22) that gives problems. Here is what I get when I run code
> >>> with a boot schema containing just node22.
> >>>
> >>> node22.local.net 27: lamboot -v -d mac
> >>>
> >>> LAM 6.5.9/MPI 2 C++/ROMIO - Indiana University
> >>>
> >>> lamboot: boot schema file: mac
> >>> lamboot: opening hostfile mac
> >>> lamboot: found the following hosts:
> >>> lamboot: n0 node22
> >>> lamboot: resolved hosts:
> >>> lamboot: n0 node22 --> 192.168.0.22
> >>> lamboot: found 1 host node(s)
> >>> lamboot: origin node is 0 (node22)
> >>> Executing hboot on n0 (node22 - 2 CPUs)...
> >>> lamboot: attempting to execute "hboot -t -c lam-conf.lam -d -v -I " -H
> >>> 192.168.0.22 -P 33629 -n 0 -o 0 ""
> >>> hboot: process schema = "/etc/lam/lam-conf.lam"
> >>> hboot: found /usr/bin/lamd
> >>> hboot: performing tkill
> >>> hboot: tkill
> >>> hboot: booting...
> >>> hboot: fork /usr/bin/lamd
> >>> hboot: attempting to execute
> >>> [1] 10912 lamd -H 192.168.0.22 -P 33629 -n 0 -o 0 -d
> >>> topology done
> >>> lamboot completed successfully
> >>>
> >>> and when I do mpirun I get:
> >>>
> >>> node22.local.net 28: mpirun -np 4 a.out
> >>> hello world from processor 3
> >>> hello world from processor 0
> >>> hello world from processor 1
> >>> hello world from processor 2
> >>>
> >>> However, a boot schema with node08, node19 and node22 followed by mpirun does
> >>> the following (node08 and node19 are an example here; other nodes are fine
> >>> too. It is just node22 that causes problems).
> >>>
> >>> node08.local.net 31: lamboot -v -d machinefile
> >>>
> >>> LAM 6.5.9/MPI 2 C++/ROMIO - Indiana University
> >>>
> >>> lamboot: boot schema file: machinefile
> >>> lamboot: opening hostfile machinefile
> >>> lamboot: found the following hosts:
> >>> lamboot: n0 node08
> >>> lamboot: n1 node19
> >>> lamboot: n2 node22
> >>> lamboot: resolved hosts:
> >>> lamboot: n0 node08 --> 192.168.0.8
> >>> lamboot: n1 node19 --> 192.168.0.19
> >>> lamboot: n2 node22 --> 192.168.0.22
> >>> lamboot: found 3 host node(s)
> >>> lamboot: origin node is 0 (node08)
> >>> Executing hboot on n0 (node08 - 1 CPU)...
> >>> lamboot: attempting to execute "hboot -t -c lam-conf.lam -d -v -I " -H
> >>> 192.168.0.8 -P 33101 -n 0 -o 0 ""
> >>> hboot: process schema = "/etc/lam/lam-conf.lam"
> >>> hboot: found /usr/bin/lamd
> >>> hboot: performing tkill
> >>> hboot: tkill
> >>> hboot: booting...
> >>> hboot: fork /usr/bin/lamd
> >>> hboot: attempting to execute
> >>> [1] 18080 lamd -H 192.168.0.8 -P 33101 -n 0 -o 0 -d
> >>> Executing hboot on n1 (node19 - 1 CPU)...
> >>> lamboot: attempting to execute "rsh node19 -n echo $SHELL"
> >>> lamboot: got remote shell /bin/tcsh
> >>> lamboot: attempting to execute "rsh node19 -n hboot -t -c lam-conf.lam -d -v
> >>> -s -I "-H 192.168.0.8 -P 33101 -n 1 -o 0 ""
> >>> hboot: process schema = "/etc/lam/lam-conf.lam"
> >>> hboot: found /usr/bin/lamd
> >>> hboot: performing tkill
> >>> hboot: tkill
> >>> hboot: booting...
> >>> hboot: fork /usr/bin/lamd
> >>> [1] 15483 lamd -H 192.168.0.8 -P 33101 -n 1 -o 0 -d
> >>> Executing hboot on n2 (node22 - 1 CPU)...
> >>> lamboot: attempting to execute "rsh node22 -n echo $SHELL"
> >>> lamboot: got remote shell /bin/tcsh
> >>> lamboot: attempting to execute "rsh node22 -n hboot -t -c lam-conf.lam -d -v
> >>> -s -I "-H 192.168.0.8 -P 33101 -n 2 -o 0 ""
> >>> hboot: process schema = "/etc/lam/lam-conf.lam"
> >>> hboot: found /usr/bin/lamd
> >>> hboot: performing tkill
> >>> hboot: tkill
> >>> hboot: booting...
> >>> hboot: fork /usr/bin/lamd
> >>> [1] 10965 lamd -H 192.168.0.8 -P 33101 -n 2 -o 0 -d
> >>> topology done
> >>> lamboot completed successfully
> >>> node08.local.net 33: mpirun -v -np 2 a.out
> >>> 18088 a.out running on n0 (o)
> >>> 15485 a.out running on n1
> >>> hello world from processor 0
> >>> hello world from processor 1
> >>> node08.local.net 34: mpirun -v -np 3 a.out
> >>> 18090 a.out running on n0 (o)
> >>> 15486 a.out running on n1
> >>>
> >>> Suspended
> >>> node08.local.net 35:
> >>>
> >>> I had to do Ctrl+Z to abort. I can rsh back and forth and also do tping
> >>> before the run. Any ideas what's going wrong?
> >>> _______________________________________________
> >>> This list is archived at http://www.lam-mpi.org/MailArchives/lam/
> >>
> >>
> >> --
> >> Jeff Squyres
> >> Server Virtualization Business Unit
> >> Cisco Systems
> >>
>
>
>