Thanks for the reply. I get the following output from lamboot -V on node08 and node22:
node08.local.net 29: lamboot -V
LAM 6.5.9/MPI 2 C++/ROMIO - Indiana University
Arch: i386-redhat-linux-gnu
RPI: usysv
node08.local.net 30: exit
rlogin: connection closed.
node22.local.net 28: lamboot -V
LAM 6.5.9/MPI 2 C++/ROMIO - Indiana University
Arch: i386-redhat-linux-gnu
RPI: usysv
When I run "rsh node22.local.net which lamboot", I get:
/usr/bin/lamboot
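
To compare every node at once rather than by eye, here is a minimal sketch. It assumes the text printed by "rsh <node> lamboot -V" has already been captured for each node; the function names and the dictionary layout are hypothetical helpers for illustration, not part of LAM.

```python
import re


def parse_lam_banner(text):
    """Extract (version, arch, rpi) from captured `lamboot -V` output.

    A field that cannot be found is returned as None.
    """
    version = re.search(r"^LAM\s+(\S+?)/", text, re.MULTILINE)
    arch = re.search(r"^\s*Arch:\s*(\S+)", text, re.MULTILINE)
    rpi = re.search(r"^\s*RPI:\s*(\S+)", text, re.MULTILINE)
    return tuple(m.group(1) if m else None for m in (version, arch, rpi))


def find_mismatches(outputs):
    """Return the nodes whose banner differs from the first node's banner.

    `outputs` maps a node name to the captured text for that node
    (hypothetical layout, filled in by whatever collects the rsh output).
    """
    parsed = {node: parse_lam_banner(text) for node, text in outputs.items()}
    if not parsed:
        return {}
    reference = next(iter(parsed.values()))
    return {node: info for node, info in parsed.items() if info != reference}
```

With the output shown above pasted in for node08 and node22, find_mismatches returns an empty dictionary, which matches what I see by hand: both nodes report the same version, architecture and RPI.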
On 8/1/06, Jeff Squyres <jsquyres_at_[hidden]> wrote:
>
> The first thing that I would check for is a version mismatch of LAM between
> your nodes. It looks like when you have interactive shells, you're using
> 6.5.9. Double check that non-interactive shells also get version 6.5.9
> (e.g., "rsh node22.local.net which lamboot").
>
>
> On 7/30/06 12:40 AM, "Zubair Anwar" <zubair.anwar_at_[hidden]> wrote:
>
> > I am having a strange problem to which I could not find an answer on the
> > list or the web.
> >
> > I just replaced a compute node in a cluster with a new machine. The cluster
> > is behind a head node (that does not compute). Jobs are run by logging into
> > one of the compute nodes, then changing directories to the executable
> > directory on the head node, followed by lambooting a machinefile and then
> > mpirun.
> >
> > The problem is that while I can do mpirun from the machine in question, any
> > boot schema that also contains other nodes hangs. I must mention that a boot
> > schema with any combination of the other machines works fine. It is only the
> > new node (node22) that gives problems. Here is what I get when I run code
> > with a boot schema containing just node22.
> >
> > node22.local.net 27: lamboot -v -d mac
> >
> > LAM 6.5.9/MPI 2 C++/ROMIO - Indiana University
> >
> > lamboot: boot schema file: mac
> > lamboot: opening hostfile mac
> > lamboot: found the following hosts:
> > lamboot: n0 node22
> > lamboot: resolved hosts:
> > lamboot: n0 node22 --> 192.168.0.22
> > lamboot: found 1 host node(s)
> > lamboot: origin node is 0 (node22)
> > Executing hboot on n0 (node22 - 2 CPUs)...
> > lamboot: attempting to execute "hboot -t -c lam-conf.lam -d -v -I " -H
> > 192.168.0.22 -P 33629 -n 0 -o 0 ""
> > hboot: process schema = "/etc/lam/lam-conf.lam"
> > hboot: found /usr/bin/lamd
> > hboot: performing tkill
> > hboot: tkill
> > hboot: booting...
> > hboot: fork /usr/bin/lamd
> > hboot: attempting to execute
> > [1] 10912 lamd -H 192.168.0.22 -P 33629 -n 0 -o 0 -d
> > topology done
> > lamboot completed successfully
> >
> > and when I do mpirun I get:
> >
> > node22.local.net 28: mpirun -np 4 a.out
> > hello world from processor 3
> > hello world from processor 0
> > hello world from processor 1
> > hello world from processor 2
> >
> > However, a boot schema with node08, node19 and node22 followed by mpirun does
> > the following (node08 and node19 are an example here; other nodes are fine
> > too; it is just node22 that causes problems).
> >
> > node08.local.net 31: lamboot -v -d machinefile
> >
> > LAM 6.5.9/MPI 2 C++/ROMIO - Indiana University
> >
> > lamboot: boot schema file: machinefile
> > lamboot: opening hostfile machinefile
> > lamboot: found the following hosts:
> > lamboot: n0 node08
> > lamboot: n1 node19
> > lamboot: n2 node22
> > lamboot: resolved hosts:
> > lamboot: n0 node08 --> 192.168.0.8
> > lamboot: n1 node19 --> 192.168.0.19
> > lamboot: n2 node22 --> 192.168.0.22
> > lamboot: found 3 host node(s)
> > lamboot: origin node is 0 (node08)
> > Executing hboot on n0 (node08 - 1 CPU)...
> > lamboot: attempting to execute "hboot -t -c lam-conf.lam -d -v -I " -H
> > 192.168.0.8 -P 33101 -n 0 -o 0 ""
> > hboot: process schema = "/etc/lam/lam-conf.lam"
> > hboot: found /usr/bin/lamd
> > hboot: performing tkill
> > hboot: tkill
> > hboot: booting...
> > hboot: fork /usr/bin/lamd
> > hboot: attempting to execute
> > [1] 18080 lamd -H 192.168.0.8 -P 33101 -n 0 -o 0 -d
> > Executing hboot on n1 (node19 - 1 CPU)...
> > lamboot: attempting to execute "rsh node19 -n echo $SHELL"
> > lamboot: got remote shell /bin/tcsh
> > lamboot: attempting to execute "rsh node19 -n hboot -t -c lam-conf.lam-d -v
> > -s -I "-H 192.168.0.8 -P 33101 -n 1 -o 0 ""
> > hboot: process schema = "/etc/lam/lam-conf.lam"
> > hboot: found /usr/bin/lamd
> > hboot: performing tkill
> > hboot: tkill
> > hboot: booting...
> > hboot: fork /usr/bin/lamd
> > [1] 15483 lamd -H 192.168.0.8 -P 33101 -n 1 -o 0 -d
> > Executing hboot on n2 (node22 - 1 CPU)...
> > lamboot: attempting to execute "rsh node22 -n echo $SHELL"
> > lamboot: got remote shell /bin/tcsh
> > lamboot: attempting to execute "rsh node22 -n hboot -t -c lam-conf.lam-d -v
> > -s -I "-H 192.168.0.8 -P 33101 -n 2 -o 0 ""
> > hboot: process schema = "/etc/lam/lam-conf.lam"
> > hboot: found /usr/bin/lamd
> > hboot: performing tkill
> > hboot: tkill
> > hboot: booting...
> > hboot: fork /usr/bin/lamd
> > [1] 10965 lamd -H 192.168.0.8 -P 33101 -n 2 -o 0 -d
> > topology done
> > lamboot completed successfully
> > node08.local.net 33: mpirun -v -np 2 a.out
> > 18088 a.out running on n0 (o)
> > 15485 a.out running on n1
> > hello world from processor 0
> > hello world from processor 1
> > node08.local.net 34: mpirun -v -np 3 a.out
> > 18090 a.out running on n0 (o)
> > 15486 a.out running on n1
> >
> > Suspended
> > node08.local.net 35:
> >
> > I had to press Ctrl+Z to suspend the hung job. I can rsh back and forth and
> > also do tping before the run. Any ideas what's going wrong?
> > _______________________________________________
> > This list is archived at http://www.lam-mpi.org/MailArchives/lam/
>
>
> --
> Jeff Squyres
> Server Virtualization Business Unit
> Cisco Systems