The first thing I would check is a LAM version mismatch between your
nodes. It looks like your interactive shells are using 6.5.9. Double-check
that non-interactive shells also get version 6.5.9 (e.g., "rsh
node22.local.net which lamboot").
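
A minimal sketch of that check across several nodes, assuming rsh works without a password between them. The function names (check_lamboot_paths, paths_agree) and the REMOTE_SH variable are made up for illustration; they are not part of LAM. It only compares the path each non-interactive shell resolves for lamboot, so matching paths are a hint, not proof, that the versions match.

```shell
#!/bin/sh
# Hypothetical helper: which lamboot does a NON-interactive shell find
# on each node? REMOTE_SH is an assumed knob so ssh can be swapped in.
REMOTE_SH="${REMOTE_SH:-rsh}"

check_lamboot_paths() {
    # Prints one "host path" line per host named on the command line.
    for host in "$@"; do
        # -n keeps the remote shell from reading our stdin, the same
        # way lamboot invokes rsh in the log below.
        path=$($REMOTE_SH "$host" -n which lamboot 2>/dev/null)
        echo "$host ${path:-NOT_FOUND}"
    done
}

paths_agree() {
    # Reads "host path" lines on stdin; succeeds only if every host
    # reported the same path.
    distinct=$(awk '{print $2}' | sort -u | wc -l | tr -d ' ')
    [ "$distinct" -le 1 ]
}
```

For example, "check_lamboot_paths node08 node19 node22" prints one line per node, and piping that output into paths_agree returns non-zero if any node resolves a different lamboot (or none at all). A node whose .cshrc only sets the path for interactive shells would show up here as NOT_FOUND or as a different path.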

On 7/30/06 12:40 AM, "Zubair Anwar" <zubair.anwar_at_[hidden]> wrote:
> I am having a strange problem to which I could not find an answer on the list
> or the web.
>
> I just replaced a compute node in a cluster with a new machine. The cluster
> is behind a head node (that does not compute). Jobs are run by logging into
> one of the compute nodes, then changing directories to the executable
> directory on the head node, followed by lambooting a machinefile and then
> mpirun.
>
> The problem is that while I can do mpirun from the machine in question, any
> boot schema that contains other nodes hangs. I must mention that a boot
> schema with any combination of the other machines works fine. It is only the
> new node (node22) that gives problems. Here is what I get when I run code
> with a boot schema containing just node22.
>
> node22.local.net 27: lamboot -v -d mac
>
> LAM 6.5.9/MPI 2 C++/ROMIO - Indiana University
>
> lamboot: boot schema file: mac
> lamboot: opening hostfile mac
> lamboot: found the following hosts:
> lamboot: n0 node22
> lamboot: resolved hosts:
> lamboot: n0 node22 --> 192.168.0.22
> lamboot: found 1 host node(s)
> lamboot: origin node is 0 (node22)
> Executing hboot on n0 (node22 - 2 CPUs)...
> lamboot: attempting to execute "hboot -t -c lam-conf.lam -d -v -I " -H
> 192.168.0.22 -P 33629 -n 0 -o 0 ""
> hboot: process schema = "/etc/lam/lam-conf.lam"
> hboot: found /usr/bin/lamd
> hboot: performing tkill
> hboot: tkill
> hboot: booting...
> hboot: fork /usr/bin/lamd
> hboot: attempting to execute
> [1] 10912 lamd -H 192.168.0.22 -P 33629 -n 0 -o 0 -d
> topology done
> lamboot completed successfully
>
> and when I do mpirun I get:
>
> node22.local.net 28: mpirun -np 4 a.out
> hello world from processor 3
> hello world from processor 0
> hello world from processor 1
> hello world from processor 2
>
> However, a boot schema with node08, node19 and node22 followed by mpirun does
> the following (node08 and node19 are an example here; other nodes are fine
> too. It is just node22 that causes problems).
>
> node08.local.net 31: lamboot -v -d machinefile
>
> LAM 6.5.9/MPI 2 C++/ROMIO - Indiana University
>
> lamboot: boot schema file: machinefile
> lamboot: opening hostfile machinefile
> lamboot: found the following hosts:
> lamboot: n0 node08
> lamboot: n1 node19
> lamboot: n2 node22
> lamboot: resolved hosts:
> lamboot: n0 node08 --> 192.168.0.8
> lamboot: n1 node19 --> 192.168.0.19
> lamboot: n2 node22 --> 192.168.0.22
> lamboot: found 3 host node(s)
> lamboot: origin node is 0 (node08)
> Executing hboot on n0 (node08 - 1 CPU)...
> lamboot: attempting to execute "hboot -t -c lam-conf.lam -d -v -I " -H
> 192.168.0.8 -P 33101 -n 0 -o 0 ""
> hboot: process schema = "/etc/lam/lam-conf.lam"
> hboot: found /usr/bin/lamd
> hboot: performing tkill
> hboot: tkill
> hboot: booting...
> hboot: fork /usr/bin/lamd
> hboot: attempting to execute
> [1] 18080 lamd -H 192.168.0.8 -P 33101 -n 0 -o 0 -d
> Executing hboot on n1 (node19 - 1 CPU)...
> lamboot: attempting to execute "rsh node19 -n echo $SHELL"
> lamboot: got remote shell /bin/tcsh
> lamboot: attempting to execute "rsh node19 -n hboot -t -c lam-conf.lam -d -v
> -s -I "-H 192.168.0.8 -P 33101 -n 1 -o 0 ""
> hboot: process schema = "/etc/lam/lam-conf.lam"
> hboot: found /usr/bin/lamd
> hboot: performing tkill
> hboot: tkill
> hboot: booting...
> hboot: fork /usr/bin/lamd
> [1] 15483 lamd -H 192.168.0.8 -P 33101 -n 1 -o 0 -d
> Executing hboot on n2 (node22 - 1 CPU)...
> lamboot: attempting to execute "rsh node22 -n echo $SHELL"
> lamboot: got remote shell /bin/tcsh
> lamboot: attempting to execute "rsh node22 -n hboot -t -c lam-conf.lam -d -v
> -s -I "-H 192.168.0.8 -P 33101 -n 2 -o 0 ""
> hboot: process schema = "/etc/lam/lam-conf.lam"
> hboot: found /usr/bin/lamd
> hboot: performing tkill
> hboot: tkill
> hboot: booting...
> hboot: fork /usr/bin/lamd
> [1] 10965 lamd -H 192.168.0.8 -P 33101 -n 2 -o 0 -d
> topology done
> lamboot completed successfully
> node08.local.net 33: mpirun -v -np 2 a.out
> 18088 a.out running on n0 (o)
> 15485 a.out running on n1
> hello world from processor 0
> hello world from processor 1
> node08.local.net 34: mpirun -v -np 3 a.out
> 18090 a.out running on n0 (o)
> 15486 a.out running on n1
>
> Suspended
> node08.local.net 35:
>
> I had to do Ctrl+Z to abort. I can rsh back and forth and also do tping
> before the run. Any ideas what's going wrong?
> _______________________________________________
> This list is archived at http://www.lam-mpi.org/MailArchives/lam/
--
Jeff Squyres
Server Virtualization Business Unit
Cisco Systems