I am having a strange problem for which I could not find an answer on the
list or the web.
I just replaced a compute node in a cluster with a new machine. The cluster
is behind a head node (that does not compute). Jobs are run by logging into
one of the compute nodes, then changing directories to the executable
directory on the head node, followed by lambooting a machinefile and then
running mpirun.
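For reference, the launch sequence amounts to roughly the following. This is
a sketch rather than an exact script: WORKDIR is a placeholder for the
executable directory, and the host names are the ones from the three-node
example further down. The LAM commands only run if lamboot is actually
installed.

```shell
#!/bin/sh
# Sketch of the job-launch sequence described above.
# WORKDIR is a hypothetical stand-in for the executable directory;
# it falls back to a scratch directory if unset.
WORKDIR="${WORKDIR:-$(mktemp -d)}"
cd "$WORKDIR" || exit 1

# Boot schema (machinefile): one host per line.
cat > machinefile <<'EOF'
node08
node19
node22
EOF

# Boot LAM on the listed hosts, then start the job.
if command -v lamboot >/dev/null 2>&1; then
    lamboot -v machinefile && mpirun -np 3 a.out
else
    echo "lamboot not found; boot schema written to $WORKDIR/machinefile"
fi
```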
The problem is that while I can do mpirun from the machine in question on its
own, any run under a boot schema that also contains other nodes hangs. I must
mention that a boot schema with any combination of the other machines works
fine. It is only the new node (node22) that gives problems. Here is what I
get when I run code with a boot schema containing just node22.
node22.local.net 27: lamboot -v -d mac
LAM 6.5.9/MPI 2 C++/ROMIO - Indiana University
lamboot: boot schema file: mac
lamboot: opening hostfile mac
lamboot: found the following hosts:
lamboot: n0 node22
lamboot: resolved hosts:
lamboot: n0 node22 --> 192.168.0.22
lamboot: found 1 host node(s)
lamboot: origin node is 0 (node22)
Executing hboot on n0 (node22 - 2 CPUs)...
lamboot: attempting to execute "hboot -t -c lam-conf.lam -d -v -I " -H
192.168.0.22 -P 33629 -n 0 -o 0 ""
hboot: process schema = "/etc/lam/lam-conf.lam"
hboot: found /usr/bin/lamd
hboot: performing tkill
hboot: tkill
hboot: booting...
hboot: fork /usr/bin/lamd
hboot: attempting to execute
[1] 10912 lamd -H 192.168.0.22 -P 33629 -n 0 -o 0 -d
topology done
lamboot completed successfully
When I then do mpirun, I get:
node22.local.net 28: mpirun -np 4 a.out
hello world from processor 3
hello world from processor 0
hello world from processor 1
hello world from processor 2
However, a boot schema with node08, node19 and node22 followed by mpirun does
the following (node08 and node19 are just an example here; other nodes behave
the same way. It is only node22 that causes problems).
node08.local.net 31: lamboot -v -d machinefile
LAM 6.5.9/MPI 2 C++/ROMIO - Indiana University
lamboot: boot schema file: machinefile
lamboot: opening hostfile machinefile
lamboot: found the following hosts:
lamboot: n0 node08
lamboot: n1 node19
lamboot: n2 node22
lamboot: resolved hosts:
lamboot: n0 node08 --> 192.168.0.8
lamboot: n1 node19 --> 192.168.0.19
lamboot: n2 node22 --> 192.168.0.22
lamboot: found 3 host node(s)
lamboot: origin node is 0 (node08)
Executing hboot on n0 (node08 - 1 CPU)...
lamboot: attempting to execute "hboot -t -c lam-conf.lam -d -v -I " -H
192.168.0.8 -P 33101 -n 0 -o 0 ""
hboot: process schema = "/etc/lam/lam-conf.lam"
hboot: found /usr/bin/lamd
hboot: performing tkill
hboot: tkill
hboot: booting...
hboot: fork /usr/bin/lamd
hboot: attempting to execute
[1] 18080 lamd -H 192.168.0.8 -P 33101 -n 0 -o 0 -d
Executing hboot on n1 (node19 - 1 CPU)...
lamboot: attempting to execute "rsh node19 -n echo $SHELL"
lamboot: got remote shell /bin/tcsh
lamboot: attempting to execute "rsh node19 -n hboot -t -c lam-conf.lam -d -v
-s -I "-H 192.168.0.8 -P 33101 -n 1 -o 0 ""
hboot: process schema = "/etc/lam/lam-conf.lam"
hboot: found /usr/bin/lamd
hboot: performing tkill
hboot: tkill
hboot: booting...
hboot: fork /usr/bin/lamd
[1] 15483 lamd -H 192.168.0.8 -P 33101 -n 1 -o 0 -d
Executing hboot on n2 (node22 - 1 CPU)...
lamboot: attempting to execute "rsh node22 -n echo $SHELL"
lamboot: got remote shell /bin/tcsh
lamboot: attempting to execute "rsh node22 -n hboot -t -c lam-conf.lam -d -v
-s -I "-H 192.168.0.8 -P 33101 -n 2 -o 0 ""
hboot: process schema = "/etc/lam/lam-conf.lam"
hboot: found /usr/bin/lamd
hboot: performing tkill
hboot: tkill
hboot: booting...
hboot: fork /usr/bin/lamd
[1] 10965 lamd -H 192.168.0.8 -P 33101 -n 2 -o 0 -d
topology done
lamboot completed successfully
node08.local.net 33: mpirun -v -np 2 a.out
18088 a.out running on n0 (o)
15485 a.out running on n1
hello world from processor 0
hello world from processor 1
node08.local.net 34: mpirun -v -np 3 a.out
18090 a.out running on n0 (o)
15486 a.out running on n1
Suspended
node08.local.net 35:
I had to press Ctrl+Z to abort. I can rsh back and forth between the nodes,
and I can also tping before the run. Any ideas what's going wrong?