On Mon, Aug 04, 2003 at 06:05:10PM -0400, Jeff Squyres wrote:
> This is the first odd thing -- the remote shell should not be "hboot".
> Looks like a bug in our error message. :-(
>
It seemed strange to me at first, but this is my first try at an MPI
implementation; I've only used PVM so far.
> > So, as mentioned, I tried running that by hand:
> > pvm_at_darkstar:~/lam/etc$ ssh -x zeus -n hboot -t -c lam-conf.lamd -s -I "-H
> > 192.168.0.10 -P 41831 -n 2 -o 0"
> > pvm_at_darkstar:~/lam/etc$
>
> This is odd -- I would not expect hboot to finish properly here. The -P
> argument specifies a TCP port number that lamboot is listening on, waiting
> for the lamd to call back on. Hence, when lamboot dies, that port closes,
> and if you try to run it again, hboot/lamd should fail because it can't
> connect to that port.
>
I thought it worked that way, so I was surprised to see this
functioning too, which is why I performed the next step (printing the error
code).
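Roughly what I had in mind is something like the following (just a sketch;
41831 was the port from that earlier, already-finished lamboot, and netstat
options may differ slightly between Linux and Solaris):
  ssh -x zeus -n hboot -t -c lam-conf.lamd -s -I "-H 192.168.0.10 -P 41831 -n 2 -o 0"
  echo "hboot exit status: $?"
  # confirm nothing is listening on the old callback port anymore
  netstat -an | grep 41831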
> No, as long as you have executables setup properly in the $PATH for each
> machine/architecture/OS/whatever, you should be ok. If you have 7.0 on
> all your machines (regardless of arch/OS/etc.), they should interoperate
> properly.
>
All nodes run the same version. I first set up one home directory containing
all the sources, synced it across my nodes, and then built it on each
node.
pvm_at_darkstar:~$ echo $PATH
/home/pvm/bin:/usr/local/bin:/usr/bin:/bin:/usr/bin/X11:/usr/games:/home/pvm/bin:/home/pvm/lam/bin:/home/pvm/pvm3/lib:/usr/local/bin:/usr/ccs/bin:/home/pvm/pvm3/bin/X86SOL2:/home/pvm/pvm3/lib
pvm_at_darkstar:~$
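As a quick local sanity check (a sketch) that this $PATH actually resolves
the LAM binaries on darkstar:
  # the interactive shell should find these in /home/pvm/lam/bin
  which lamboot
  which hboot
  which lamd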
> Common problems here include the following:
>
> - firewalling/port blocking between the machines
It's on a private LAN without any filter rules for the cluster.
> - not able to find hboot in your path on the remote machine (I didn't see
> an explicit path entry for LAM in your .ssh/environment file; but I
> don't know if you installed it in one of the "common" directories...?)
I've compiled it on every node with --prefix=/home/pvm/lam and have set my
$PATH accordingly.
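To rule out the classic non-interactive-shell problem, a check along these
lines (a sketch, run from darkstar) should show the same $PATH and find
hboot without an interactive login:
  # $PATH as seen by a non-interactive shell on the remote node
  ssh -x zeus -n 'echo $PATH'
  # hboot and lamd must be found in that non-interactive $PATH
  ssh -x zeus -n 'which hboot; which lamd'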
> - incorrect IP resolution (is 192.168.0.10 the right IP address for the
> host that you're lambooting from?)
>
It's a private LAN; zeus also acts as my DNS/DHCP server for other things, but
all hosts in the cluster resolve properly:
pvm_at_darkstar:~$ host darkstar
darkstar.thuis has address 192.168.0.10
pvm_at_darkstar:~$ host 192.168.0.10
10.0.168.192.in-addr.arpa domain name pointer darkstar.thuis.
pvm_at_darkstar:~$ host zeus
zeus.thuis has address 192.168.0.3
pvm_at_darkstar:~$ host 192.168.0.3
3.0.168.192.in-addr.arpa domain name pointer zeus.thuis.
pvm_at_darkstar:~$ host sauron
sauron.thuis has address 192.168.0.2
pvm_at_darkstar:~$ host 192.168.0.2
2.0.168.192.in-addr.arpa domain name pointer sauron.thuis.
pvm_at_darkstar:~$
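Resolution in the other direction (the slave nodes resolving the origin they
have to call back to) can be checked the same way; a sketch (host(1) may not
be installed on the Solaris box, in which case getent hosts does the same):
  # each slave must resolve the origin (darkstar / 192.168.0.10) correctly
  ssh -x sauron -n 'host darkstar; host 192.168.0.10'
  ssh -x zeus -n 'host darkstar; host 192.168.0.10'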
> Can you send the full lamboot -d output?
>
Sure:
Script started on Tue Aug 5 20:13:39 2003
pvm_at_darkstar:~$ lamboot -d
n0<20691> ssi:boot: Opening
n0<20691> ssi:boot: opening module globus
n0<20691> ssi:boot: initializing module globus
n0<20691> ssi:boot:globus: globus-job-run not found, globus boot will not run
n0<20691> ssi:boot: module not available: globus
n0<20691> ssi:boot: opening module rsh
n0<20691> ssi:boot: initializing module rsh
n0<20691> ssi:boot:rsh: module initializing
n0<20691> ssi:boot:rsh:agent: ssh -x
n0<20691> ssi:boot:rsh:username: <same>
n0<20691> ssi:boot:rsh:verbose: 1000
n0<20691> ssi:boot:rsh:algorithm: linear
n0<20691> ssi:boot:rsh:priority: 10
n0<20691> ssi:boot: module available: rsh, priority: 10
n0<20691> ssi:boot: finalizing module globus
n0<20691> ssi:boot:globus: finalizing
n0<20691> ssi:boot: closing module globus
n0<20691> ssi:boot: Selected boot module rsh
LAM 7.0/MPI 2 C++/ROMIO - Indiana University
n0<20691> ssi:boot:base: looking for boot schema in following directories:
n0<20691> ssi:boot:base: <current directory>
n0<20691> ssi:boot:base: $TROLLIUSHOME/etc
n0<20691> ssi:boot:base: $LAMHOME/etc
n0<20691> ssi:boot:base: /home/pvm/lam/etc
n0<20691> ssi:boot:base: looking for boot schema file:
n0<20691> ssi:boot:base: lam-bhost.def
n0<20691> ssi:boot:base: found boot schema: /home/pvm/lam/etc/lam-bhost.def
n0<20691> ssi:boot:rsh: found the following hosts:
n0<20691> ssi:boot:rsh: n0 darkstar (cpu=2)
n0<20691> ssi:boot:rsh: n1 sauron (cpu=1)
n0<20691> ssi:boot:rsh: n2 zeus (cpu=1)
n0<20691> ssi:boot:rsh: resolved hosts:
n0<20691> ssi:boot:rsh: n0 darkstar --> 192.168.0.10 (origin)
n0<20691> ssi:boot:rsh: n1 sauron --> 192.168.0.2
n0<20691> ssi:boot:rsh: n2 zeus --> 192.168.0.3
n0<20691> ssi:boot:rsh: starting RTE procs
n0<20691> ssi:boot:base:linear: starting
n0<20691> ssi:boot:base:server: opening server TCP socket
n0<20691> ssi:boot:base:server: opened port 33867
n0<20691> ssi:boot:base:linear: booting n0 (darkstar)
n0<20691> ssi:boot:rsh: starting lamd on (darkstar)
n0<20691> ssi:boot:rsh: starting on n0 (darkstar): hboot -t -c lam-conf.lamd -d -I -H 192.168.0.10 -P 33867 -n 0 -o 0
n0<20691> ssi:boot:rsh: launching locally
hboot: performing tkill
hboot: tkill -d
tkill: setting prefix to (null)
tkill: setting suffix to (null)
tkill: got killname back: /tmp/lam-pvm_at_darkstar/lam-killfile
tkill: removing socket file ...
tkill: socket file: /tmp/lam-pvm_at_darkstar/lam-kernel-socketd
tkill: removing IO daemon socket file ...
tkill: IO daemon socket file: /tmp/lam-pvm_at_darkstar/lam-io-socket
tkill: f_kill = "/tmp/lam-pvm_at_darkstar/lam-killfile"
tkill: nothing to kill: "/tmp/lam-pvm_at_darkstar/lam-killfile"
hboot: booting...
hboot: fork /home/pvm/lam/bin/lamd
[1] 20694 lamd -H 192.168.0.10 -P 33867 -n 0 -o 0 -d
n0<20691> ssi:boot:rsh: successfully launched on n0 (darkstar)
n0<20691> ssi:boot:base:server: expecting connection from finite list
hboot: attempting to execute
n-1<20694> ssi:boot: Opening
n-1<20694> ssi:boot: opening module globus
n-1<20694> ssi:boot: initializing module globus
n-1<20694> ssi:boot:globus: globus-job-run not found, globus boot will not run
n-1<20694> ssi:boot: module not available: globus
n-1<20694> ssi:boot: opening module rsh
n-1<20694> ssi:boot: initializing module rsh
n-1<20694> ssi:boot:rsh: module initializing
n-1<20694> ssi:boot:rsh:agent: ssh -x
n-1<20694> ssi:boot:rsh:username: <same>
n-1<20694> ssi:boot:rsh:verbose: 1000
n-1<20694> ssi:boot:rsh:algorithm: linear
n-1<20694> ssi:boot:rsh:priority: 10
n-1<20694> ssi:boot: module available: rsh, priority: 10
n-1<20694> ssi:boot: finalizing module globus
n-1<20694> ssi:boot:globus: finalizing
n-1<20694> ssi:boot: closing module globus
n-1<20694> ssi:boot: Selected boot module rsh
n0<20691> ssi:boot:base:server: got connection from 192.168.0.10
n0<20691> ssi:boot:base:server: this connection is expected (n0)
n0<20691> ssi:boot:base:server: remote lamd is at 192.168.0.10:2688
n0<20691> ssi:boot:base:linear: booting n1 (sauron)
n0<20691> ssi:boot:rsh: starting lamd on (sauron)
n0<20691> ssi:boot:rsh: starting on n1 (sauron): hboot -t -c lam-conf.lamd -d -s -I "-H 192.168.0.10 -P 33867 -n 1 -o 0"
n0<20691> ssi:boot:rsh: launching remotely
n0<20691> ssi:boot:rsh: attempting to execute "ssh -x sauron -n echo $SHELL"
n0<20691> ssi:boot:rsh: remote shell /bin/bash
n0<20691> ssi:boot:rsh: attempting to execute "ssh -x sauron -n hboot -t -c
lam-conf.lamd -d -s -I "-H 192.168.0.10 -P 33867 -n 1 -o 0""
tkill: setting prefix to (null)
tkill: setting suffix to (null)
tkill: got killname back: /tmp/lam-pvm_at_sauron/lam-killfile
tkill: removing socket file ...
tkill: socket file: /tmp/lam-pvm_at_sauron/lam-kernel-socketd
tkill: removing IO daemon socket file ...
tkill: IO daemon socket file: /tmp/lam-pvm_at_sauron/lam-io-socket
tkill: f_kill = "/tmp/lam-pvm_at_sauron/lam-killfile"
tkill: nothing to kill: "/tmp/lam-pvm_at_sauron/lam-killfile"
hboot: performing tkill
hboot: tkill -d
hboot: booting...
hboot: fork /home/pvm/lam/bin/lamd
[1] 8708 lamd -H 192.168.0.10 -P 33867 -n 1 -o 0 -d
n0<20691> ssi:boot:rsh: successfully launched on n1 (sauron)
n0<20691> ssi:boot:base:server: expecting connection from finite list
n0<20691> ssi:boot:base:server: got connection from 192.168.0.2
n0<20691> ssi:boot:base:server: this connection is expected (n1)
n0<20691> ssi:boot:base:server: remote lamd is at 192.168.0.2:20740
n0<20691> ssi:boot:base:linear: booting n2 (zeus)
n0<20691> ssi:boot:rsh: starting lamd on (zeus)
n0<20691> ssi:boot:rsh: starting on n2 (zeus): hboot -t -c lam-conf.lamd -d -s -I "-H 192.168.0.10 -P 33867 -n 2 -o 0"
n0<20691> ssi:boot:rsh: launching remotely
n0<20691> ssi:boot:rsh: attempting to execute "ssh -x zeus -n echo $SHELL"
n0<20691> ssi:boot:rsh: remote shell /usr/bin/bash
n0<20691> ssi:boot:rsh: attempting to execute "ssh -x zeus -n hboot -t -c
lam-conf.lamd -d -s -I "-H 192.168.0.10 -P 33867 -n 2 -o 0""
tkill: setting prefix to (null)
tkill: setting suffix to (null)
tkill: got killname back: /tmp/lam-pvm_at_zeus/lam-killfile
tkill: removing socket file ...
tkill: socket file: /tmp/lam-pvm_at_zeus/lam-kernel-socketd
tkill: removing IO daemon socket file ...
tkill: IO daemon socket file: /tmp/lam-pvm_at_zeus/lam-io-socket
tkill: f_kill = "/tmp/lam-pvm_at_zeus/lam-killfile"
tkill: nothing to kill: "/tmp/lam-pvm_at_zeus/lam-killfile"
hboot: performing tkill
hboot: tkill -d
hboot: booting...
hboot: fork /home/pvm/lam/bin/lamd
[1] 13688 lamd -H 192.168.0.10 -P 33867 -n 2 -o 0 -d
-----------------------------------------------------------------------------
LAM failed to execute a LAM binary on the remote node "zeus".
Since LAM was already able to determine your remote shell as "hboot",
it is probable that this is not an authentication problem.
LAM tried to use the remote agent command "ssh"
to invoke the following command:
ssh -x zeus -n hboot -t -c lam-conf.lamd -d -s -I "-H 192.168.0.10 -P 33867 -n 2 -o 0"
This can indicate several things. You should check the following:
- The LAM binaries are in your $PATH
- You can run the LAM binaries
- The $PATH variable is set properly before your .cshrc/.profile exits
Try to invoke the command listed above manually at a Unix prompt.
You will need to configure your local setup such that you will *not*
be prompted for a password to invoke this command on the remote node.
No output should be printed from the remote node before the output of
the command is displayed.
When you can get this command to execute successfully by hand, LAM
will probably be able to function properly.
-----------------------------------------------------------------------------
n0<20691> ssi:boot:base:linear: Failed to boot n2 (zeus)
n0<20691> ssi:boot:base:server: closing server socket
n0<20691> ssi:boot:base:linear: aborted!
-----------------------------------------------------------------------------
lamboot encountered some error (see above) during the boot process,
and will now attempt to kill all nodes that it was previously able to
boot (if any).
Please wait for LAM to finish; if you interrupt this process, you may
have LAM daemons still running on remote nodes.
-----------------------------------------------------------------------------
[snipped closing down]
One thing I find strange in the above output is the following line:
[1] 13688 lamd -H 192.168.0.10 -P 33867 -n 2 -o 0 -d
Doesn't that show a correctly spawned lamd?
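A follow-up check I still want to do (a sketch) is whether that lamd on zeus
actually stays alive once hboot returns, or exits immediately:
  # list any lamd processes left on zeus right after the failed lamboot
  ssh -x zeus -n 'ps -ef | grep lam[d]'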
I've been using PVM for some time now, so I'm not in a hurry to get LAM
booting properly, but I'm thinking of using a couple of dozen machines
in the near future, and it would be nice to have several implementations
running on that cluster so the few users can use their preferred API.
(I won't mix hardware/software on that one ;-)
I suspect Solaris is holding something back from me, but so far I've had no
luck figuring out what exactly. Everything that's needed is in $PATH, and for
some things I also set $LD_LIBRARY_PATH. Both work correctly when I run them by hand
after logging into the machine, and whenever I use ssh -x from any node.
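If the non-interactive shell turns out to be the culprit, I could also put
both variables in ~/.ssh/environment on each node, as you hinted. A sketch of
what that file would contain (the lib path is assumed from my --prefix, and
sshd has to be configured to read the file, e.g. PermitUserEnvironment in
recent OpenSSH):
  PATH=/home/pvm/lam/bin:/usr/local/bin:/usr/bin:/bin:/usr/ccs/bin
  LD_LIBRARY_PATH=/home/pvm/lam/lib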
Hope the output is helpful, and many thanks for your help :-)
By the way, does the master lamd listen on a fixed port, or could I set one in
some config file (sorry, I'm a bit short of time now), so that I can tcpdump
just the lamd boot traffic?
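In the meantime I could just filter the capture on the hosts instead of the
port; a sketch (interface name assumed, run on darkstar while lambooting):
  # all TCP traffic between the origin and zeus during the boot
  tcpdump -n -i eth0 tcp and host 192.168.0.10 and host 192.168.0.3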
Regards,
--
VIA NET.WORKS Nederland
Axel Scheepers
System Administrator UNIX
phone +31 40 239 33 93
fax +31 40 239 33 11
e-mail ascheepers_at_[hidden]
pgp id 21A33FE0
http://www.vianetworks.nl/