Hi All,
BACKGROUND:
1) Using lam-7.1.1 under FreeBSD 5.4
2) recon works on all nodes by rsh from the master node.
3) I can't seem to get $LAMHOME to "stick" after rebooting the master or
the nodes, and the non-interactive sh shell invoked by rsh doesn't seem
to search the usual places for things to add to my path. So I went for a
quick and dirty fix: I wrote a script to put links to all the files in
/usr/local/lam-mpi/bin into /usr/local/bin on each node. I don't think
this should be causing the problem, but thought I'd mention it as it's a
bit "non-standard".
4) Lamboot is launched from a script in my path called "slam" (Start
LAM), which contains the following line only:
lamboot -v -d -ssi boot rsh ~/.pantheonmap
where ~/.pantheonmap is the boot schema file containing the name of the
master and three slave nodes.
5) /usr/local/lam-mpi/bin is present on each node, as there's a mix of
processor types (AMD64, AMD32, P-3) so I wasn't sure if an nfs share
would work.
6) The home directory for each slave node is an nfs mount of the master
home directory.
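
The link script from item 3 is not reproduced here. As a rough idea of
what such a script might look like, here is a minimal sketch. The
function name link_bins is hypothetical, and the demonstration runs on
throwaway directories rather than on the real /usr/local/lam-mpi/bin and
/usr/local/bin, so it is an illustration under those assumptions and not
the script actually used on the cluster.

```shell
#!/bin/sh
# Hypothetical helper: symlink every regular file found in SRC into DST.
# On the cluster described above, SRC would be /usr/local/lam-mpi/bin and
# DST would be /usr/local/bin (assumption: run as a user who can write DST).
link_bins() {
    src=$1
    dst=$2
    for f in "$src"/*; do
        [ -f "$f" ] || continue              # skip directories and unmatched globs
        ln -sf "$f" "$dst/$(basename "$f")"  # -f replaces a stale link of the same name
    done
}

# Demonstration on temporary directories so nothing on the system is touched.
demo=$(mktemp -d "${TMPDIR:-/tmp}/lamlinks.XXXXXX")
mkdir "$demo/lam-bin" "$demo/bin"
touch "$demo/lam-bin/lamboot" "$demo/lam-bin/hboot" "$demo/lam-bin/tkill"

link_bins "$demo/lam-bin" "$demo/bin"
ls -l "$demo/bin"
```

Linking by absolute path means the links only resolve on a node where
the source directory exists at the same location, which matches the
per-node install described in item 5.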
PROBLEM:
Lamboot fails due to output on stderr from the slave nodes. I changed
the order of the nodes in the boot schema and the problem is consistent
across slave nodes. There are actually two errors, one when hboot is
called, and one when tkill is called after hboot fails. I'll concentrate
on the first error here, as the second one appears to have a similar
cause. Some output:
$ slam
n-1<694> ssi:boot:open: opening
n-1<694> ssi:boot:open: looking for boot module named rsh
n-1<694> ssi:boot:open: opening boot module rsh
n-1<694> ssi:boot:open: opened boot module rsh
n-1<694> ssi:boot:select: initializing boot module rsh
n-1<694> ssi:boot:rsh: module initializing
n-1<694> ssi:boot:rsh:agent: rsh
n-1<694> ssi:boot:rsh:username: <same>
n-1<694> ssi:boot:rsh:verbose: 1000
n-1<694> ssi:boot:rsh:algorithm: linear
n-1<694> ssi:boot:rsh:no_n: 0
n-1<694> ssi:boot:rsh:no_profile: 0
n-1<694> ssi:boot:rsh:fast: 0
n-1<694> ssi:boot:rsh:ignore_stderr: 0
n-1<694> ssi:boot:rsh:priority: 10
n-1<694> ssi:boot:select: boot module available: rsh, priority: 10
n-1<694> ssi:boot:select: selected boot module rsh
LAM 7.1.1/MPI 2 C++/ROMIO - Indiana University
n-1<694> ssi:boot:base: looking for boot schema in following directories:
n-1<694> ssi:boot:base: <current directory>
n-1<694> ssi:boot:base: $TROLLIUSHOME/etc
n-1<694> ssi:boot:base: $LAMHOME/etc
n-1<694> ssi:boot:base: /usr/local/lam-mpi/etc
n-1<694> ssi:boot:base: looking for boot schema file:
n-1<694> ssi:boot:base: /home/james/.pantheonmap
n-1<694> ssi:boot:base: found boot schema: /home/james/.pantheonmap
n-1<694> ssi:boot:rsh: found the following hosts:
n-1<694> ssi:boot:rsh: n0 pantheon (cpu=1)
n-1<694> ssi:boot:rsh: n1 euler (cpu=1)
n-1<694> ssi:boot:rsh: n2 lagrange (cpu=1)
n-1<694> ssi:boot:rsh: n3 taylor (cpu=1)
n-1<694> ssi:boot:rsh: resolved hosts:
n-1<694> ssi:boot:rsh: n0 pantheon --> 192.168.0.10 (origin)
n-1<694> ssi:boot:rsh: n1 euler --> 192.168.0.12
n-1<694> ssi:boot:rsh: n2 lagrange --> 192.168.0.11
n-1<694> ssi:boot:rsh: n3 taylor --> 192.168.0.13
n-1<694> ssi:boot:rsh: starting RTE procs
n-1<694> ssi:boot:base:linear: starting
n-1<694> ssi:boot:base:server: opening server TCP socket
n-1<694> ssi:boot:base:server: opened port 60547
n-1<694> ssi:boot:base:linear: booting n0 (pantheon)
n-1<694> ssi:boot:rsh: starting lamd on (pantheon)
n-1<694> ssi:boot:rsh: starting on n0 (pantheon): hboot -t -c
lam-conf.lamd -d -v -I -H 192.168.0.10 -P 60547 -n 0 -o 0
n-1<694> ssi:boot:rsh: launching locally
hboot: performing tkill
hboot: tkill -d
tkill: setting prefix to (null)
tkill: setting suffix to (null)
tkill: got killname back:
/tmp/lam-james_at_[hidden]/lam-killfile
tkill: removing socket file ...
tkill: socket file:
/tmp/lam-james_at_[hidden]/lam-kernel-socketd
tkill: removing IO daemon socket file ...
tkill: IO daemon socket file:
/tmp/lam-james_at_[hidden]/lam-io-socket
tkill: f_kill = "/tmp/lam-james_at_[hidden]/lam-killfile"
tkill: nothing to kill:
"/tmp/lam-james_at_[hidden]/lam-killfile"
hboot: booting...
hboot: fork /usr/local/bin/lamd
[1] 697 lamd -H 192.168.0.10 -P 60547 -n 0 -o 0 -d
hboot: attempting to execute
n-1<697> ssi:boot:open: opening
n-1<697> ssi:boot:open: looking for boot module named rsh
n-1<697> ssi:boot:open: opening boot module rsh
n-1<697> ssi:boot:open: opened boot module rsh
n-1<697> ssi:boot:select: initializing boot module rsh
n-1<697> ssi:boot:rsh: module initializing
n-1<697> ssi:boot:rsh:agent: rsh
n-1<697> ssi:boot:rsh:username: <same>
n-1<697> ssi:boot:rsh:verbose: 1000
n-1<697> ssi:boot:rsh:algorithm: linear
n-1<697> ssi:boot:rsh:no_n: 0
n-1<697> ssi:boot:rsh:no_profile: 0
n-1<697> ssi:boot:rsh:fast: 0
n-1<697> ssi:boot:rsh:ignore_stderr: 0
n-1<697> ssi:boot:rsh:priority: 10
n-1<697> ssi:boot:select: boot module available: rsh, priority: 10
n-1<697> ssi:boot:select: selected boot module rsh
n-1<697> ssi:boot:send_lamd: getting node ID from command line
n-1<697> ssi:boot:send_lamd: getting agent haddr from command line
n-1<697> ssi:boot:send_lamd: getting agent port from command line
n-1<697> ssi:boot:send_lamd: getting node ID from command line
n-1<697> ssi:boot:send_lamd: connecting to 192.168.0.10:60547, node id 0
n-1<697> ssi:boot:send_lamd: sending dli_port 57407
n-1<694> ssi:boot:rsh: successfully launched on n0 (pantheon)
n-1<694> ssi:boot:base:server: expecting connection from finite list
n-1<694> ssi:boot:base:server: got connection from 192.168.0.10
n-1<694> ssi:boot:base:server: this connection is expected (n0)
n-1<694> ssi:boot:base:server: remote lamd is at 192.168.0.10:57407
n-1<694> ssi:boot:base:linear: booting n1 (euler)
n-1<694> ssi:boot:rsh: starting lamd on (euler)
n-1<694> ssi:boot:rsh: starting on n1 (euler): hboot -t -c lam-conf.lamd
-d -v -s -I "-H 192.168.0.10 -P 60547 -n 1 -o 0"
n-1<694> ssi:boot:rsh: launching remotely
n-1<694> ssi:boot:rsh: attempting to execute: rsh euler -n 'echo $SHELL'
n-1<694> ssi:boot:rsh: remote shell /bin/sh
n-1<694> ssi:boot:rsh: attempting to execute: rsh euler -n '( ! [ -e
./.profile] || . ./.profile;' hboot -t -c lam-conf.lamd -d -v -s -I '"-H
192.168.0.10 -P 60547 -n 1 -o 0"' )
ERROR: LAM/MPI unexpectedly received the following on stderr:
[: missing ]
SOME OBSERVATIONS:
I followed the advice after the above error message and tried to execute
the following from the master node:
rsh euler -n '( ! [ -e ./.profile] || . ./.profile;' hboot -t -c
lam-conf.lamd -d -v -s -I '"-H 192.168.0.10 -P 60547 -n 1 -o 0"' )
with the result:
Syntax error: ")" unexpected
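
One way to check which side rejects a command of this shape is to hand
it to a local shell for parsing only. The sketch below uses shortened
stand-in command strings (the words a, b and c are placeholders, not the
real hboot arguments) and no rsh at all, so it only shows how a local sh
treats a bare ")" compared with one inside the quotes.

```shell
#!/bin/sh
# Stand-ins with the same quoting shape as the pasted command:
#   bad  : quoted text, more words, then a bare ")" outside any quotes
#   good : the same, but with ")" moved inside the final quoted word
bad="echo '( a;' b 'c' )"
good="echo '( a;' b 'c )'"

# -n asks the shell to parse without executing; output is discarded so the
# result is the same even on a shell that ignores -n together with -c.
if sh -n -c "$bad" >/dev/null 2>&1; then
    bad_status=parsed
else
    bad_status=syntax-error
fi

if sh -n -c "$good" >/dev/null 2>&1; then
    good_status=parsed
else
    good_status=syntax-error
fi

echo "bare ')' outside quotes: $bad_status"
echo "')' inside quotes:       $good_status"
```

If the first case reports a syntax error without any remote host being
involved, the message comes from the local shell that parsed the pasted
line, not from the slave node.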
Now, I read around the manuals, FAQ and this mailing list for a while,
and thought maybe it was the local (master) node complaining about the
syntax error, rather than the slave. So I moved a quotation mark to make
the above command:
rsh euler -n '( ! [ -e ./.profile] || . ./.profile;' hboot -t -c
lam-conf.lamd -d -v -s -I '"-H 192.168.0.10 -P 60547 -n 1 -o 0" )'
(Note that the only change is switching the order of the quotation mark
and the close bracket - the last two characters of the command) with the
result:
$ rsh euler -n '( ! [ -e ./.profile] || . ./.profile;' hboot -t -c
lam-conf.lamd -d -v -s -I '"-H 192.168.0.10 -P 60547 -n 1 -o 0" )'
[: missing ]
tkill: setting prefix to (null)
tkill: setting suffix to (null)
tkill: got killname back:
/tmp/lam-james_at_[hidden]/lam-killfile
tkill: removing socket file ...
tkill: socket file:
/tmp/lam-james_at_[hidden]/lam-kernel-socketd
tkill: removing IO daemon socket file ...
tkill: IO daemon socket file:
/tmp/lam-james_at_[hidden]/lam-io-socket
tkill: f_kill = "/tmp/lam-james_at_[hidden]/lam-killfile"
tkill: nothing to kill:
"/tmp/lam-james_at_[hidden]/lam-killfile"
hboot: performing tkill
hboot: tkill -d
hboot: booting...
hboot: fork /usr/local/bin/lamd
[1] 3930 lamd -H 192.168.0.10 -P 60547 -n 1 -o 0 -d
$
which I think indicates success, yes?
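
Note that "[: missing ]" still appears at the top of that output. A
likely explanation, offered as a sketch and not as a confirmed
diagnosis: in sh, "[" is an ordinary command that needs a standalone "]"
as its last argument, and in "[ -e ./.profile]" there is no space before
the "]", so it becomes part of the file name. The snippet below runs
both spellings in a local sh; it does not involve LAM or rsh.

```shell
#!/bin/sh
# Without a space, "./.profile]" is one word, so "[" never sees its closing "]".
broken_msg=$(sh -c '[ -e ./.profile]' 2>&1)
broken_status=$?

# With the space, "]" is its own argument and the test is well formed.
# Its status is 0 or 1 depending on whether ./.profile exists here.
fixed_msg=$(sh -c '[ -e ./.profile ]' 2>&1)
fixed_status=$?

echo "no space before ]: status=$broken_status stderr=$broken_msg"
echo "space before ]:    status=$fixed_status stderr=${fixed_msg:-<empty>}"
```

Because the malformed test only prints a complaint and returns an error
status, the rest of the command line (hboot) still runs, which would fit
the output above. The lamboot log shows ignore_stderr set to 0, so any
such message on stderr is presumably enough for lamboot to give up.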
NOTE:
The tkill command that causes the second LAM error message behaves in
exactly the same way: it fails until I switch the order of the final two
characters, and then it seems to work. I won't reproduce the error
messages etc., as I don't think they'll give you any new information...
THE QUESTION (AT LAST):
Is it likely that I've messed up the configuration, or is it possible
that lam-7.1.1 is getting the rsh commands wrong (in the placement of
quotation marks)? Is there something I can do about this?
Thanks in advance for any advice or observations...
Cheers,
James
--
******************************************
Dr. J.R. Dorsey
CNR - ESPM - Ecosystem Science
105 Hilgard Hall
University of California, Berkeley
Berkeley, CA 94720 - 3110
E: j.dorsey_at_[hidden]
T: +1 510 642 9048
M: +1 510 499 4398
W: http://nature.berkeley.edu/biometlab/