Here is the log from lamboot -d.
I also ran this telnet test:
[root_at_swingle lam]# telnet 192.168.3.1 32778
Trying 192.168.3.1...
telnet: connect to address 192.168.3.1: Connection refused
telnet: Unable to connect to remote host: Connection refused
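For reference, the same check can be scripted instead of typed into telnet. This is only a minimal sketch in Python using the standard library. `probe_port` is a hypothetical helper name, and the address and port would be the ones shown on the hboot command line (lamboot picks the port at random on each run, so it has to be read from the current output).

```python
import errno
import socket


def probe_port(host, port, timeout=5.0):
    """Attempt one TCP connection and classify the outcome.

    Returns "connected", "refused", "timeout", or "error: <name>".
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "connected"
    except ConnectionRefusedError:
        # The host answered but nothing is listening on that port.
        return "refused"
    except socket.timeout:
        # No answer at all within the timeout (packets possibly dropped).
        return "timeout"
    except OSError as exc:
        # Anything else, e.g. EHOSTUNREACH or ENETUNREACH.
        return "error: " + errno.errorcode.get(exc.errno, str(exc))
```

According to the lamboot help text in the log, the check is meant to be run from the remote node (swingle3 here), for example `probe_port("192.168.3.1", 32778)`. A "refused" result is the expected one; "timeout" or another error may point to filtering or routing problems between the two hosts.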
LOG:
[swingle_at_swingle lam]$ lamboot -d
n-1<6372> ssi:boot:open: opening
n-1<6372> ssi:boot:open: opening boot module globus
n-1<6372> ssi:boot:open: opened boot module globus
n-1<6372> ssi:boot:open: opening boot module rsh
n-1<6372> ssi:boot:open: opened boot module rsh
n-1<6372> ssi:boot:open: opening boot module slurm
n-1<6372> ssi:boot:open: opened boot module slurm
n-1<6372> ssi:boot:select: initializing boot module slurm
n-1<6372> ssi:boot:slurm: not running under SLURM
n-1<6372> ssi:boot:select: boot module not available: slurm
n-1<6372> ssi:boot:select: initializing boot module rsh
n-1<6372> ssi:boot:rsh: module initializing
n-1<6372> ssi:boot:rsh:agent: /usr/bin/ssh -x -a
n-1<6372> ssi:boot:rsh:username: <same>
n-1<6372> ssi:boot:rsh:verbose: 1000
n-1<6372> ssi:boot:rsh:algorithm: linear
n-1<6372> ssi:boot:rsh:no_n: 0
n-1<6372> ssi:boot:rsh:no_profile: 0
n-1<6372> ssi:boot:rsh:fast: 0
n-1<6372> ssi:boot:rsh:ignore_stderr: 0
n-1<6372> ssi:boot:rsh:priority: 10
n-1<6372> ssi:boot:select: boot module available: rsh, priority: 10
n-1<6372> ssi:boot:select: initializing boot module globus
n-1<6372> ssi:boot:globus: globus-job-run not found, globus boot will not run
n-1<6372> ssi:boot:select: boot module not available: globus
n-1<6372> ssi:boot:select: finalizing boot module slurm
n-1<6372> ssi:boot:slurm: finalizing
n-1<6372> ssi:boot:select: closing boot module slurm
n-1<6372> ssi:boot:select: finalizing boot module globus
n-1<6372> ssi:boot:globus: finalizing
n-1<6372> ssi:boot:select: closing boot module globus
n-1<6372> ssi:boot:select: selected boot module rsh
LAM 7.1.1/MPI 2 C++/ROMIO - Indiana University
n-1<6372> ssi:boot:base: looking for boot schema in following directories:
n-1<6372> ssi:boot:base: <current directory>
n-1<6372> ssi:boot:base: $TROLLIUSHOME/etc
n-1<6372> ssi:boot:base: $LAMHOME/etc
n-1<6372> ssi:boot:base: /etc/lam
n-1<6372> ssi:boot:base: looking for boot schema file:
n-1<6372> ssi:boot:base: lam-bhost.def
n-1<6372> ssi:boot:base: found boot schema: lam-bhost.def
n-1<6372> ssi:boot:rsh: found the following hosts:
n-1<6372> ssi:boot:rsh: n0 swingle (cpu=1)
n-1<6372> ssi:boot:rsh: n1 swingle3 (cpu=1)
n-1<6372> ssi:boot:rsh: resolved hosts:
n-1<6372> ssi:boot:rsh: n0 swingle --> 192.168.3.1 (origin)
n-1<6372> ssi:boot:rsh: n1 swingle3 --> 192.168.3.3
n-1<6372> ssi:boot:rsh: starting RTE procs
n-1<6372> ssi:boot:base:linear: starting
n-1<6372> ssi:boot:base:server: opening server TCP socket
n-1<6372> ssi:boot:base:server: opened port 32778
n-1<6372> ssi:boot:base:linear: booting n0 (swingle)
n-1<6372> ssi:boot:rsh: starting lamd on (swingle)
n-1<6372> ssi:boot:rsh: starting on n0 (swingle): hboot -t -c lam-conf.lamd -d -I -H 192.168.3.1 -P 32778 -n 0 -o 0
n-1<6372> ssi:boot:rsh: launching locally
hboot: performing tkill
hboot: tkill -d
tkill: setting prefix to (null)
tkill: setting suffix to (null)
tkill: got killname back: /tmp/lam-swingle_at_swingle/lam-killfile
tkill: removing socket file ...
tkill: socket file: /tmp/lam-swingle_at_swingle/lam-kernel-socketd
tkill: removing IO daemon socket file ...
tkill: IO daemon socket file: /tmp/lam-swingle_at_swingle/lam-io-socket
tkill: f_kill = "/tmp/lam-swingle_at_swingle/lam-killfile"
tkill: nothing to kill: "/tmp/lam-swingle_at_swingle/lam-killfile"
hboot: booting...
hboot: fork /usr/bin/lamd
hboot: attempting to execute
n-1<6375> ssi:boot:open: opening
n-1<6375> ssi:boot:open: opening boot module globus
n-1<6375> ssi:boot:open: opened boot module globus
n-1<6375> ssi:boot:open: opening boot module rsh
n-1<6375> ssi:boot:open: opened boot module rsh
n-1<6375> ssi:boot:open: opening boot module slurm
n-1<6375> ssi:boot:open: opened boot module slurm
n-1<6375> ssi:boot:select: initializing boot module slurm
n-1<6375> ssi:boot:slurm: not running under SLURM
n-1<6375> ssi:boot:select: boot module not available: slurm
n-1<6375> ssi:boot:select: initializing boot module rsh
n-1<6375> ssi:boot:rsh: module initializing
n-1<6375> ssi:boot:rsh:agent: /usr/bin/ssh -x -a
n-1<6375> ssi:boot:rsh:username: <same>
n-1<6375> ssi:boot:rsh:verbose: 1000
n-1<6375> ssi:boot:rsh:algorithm: linear
n-1<6375> ssi:boot:rsh:no_n: 0
n-1<6375> ssi:boot:rsh:no_profile: 0
n-1<6375> ssi:boot:rsh:fast: 0
n-1<6375> ssi:boot:rsh:ignore_stderr: 0
n-1<6375> ssi:boot:rsh:priority: 10
n-1<6375> ssi:boot:select: boot module available: rsh, priority: 10
n-1<6375> ssi:boot:select: initializing boot module globus
n-1<6375> ssi:boot:globus: globus-job-run not found, globus boot will not run
n-1<6375> ssi:boot:select: boot module not available: globus
n-1<6375> ssi:boot:select: finalizing boot module slurm
n-1<6375> ssi:boot:slurm: finalizing
n-1<6375> ssi:boot:select: closing boot module slurm
n-1<6375> ssi:boot:select: finalizing boot module globus
n-1<6375> ssi:boot:globus: finalizing
n-1<6375> ssi:boot:select: closing boot module globus
n-1<6375> ssi:boot:select: selected boot module rsh
n-1<6375> ssi:boot:send_lamd: getting node ID from command line
n-1<6375> ssi:boot:send_lamd: getting agent haddr from command line
n-1<6375> ssi:boot:send_lamd: getting agent port from command line
n-1<6375> ssi:boot:send_lamd: getting node ID from command line
n-1<6375> ssi:boot:send_lamd: connecting to 192.168.3.1:32778, node id 0
n-1<6375> ssi:boot:send_lamd: sending dli_port 32794
[1] 6375 lamd -H 192.168.3.1 -P 32778 -n 0 -o 0 -d
n-1<6372> ssi:boot:rsh: successfully launched on n0 (swingle)
n-1<6372> ssi:boot:base:server: expecting connection from finite list
n-1<6372> ssi:boot:base:server: got connection from 192.168.3.1
n-1<6372> ssi:boot:base:server: this connection is expected (n0)
n-1<6372> ssi:boot:base:server: remote lamd is at 192.168.3.1:32794
n-1<6372> ssi:boot:base:linear: booting n1 (swingle3)
n-1<6372> ssi:boot:rsh: starting lamd on (swingle3)
n-1<6372> ssi:boot:rsh: starting on n1 (swingle3): hboot -t -c lam-conf.lamd -d -s -I "-H 192.168.3.1 -P 32778 -n 1 -o 0"
n-1<6372> ssi:boot:rsh: launching remotely
n-1<6372> ssi:boot:rsh: attempting to execute: /usr/bin/ssh -x -a swingle3 -n 'echo $SHELL'
swingle_at_swingle3's password:
n-1<6372> ssi:boot:rsh: remote shell /bin/bash
n-1<6372> ssi:boot:rsh: attempting to execute: /usr/bin/ssh -x -a swingle3 -n hboot -t -c lam-conf.lamd -d -s -I '"-H 192.168.3.1 -P 32778 -n 1 -o 0"'
swingle_at_swingle3's password:
tkill: setting prefix to (null)
tkill: setting suffix to (null)
tkill: got killname back: /tmp/lam-swingle_at_swingle3/lam-killfile
tkill: removing socket file ...
tkill: socket file: /tmp/lam-swingle_at_swingle3/lam-kernel-socketd
tkill: removing IO daemon socket file ...
tkill: IO daemon socket file: /tmp/lam-swingle_at_swingle3/lam-io-socket
tkill: f_kill = "/tmp/lam-swingle_at_swingle3/lam-killfile"
tkill: nothing to kill: "/tmp/lam-swingle_at_swingle3/lam-killfile"
hboot: performing tkill
hboot: tkill -d
hboot: booting...
hboot: fork /usr/bin/lamd
[1] 5309 lamd -H 192.168.3.1 -P 32778 -n 1 -o 0 -d
n-1<6372> ssi:boot:rsh: successfully launched on n1 (swingle3)
n-1<6372> ssi:boot:base:server: expecting connection from finite list
-----------------------------------------------------------------------------
The lamboot agent timed out while waiting for the newly-booted process
to call back and indicated that it had successfully booted.
*** PLEASE READ THIS ENTIRE MESSAGE, FOLLOW ITS SUGGESTIONS, AND
*** CONSULT THE "BOOTING LAM" SECTION OF THE LAM/MPI FAQ
*** (http://www.lam-mpi.org/faq/) BEFORE POSTING TO THE LAM/MPI USER'S
*** MAILING LIST.
As far as LAM could tell, the remote process started properly, but
then never called back. Possible reasons that this may happen:
- There are network filters between the lamboot agent host and
  the remote host such that communication on random TCP ports
  is blocked
- Network routing from the remote host to the local host isn't
  properly configured (this is uncommon)
You can check these things by watching the output from "lamboot -d".
1. On the command line for hboot, there are two important parameters:
   one is the IP address of where the lamboot agent was invoked, the
   other is the port number that the lamboot agent is expecting the
   newly-booted process to call back on (this will be a random
   integer).
2. Manually login to the remote machine and try to telnet to the port
   indicated on the hboot command line. For example,
       telnet <ipnumber> <portnumber>
   If all goes well, you should get a "Connection refused" error. If
   you get any other kind of error, it could indicate either of the
   two conditions above. Consult with your system/network
   administrator.
-----------------------------------------------------------------------------
n-1<6372> ssi:boot:base:server: failed to connect to remote lamd!
n-1<6372> ssi:boot:base:server: closing server socket
n-1<6372> ssi:boot:base:linear: aborted!
n-1<6380> ssi:boot:open: opening
n-1<6380> ssi:boot:open: opening boot module globus
n-1<6380> ssi:boot:open: opened boot module globus
n-1<6380> ssi:boot:open: opening boot module rsh
n-1<6380> ssi:boot:open: opened boot module rsh
n-1<6380> ssi:boot:open: opening boot module slurm
n-1<6380> ssi:boot:open: opened boot module slurm
n-1<6380> ssi:boot:select: initializing boot module slurm
n-1<6380> ssi:boot:slurm: not running under SLURM
n-1<6380> ssi:boot:select: boot module not available: slurm
n-1<6380> ssi:boot:select: initializing boot module rsh
n-1<6380> ssi:boot:rsh: module initializing
n-1<6380> ssi:boot:rsh:agent: /usr/bin/ssh -x -a
n-1<6380> ssi:boot:rsh:username: <same>
n-1<6380> ssi:boot:rsh:verbose: 1000
n-1<6380> ssi:boot:rsh:algorithm: linear
n-1<6380> ssi:boot:rsh:no_n: 0
n-1<6380> ssi:boot:rsh:no_profile: 0
n-1<6380> ssi:boot:rsh:fast: 0
n-1<6380> ssi:boot:rsh:ignore_stderr: 0
n-1<6380> ssi:boot:rsh:priority: 10
n-1<6380> ssi:boot:select: boot module available: rsh, priority: 10
n-1<6380> ssi:boot:select: initializing boot module globus
n-1<6380> ssi:boot:globus: globus-job-run not found, globus boot will not run
n-1<6380> ssi:boot:select: boot module not available: globus
n-1<6380> ssi:boot:select: finalizing boot module slurm
n-1<6380> ssi:boot:slurm: finalizing
n-1<6380> ssi:boot:select: closing boot module slurm
n-1<6380> ssi:boot:select: finalizing boot module globus
n-1<6380> ssi:boot:globus: finalizing
n-1<6380> ssi:boot:select: closing boot module globus
n-1<6380> ssi:boot:select: selected boot module rsh
n-1<6380> ssi:boot:base: looking for boot schema in following directories:
n-1<6380> ssi:boot:base: <current directory>
n-1<6380> ssi:boot:base: $TROLLIUSHOME/etc
n-1<6380> ssi:boot:base: $LAMHOME/etc
n-1<6380> ssi:boot:base: /etc/lam
n-1<6380> ssi:boot:base: looking for boot schema file:
n-1<6380> ssi:boot:base: lam-bhost.def
n-1<6380> ssi:boot:base: found boot schema: lam-bhost.def
n-1<6380> ssi:boot:rsh: found the following hosts:
n-1<6380> ssi:boot:rsh: n0 swingle (cpu=1)
n-1<6380> ssi:boot:rsh: n1 swingle3 (cpu=1)
n-1<6380> ssi:boot:rsh: resolved hosts:
n-1<6380> ssi:boot:rsh: n0 swingle --> 192.168.3.1 (origin)
n-1<6380> ssi:boot:rsh: n1 swingle3 --> 192.168.3.3
n-1<6380> ssi:boot:rsh: starting RTE procs
n-1<6380> ssi:boot:base:linear: starting
n-1<6380> ssi:boot:base:linear: booting n0 (swingle)
n-1<6380> ssi:boot:rsh: starting wipe on (swingle)
n-1<6380> ssi:boot:rsh: starting on n0 (swingle): tkill -d
n-1<6380> ssi:boot:rsh: launching locally
tkill: setting prefix to (null)
tkill: setting suffix to (null)
tkill: got killname back: /tmp/lam-swingle_at_swingle/lam-killfile
tkill: removing socket file ...
tkill: socket file: /tmp/lam-swingle_at_swingle/lam-kernel-socketd
tkill: removing IO daemon socket file ...
tkill: IO daemon socket file: /tmp/lam-swingle_at_swingle/lam-io-socket
tkill: f_kill = "/tmp/lam-swingle_at_swingle/lam-killfile"
tkill: killing LAM...
tkill: killing PID (SIGHUP) 6375 ...
tkill: killed
tkill: all finished
n-1<6380> ssi:boot:rsh: successfully launched on n0 (swingle)
n-1<6380> ssi:boot:base:linear: booting n1 (swingle3)
n-1<6380> ssi:boot:rsh: starting wipe on (swingle3)
n-1<6380> ssi:boot:rsh: starting on n1 (swingle3): tkill -d
n-1<6380> ssi:boot:rsh: launching remotely
n-1<6380> ssi:boot:rsh: attempting to execute: /usr/bin/ssh -x -a swingle3 -n 'echo $SHELL'
swingle_at_swingle3's password:
ERROR: LAM/MPI unexpectedly received the following on stderr:
Connection closed by 192.168.3.3
-----------------------------------------------------------------------------
LAM failed to execute a process on the remote node "swingle3".
LAM was not trying to invoke any LAM-specific commands yet -- we were
simply trying to determine what shell was being used on the remote
host.
LAM tried to use the remote agent command "/usr/bin/ssh"
to invoke "echo $SHELL" on the remote node.
*** PLEASE READ THIS ENTIRE MESSAGE, FOLLOW ITS SUGGESTIONS, AND
*** CONSULT THE "BOOTING LAM" SECTION OF THE LAM/MPI FAQ
*** (http://www.lam-mpi.org/faq/) BEFORE POSTING TO THE LAM/MPI USER'S
*** MAILING LIST.
This usually indicates an authentication problem with the remote
agent, some other configuration type of error in your .cshrc or
.profile file, or you were unable to executable a command on the
remote node for some other reason. The following is a list of items
that you should check on the remote node:
- You have an account and can login to the remote machine
- Incorrect permissions on your home directory (should
  probably be 0755)
- Incorrect permissions on your $HOME/.rhosts file (if you are
  using rsh -- they should probably be 0644)
- You have an entry in the remote $HOME/.rhosts file (if you
  are using rsh) for the machine and username that you are
  running from
- Your .cshrc/.profile must not print anything out to the
  standard error
- Your .cshrc/.profile should set a correct TERM type
- Your .cshrc/.profile should set the SHELL environment
  variable to your default shell
Try invoking the following command at the unix command line:
/usr/bin/ssh -x -a swingle3 -n 'echo $SHELL'
You will need to configure your local setup such that you will *not*
be prompted for a password to invoke this command on the remote node.
No output should be printed from the remote node before the output of
the command is displayed.
When you can get this command to execute successfully by hand, LAM
will probably be able to function properly.
-----------------------------------------------------------------------------
n-1<6380> ssi:boot:base:linear: Failed to boot n1 (swingle3)
n-1<6380> ssi:boot:base:linear: aborted!
lamboot did NOT complete successfully
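The log shows ssh prompting for a password three times, while the LAM message above asks for a setup where `echo $SHELL` runs on swingle3 with no prompt and nothing on stderr. A rough way to test that by script is sketched below in Python. The function names are hypothetical, and the sketch assumes the agent is OpenSSH, whose `BatchMode=yes` option makes ssh fail instead of asking for a password.

```python
import subprocess


def ssh_check_command(host, agent=("/usr/bin/ssh", "-x", "-a")):
    """Build the shell-detection command from the log, plus BatchMode.

    BatchMode=yes is an OpenSSH client option (an assumption about the
    agent in use); it disables password prompts so a failure is visible.
    """
    return [*agent, "-n", "-o", "BatchMode=yes", host, "echo $SHELL"]


def check_passwordless(host, run=subprocess.run):
    """Return (ok, stdout, stderr) for one non-interactive ssh attempt.

    ok is True only if ssh exits with 0, prints a shell path, and writes
    nothing to stderr, since LAM treats any stderr output as an error.
    """
    proc = run(ssh_check_command(host), capture_output=True,
               text=True, timeout=30)
    out, err = proc.stdout.strip(), proc.stderr.strip()
    ok = proc.returncode == 0 and out != "" and err == ""
    return ok, out, err
```

Calling `check_passwordless("swingle3")` from swingle should give `ok == True` once key-based (or otherwise prompt-free) login is configured; until then the stderr text should indicate why ssh gave up.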