Hello,
We have a small Linux cluster running Red Hat 9. The front-end node can be
reached from the outside world, but the rest of the nodes on the cluster can
only be accessed through the front-end node. We use LAM/MPI 7.0 with the tcp
SSI module. Can we use LAM with this kind of configuration?
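For reference, the boot schema file we pass to lamboot lists the front-end node
and the compute node. The exact file is not reproduced here, but from the hosts
resolved in the output below it would look roughly like this (hostnames and CPU
counts taken from the lamboot output; treat it as a sketch, not a verbatim copy):

    liv0 cpu=1
    liv1 cpu=1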
The lamboot command fails with the messages below. Please note the error line
near the end of the output: it shows a connection arriving from a nonsense IP
address (223.213.12.64):
[liv-fe]</cmlocal/r3.00>% lamboot -d -x liv0.list
n0<27570> ssi:boot: Opening
n0<27570> ssi:boot: opening module globus
n0<27570> ssi:boot: initializing module globus
n0<27570> ssi:boot:globus: globus-job-run not found, globus boot will not run
n0<27570> ssi:boot: module not available: globus
n0<27570> ssi:boot: opening module rsh
n0<27570> ssi:boot: initializing module rsh
n0<27570> ssi:boot:rsh: module initializing
n0<27570> ssi:boot:rsh:agent: rsh
n0<27570> ssi:boot:rsh:username: <same>
n0<27570> ssi:boot:rsh:verbose: 1000
n0<27570> ssi:boot:rsh:algorithm: linear
n0<27570> ssi:boot:rsh:priority: 10
n0<27570> ssi:boot: module available: rsh, priority: 10
n0<27570> ssi:boot: finalizing module globus
n0<27570> ssi:boot:globus: finalizing
n0<27570> ssi:boot: closing module globus
n0<27570> ssi:boot: Selected boot module rsh
LAM 7.0/MPI 2 C++ - Indiana University
n0<27570> ssi:boot:base: looking for boot schema in following directories:
n0<27570> ssi:boot:base: <current directory>
n0<27570> ssi:boot:base: $TROLLIUSHOME/etc
n0<27570> ssi:boot:base: $LAMHOME/etc
n0<27570> ssi:boot:base: /cm/production/r3.00/ap/local/lam-7.0-pgs/LINUXM/etc
n0<27570> ssi:boot:base: looking for boot schema file:
n0<27570> ssi:boot:base: liv0.list
n0<27570> ssi:boot:base: found boot schema: liv0.list
n0<27570> ssi:boot:rsh: found the following hosts:
n0<27570> ssi:boot:rsh: n0 liv0 (cpu=1)
n0<27570> ssi:boot:rsh: n1 liv1 (cpu=1)
n0<27570> ssi:boot:rsh: resolved hosts:
n0<27570> ssi:boot:rsh: n0 liv0 --> 192.168.1.1 (origin)
n0<27570> ssi:boot:rsh: n1 liv1 --> 192.168.1.2
n0<27570> ssi:boot:rsh: starting RTE procs
n0<27570> ssi:boot:base:linear: starting
n0<27570> ssi:boot:base:server: opening server TCP socket
n0<27570> ssi:boot:base:server: opened port 54122
n0<27570> ssi:boot:base:linear: booting n0 (liv0)
n0<27570> ssi:boot:rsh: starting lamd on (liv0)
n0<27570> ssi:boot:rsh: starting on n0 (liv0): hboot -t -c lam-conf.lamd -d -I -x -H 192.168.1.1 -P 54122 -n 0 -o 0
n0<27570> ssi:boot:rsh: launching locally
hboot: performing tkill
hboot: tkill -d
tkill: setting prefix to (null)
tkill: setting suffix to (null)
tkill: got killname back: /tmp/lam-oroper@liv-fe/lam-killfile
tkill: removing socket file ...
tkill: socket file: /tmp/lam-oroper@liv-fe/lam-kernel-socketd
tkill: removing IO daemon socket file ...
tkill: IO daemon socket file: /tmp/lam-oroper@liv-fe/lam-io-socket
tkill: f_kill = "/tmp/lam-oroper@liv-fe/lam-killfile"
tkill: killing LAM...
tkill: killing PID (SIGHUP) 27475 ...
tkill: killed
tkill: all finished
hboot: booting...
hboot: fork /cm/production/r3.00/ap/local/lam-7/LINUXM/bin/lamd
hboot: attempting to execute
[1] 27573 lamd -x -H 192.168.1.1 -P 54122 -n 0 -o 0 -d
n0<27570> ssi:boot:rsh: successfully launched on n0 (liv0)
n0<27570> ssi:boot:base:server: expecting connection from finite list
n-1<27573> ssi:boot: Opening
n-1<27573> ssi:boot: opening module globus
n-1<27573> ssi:boot: initializing module globus
n-1<27573> ssi:boot:globus: globus-job-run not found, globus boot will not run
n-1<27573> ssi:boot: module not available: globus
n-1<27573> ssi:boot: opening module rsh
n-1<27573> ssi:boot: initializing module rsh
n-1<27573> ssi:boot:rsh: module initializing
n-1<27573> ssi:boot:rsh:agent: rsh
n-1<27573> ssi:boot:rsh:username: <same>
n-1<27573> ssi:boot:rsh:verbose: 1000
n-1<27573> ssi:boot:rsh:algorithm: linear
n-1<27573> ssi:boot:rsh:priority: 10
n-1<27573> ssi:boot: module available: rsh, priority: 10
n-1<27573> ssi:boot: finalizing module globus
n-1<27573> ssi:boot:globus: finalizing
n-1<27573> ssi:boot: closing module globus
n-1<27573> ssi:boot: Selected boot module rsh
n0<27570> ssi:boot:base:server: got connection from 192.168.1.1
n0<27570> ssi:boot:base:server: this connection is expected (n0)
n0<27570> ssi:boot:base:server: remote lamd is at 192.168.1.1:44674
n0<27570> ssi:boot:base:linear: booting n1 (liv1)
n0<27570> ssi:boot:rsh: starting lamd on (liv1)
n0<27570> ssi:boot:rsh: starting on n1 (liv1): hboot -t -c lam-conf.lamd -d -s -I "-x -H 192.168.1.1 -P 54122 -n 1 -o 0"
n0<27570> ssi:boot:rsh: launching remotely
n0<27570> ssi:boot:rsh: attempting to execute "rsh liv1 -n echo $SHELL"
n0<27570> ssi:boot:rsh: remote shell R3.00 login script is running.
/bin/csh
n0<27570> ssi:boot:rsh: attempting to execute "rsh liv1 -n hboot -t -c lam-conf.lamd -d -s -I "-x -H 192.168.1.1 -P 54122 -n 1 -o 0""
R3.00 login script is running.
tkill: setting prefix to (null)
tkill: setting suffix to (null)
tkill: got killname back: /tmp/lam-oroper@liv1/lam-killfile
tkill: removing socket file ...
tkill: socket file: /tmp/lam-oroper@liv1/lam-kernel-socketd
tkill: removing IO daemon socket file ...
tkill: IO daemon socket file: /tmp/lam-oroper@liv1/lam-io-socket
tkill: f_kill = "/tmp/lam-oroper@liv1/lam-killfile"
tkill: nothing to kill: "/tmp/lam-oroper@liv1/lam-killfile"
hboot: performing tkill
hboot: tkill -d
hboot: booting...
hboot: fork /cm/production/r3.00/ap/local/lam-7/LINUXM/bin/lamd
[1] 23750 lamd -x -H 192.168.1.1 -P 54122 -n 1 -o 0 -d
n0<27570> ssi:boot:rsh: successfully launched on n1 (liv1)
n0<27570> ssi:boot:base:server: expecting connection from finite list
n0<27570> ssi:boot:base:server: got connection from 223.213.12.64
-----------------------------------------------------------------------------
The lamboot agent timed out while waiting for the newly-booted process
to call back and indicate that it had successfully booted.
As far as LAM could tell, the remote process started properly, but
then never called back. Possible reasons that this may happen:
- There are network filters between the lamboot agent host and
the remote host such that communication on random TCP ports
is blocked
- Network routing from the remote host to the local host isn't
properly configured (this is uncommon)
You can check these things by watching the output from "lamboot -d".
1. On the command line for hboot, there are two important parameters:
one is the IP address of where the lamboot agent was invoked, the
other is the port number that the lamboot agent is expecting the
newly-booted process to call back on (this will be a random
integer).
2. Manually login to the remote machine and try to telnet to the port
indicated on the hboot command line. For example,
telnet <ipnumber> <portnumber>
If all goes well, you should get a "Connection refused" error. If
you get any other kind of error, it could indicate either of the
two conditions above. Consult with your system/network
administrator.
-----------------------------------------------------------------------------
n0<27570> ssi:boot:base:server: failed to connect to remote lamd!
n0<27570> ssi:boot:base:server: closing server socket
n0<27570> ssi:boot:base:linear: aborted!
-----------------------------------------------------------------------------
lamboot encountered some error (see above) during the boot process,
and will now attempt to kill all nodes that it was previously able to
boot (if any).
Please wait for LAM to finish; if you interrupt this process, you may
have LAM daemons still running on remote nodes.
-----------------------------------------------------------------------------
-----------------------------------------------------------------------------
Synopsis: wipe [-dhHvV] [-nn] [-np] [-w <#>] [<bhost>]
Description: This command has been obsoleted by the "lamhalt"
command.
You should be using that instead. However, "wipe" can
still be used to shut down a LAM universe.
Options:
-b Use the faster wipe algorithm; will only work if shell
on all remote nodes is same as shell on local node
-d Print debugging message (implies -v)
-h Print this message
-H Don't print the header
-nn Don't add "-n" to the remote agent command line
-np Do not force the execution of $HOME/.profile on remote
hosts
-v Be verbose
-V Print version and exit without shutting down LAM
-w <#> Wipe the first <#> nodes
<bhost> Use <bhost> as the boot schema
-----------------------------------------------------------------------------
lamboot did NOT complete successfully
========================================= end of output from lamboot.
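For what it is worth, the check that the error message suggests would look
roughly like the following on our setup. The address 192.168.1.1 and port 54122
are taken from the hboot command line in this particular run (the port is a
different random number on every boot), so this is only a sketch of the test
rather than captured output:

    # from the front end, log in to the compute node, then probe the callback port
    rsh liv1
    telnet 192.168.1.1 54122

According to the message, a "Connection refused" response would mean the port is
reachable from liv1; a timeout or "no route to host" would instead point to a
packet filter or a routing problem between liv1 and the front-end node.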
Any help is greatly appreciated.
Thank you.
Lily