Hi
I am running two Red Hat Linux 4 ES machines (cfd004 and cfd005) and want to
use LAM on them. I have installed LAM on both, rsh works fine between the two,
and no firewall is active.
lamboot from cfd005 to cfd004 works fine, but lamboot from cfd004 to cfd005
fails with the error below.
---------------------------------------------------------------------------------------------------------------------------------------------
lamboot -d hosts
n-1<23204> ssi:boot:open: opening
n-1<23204> ssi:boot:open: opening boot module globus
n-1<23204> ssi:boot:open: opened boot module globus
n-1<23204> ssi:boot:open: opening boot module rsh
n-1<23204> ssi:boot:open: opened boot module rsh
n-1<23204> ssi:boot:open: opening boot module slurm
n-1<23204> ssi:boot:open: opened boot module slurm
n-1<23204> ssi:boot:select: initializing boot module slurm
n-1<23204> ssi:boot:slurm: not running under SLURM
n-1<23204> ssi:boot:select: boot module not available: slurm
n-1<23204> ssi:boot:select: initializing boot module globus
n-1<23204> ssi:boot:globus: globus-job-run not found, globus boot will not run
n-1<23204> ssi:boot:select: boot module not available: globus
n-1<23204> ssi:boot:select: initializing boot module rsh
n-1<23204> ssi:boot:rsh: module initializing
n-1<23204> ssi:boot:rsh:agent: rsh
n-1<23204> ssi:boot:rsh:username: <same>
n-1<23204> ssi:boot:rsh:verbose: 1000
n-1<23204> ssi:boot:rsh:algorithm: linear
n-1<23204> ssi:boot:rsh:no_n: 0
n-1<23204> ssi:boot:rsh:no_profile: 0
n-1<23204> ssi:boot:rsh:fast: 0
n-1<23204> ssi:boot:rsh:ignore_stderr: 0
n-1<23204> ssi:boot:rsh:priority: 10
n-1<23204> ssi:boot:select: boot module available: rsh, priority: 10
n-1<23204> ssi:boot:select: finalizing boot module slurm
n-1<23204> ssi:boot:slurm: finalizing
n-1<23204> ssi:boot:select: closing boot module slurm
n-1<23204> ssi:boot:select: finalizing boot module globus
n-1<23204> ssi:boot:globus: finalizing
n-1<23204> ssi:boot:select: closing boot module globus
n-1<23204> ssi:boot:select: selected boot module rsh
LAM 7.1.1/MPI 2 C++/ROMIO -
n-1<23204> ssi:boot:base: looking for boot schema in following directories:
n-1<23204> ssi:boot:base: <current directory>
n-1<23204> ssi:boot:base: $TROLLIUSHOME/etc
n-1<23204> ssi:boot:base: $LAMHOME/etc
n-1<23204> ssi:boot:base: /usr/local/etc
n-1<23204> ssi:boot:base: looking for boot schema file:
n-1<23204> ssi:boot:base: hosts
n-1<23204> ssi:boot:base: found boot schema: hosts
n-1<23204> ssi:boot:rsh: found the following hosts:
n-1<23204> ssi:boot:rsh: n0 cfd004 (cpu=1)
n-1<23204> ssi:boot:rsh: n1 cfd005 (cpu=1)
n-1<23204> ssi:boot:rsh: resolved hosts:
n-1<23204> ssi:boot:rsh: n0 cfd004 --> 10.1.0.15 (origin)
n-1<23204> ssi:boot:rsh: n1 cfd005 --> 10.1.0.16
n-1<23204> ssi:boot:rsh: starting RTE procs
n-1<23204> ssi:boot:base:linear: starting
n-1<23204> ssi:boot:base:server: opening server TCP socket
n-1<23204> ssi:boot:base:server: opened port 33462
n-1<23204> ssi:boot:base:linear: booting n0 (cfd004)
n-1<23204> ssi:boot:rsh: starting lamd on (cfd004)
n-1<23204> ssi:boot:rsh: starting on n0 (cfd004): hboot -t -c lam-conf.lamd -d -I -H 10.1.0.15 -P 33462 -n 0 -o 0
n-1<23204> ssi:boot:rsh: launching locally
hboot: performing tkill
hboot: tkill -d
tkill: setting prefix to (null)
tkill: setting suffix to (null)
tkill: got killname back: /tmp/lam-kur@cfd004/lam-killfile
tkill: removing socket file ...
tkill: socket file: /tmp/lam-kur@cfd004/lam-kernel-socketd
tkill: removing IO daemon socket file ...
tkill: IO daemon socket file: /tmp/lam-kur@cfd004/lam-io-sockettkill: f_kill = "/tmp/lam-kur@cfd004/lam-killfile"
tkill: nothing to kill: "/tmp/lam-kur@cfd004/lam-killfile"
hboot: booting...
hboot: fork /usr/local/bin/lamd
hboot: attempting to execute
[1] 23207 lamd -H 10.1.0.15 -P 33462 -n 0 -o 0 -d
n-1<23204> ssi:boot:rsh: successfully launched on n0 (cfd004)
n-1<23204> ssi:boot:base:server: expecting connection from finite list
n-1<23207> ssi:boot:open: opening
n-1<23207> ssi:boot:open: opening boot module globus
n-1<23207> ssi:boot:open: opened boot module globus
n-1<23207> ssi:boot:open: opening boot module rsh
n-1<23207> ssi:boot:open: opened boot module rsh
n-1<23207> ssi:boot:open: opening boot module slurm
n-1<23207> ssi:boot:open: opened boot module slurm
n-1<23207> ssi:boot:select: initializing boot module slurm
n-1<23207> ssi:boot:slurm: not running under SLURM
n-1<23207> ssi:boot:select: boot module not available: slurm
n-1<23207> ssi:boot:select: initializing boot module globus
n-1<23207> ssi:boot:globus: globus-job-run not found, globus boot will not run
n-1<23207> ssi:boot:select: boot module not available: globus
n-1<23207> ssi:boot:select: initializing boot module rsh
n-1<23207> ssi:boot:rsh: module initializing
n-1<23207> ssi:boot:rsh:agent: rsh
n-1<23207> ssi:boot:rsh:username: <same>
n-1<23207> ssi:boot:rsh:verbose: 1000
n-1<23207> ssi:boot:rsh:algorithm: linear
n-1<23207> ssi:boot:rsh:no_n: 0
n-1<23207> ssi:boot:rsh:no_profile: 0
n-1<23207> ssi:boot:rsh:fast: 0
n-1<23207> ssi:boot:rsh:ignore_stderr: 0
n-1<23207> ssi:boot:rsh:priority: 10
n-1<23207> ssi:boot:select: boot module available: rsh, priority: 10
n-1<23207> ssi:boot:select: finalizing boot module slurm
n-1<23207> ssi:boot:slurm: finalizing
n-1<23207> ssi:boot:select: closing boot module slurm
n-1<23207> ssi:boot:select: finalizing boot module globus
n-1<23207> ssi:boot:globus: finalizing
n-1<23207> ssi:boot:select: closing boot module globus
n-1<23207> ssi:boot:select: selected boot module rsh
n-1<23207> ssi:boot:send_lamd: getting node ID from command line
n-1<23207> ssi:boot:send_lamd: getting agent haddr from command line
n-1<23207> ssi:boot:send_lamd: getting agent port from command line
n-1<23207> ssi:boot:send_lamd: getting node ID from command line
n-1<23207> ssi:boot:send_lamd: connecting to 10.1.0.15:33462, node id 0
n-1<23207> ssi:boot:send_lamd: sending dli_port 33462
n-1<23204> ssi:boot:base:server: got connection from 10.1.0.15
n-1<23204> ssi:boot:base:server: this connection is expected (n0)
n-1<23204> ssi:boot:base:server: remote lamd is at 10.1.0.15:33462
n-1<23204> ssi:boot:base:linear: booting n1 (cfd005)
n-1<23204> ssi:boot:rsh: starting lamd on (cfd005)
n-1<23204> ssi:boot:rsh: starting on n1 (cfd005): hboot -t -c lam-conf.lamd -d -s -I "-H 10.1.0.15 -P 33462 -n 1 -o 0"
n-1<23204> ssi:boot:rsh: launching remotely
n-1<23204> ssi:boot:rsh: attempting to execute: rsh cfd005 -n 'echo $SHELL'
n-1<23204> ssi:boot:rsh: remote shell /bin/bash
n-1<23204> ssi:boot:rsh: attempting to execute: rsh cfd005 -n hboot -t -c lam-conf.lamd -d -s -I '"-H 10.1.0.15 -P 33462 -n 1 -o 0"'
tkill: setting prefix to (null)
tkill: setting suffix to (null)
tkill: got killname back: /tmp/lam-kur@cfd005/lam-killfile
tkill: removing socket file ...
tkill: socket file: /tmp/lam-kur@cfd005/lam-kernel-socketd
tkill: removing IO daemon socket file ...
tkill: IO daemon socket file: /tmp/lam-kur@cfd005/lam-io-sockettkill: f_kill = "/tmp/lam-kur@cfd005/lam-killfile"
tkill: nothing to kill: "/tmp/lam-kur@cfd005/lam-killfile"
hboot: performing tkill
hboot: tkill -d
hboot: booting...
hboot: fork /usr/bin/lamd
[1] 24059 lamd -H 10.1.0.15 -P 33462 -n 1 -o 0 -d
n-1<23204> ssi:boot:rsh: successfully launched on n1 (cfd005)
n-1<23204> ssi:boot:base:server: expecting connection from finite list
n-1<23204> ssi:boot:base:server: got connection from 10.1.0.16
n-1<23204> ssi:boot:base:server: this connection is expected (n1)
-----------------------------------------------------------------------------
The lamboot agent failed to read a message over a socket from the
newly-booted process. This should not happen (especially since TCP is
a guaranteed protocol).
*** PLEASE READ THIS ENTIRE MESSAGE, FOLLOW ITS SUGGESTIONS, AND
*** CONSULT THE "BOOTING LAM" SECTION OF THE LAM/MPI FAQ
*** (http://www.lam-mpi.org/faq/) BEFORE POSTING TO THE LAM/MPI USER'S
*** MAILING LIST.
You should probably check the following:
- Network connectivity: Ensure that messages can be passed reliably
  over TCP using random ports.
- Environment / PATH settings: Ensure that you are running the same
  version of LAM/MPI on all nodes. Sometimes premature disconnects
  (and therefore this error message) may be caused if mismatched
  versions of LAM are used on different nodes.
- Node health: Ensure that the host where the newly-booted process was
  launched is healthy and still available on the network.
-----------------------------------------------------------------------------
n-1<23204> ssi:boot:base:server: failed to connect to remote lamd!
n-1<23204> ssi:boot:base:server: closing server socket
n-1<23204> ssi:boot:base:linear: aborted!
n-1<23212> ssi:boot:open: opening
n-1<23212> ssi:boot:open: opening boot module globus
n-1<23212> ssi:boot:open: opened boot module globus
n-1<23212> ssi:boot:open: opening boot module rsh
n-1<23212> ssi:boot:open: opened boot module rsh
n-1<23212> ssi:boot:open: opening boot module slurm
n-1<23212> ssi:boot:open: opened boot module slurm
n-1<23212> ssi:boot:select: initializing boot module globus
n-1<23212> ssi:boot:globus: globus-job-run not found, globus boot will not run
n-1<23212> ssi:boot:select: boot module not available: globus
n-1<23212> ssi:boot:select: initializing boot module rsh
n-1<23212> ssi:boot:rsh: module initializing
n-1<23212> ssi:boot:rsh:agent: rsh
n-1<23212> ssi:boot:rsh:username: <same>
n-1<23212> ssi:boot:rsh:verbose: 1000
n-1<23212> ssi:boot:rsh:algorithm: linear
n-1<23212> ssi:boot:rsh:no_n: 0
n-1<23212> ssi:boot:rsh:no_profile: 0
n-1<23212> ssi:boot:rsh:fast: 0
n-1<23212> ssi:boot:rsh:ignore_stderr: 0
n-1<23212> ssi:boot:rsh:priority: 10
n-1<23212> ssi:boot:select: boot module available: rsh, priority: 10
n-1<23212> ssi:boot:select: initializing boot module slurm
n-1<23212> ssi:boot:slurm: not running under SLURM
n-1<23212> ssi:boot:select: boot module not available: slurm
n-1<23212> ssi:boot:select: finalizing boot module globus
n-1<23212> ssi:boot:globus: finalizing
n-1<23212> ssi:boot:select: closing boot module globus
n-1<23212> ssi:boot:select: finalizing boot module slurm
n-1<23212> ssi:boot:slurm: finalizing
n-1<23212> ssi:boot:select: closing boot module slurm
n-1<23212> ssi:boot:select: selected boot module rsh
n-1<23212> ssi:boot:base: looking for boot schema in following directories:
n-1<23212> ssi:boot:base: <current directory>
n-1<23212> ssi:boot:base: $TROLLIUSHOME/etc
n-1<23212> ssi:boot:base: $LAMHOME/etc
n-1<23212> ssi:boot:base: /usr/local/etc
n-1<23212> ssi:boot:base: looking for boot schema file:
n-1<23212> ssi:boot:base: hosts
n-1<23212> ssi:boot:base: found boot schema: hosts
n-1<23212> ssi:boot:rsh: found the following hosts:
n-1<23212> ssi:boot:rsh: n0 cfd004 (cpu=1)
n-1<23212> ssi:boot:rsh: n1 cfd005 (cpu=1)
n-1<23212> ssi:boot:rsh: resolved hosts:
n-1<23212> ssi:boot:rsh: n0 cfd004 --> 10.1.0.15 (origin)
n-1<23212> ssi:boot:rsh: n1 cfd005 --> 10.1.0.16
n-1<23212> ssi:boot:rsh: starting RTE procs
n-1<23212> ssi:boot:base:linear: starting
n-1<23212> ssi:boot:base:linear: booting n0 (cfd004)
n-1<23212> ssi:boot:rsh: starting wipe on (cfd004)
n-1<23212> ssi:boot:rsh: starting on n0 (cfd004): tkill -d
n-1<23212> ssi:boot:rsh: launching locally
tkill: setting prefix to (null)
tkill: setting suffix to (null)
tkill: got killname back: /tmp/lam-kur@cfd004/lam-killfile
tkill: removing socket file ...
tkill: socket file: /tmp/lam-kur@cfd004/lam-kernel-socketd
tkill: removing IO daemon socket file ...
tkill: IO daemon socket file: /tmp/lam-kur@cfd004/lam-io-sockettkill: f_kill = "/tmp/lam-kur@cfd004/lam-killfile"
tkill: killing LAM...
tkill: killing PID (SIGHUP) 23207 ...
tkill: killed
tkill: all finished
n-1<23212> ssi:boot:rsh: successfully launched on n0 (cfd004)
n-1<23212> ssi:boot:base:linear: booting n1 (cfd005)
n-1<23212> ssi:boot:rsh: starting wipe on (cfd005)
n-1<23212> ssi:boot:rsh: starting on n1 (cfd005): tkill -d
n-1<23212> ssi:boot:rsh: launching remotely
n-1<23212> ssi:boot:rsh: attempting to execute: rsh cfd005 -n 'echo $SHELL'
n-1<23212> ssi:boot:rsh: remote shell /bin/bash
n-1<23212> ssi:boot:rsh: attempting to execute: rsh cfd005 -n tkill -d
tkill: setting prefix to (null)
tkill: setting suffix to (null)
tkill: got killname back: /tmp/lam-kur@cfd005/lam-killfile
tkill: removing socket file ...
tkill: socket file: /tmp/lam-kur@cfd005/lam-kernel-socketd
tkill: removing IO daemon socket file ...
tkill: IO daemon socket file: /tmp/lam-kur@cfd005/lam-io-sockettkill: f_kill = "/tmp/lam-kur@cfd005/lam-killfile"
tkill: killing LAM...
tkill: killing PID (SIGHUP) 24059 ...
tkill: killed
tkill: all finished
n-1<23212> ssi:boot:rsh: successfully launched on n1 (cfd005)
n-1<23212> ssi:boot:base:linear: finished
n-1<23212> ssi:boot:rsh: all RTE procs started
n-1<23212> ssi:boot:rsh: finalizing
n-1<23212> ssi:boot: Closing
lamboot did NOT complete successfully
[kur@cfd004 kur]$
---------------------------------------------------------------------------------------------------------------------------------------
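The first item in the error text above asks whether messages can be passed reliably over TCP on random ports. One way this could be checked independently of LAM is sketched below in Python. This is only an illustration under assumptions: the function names `serve_once` and `check_tcp_roundtrip` are hypothetical and not part of LAM/MPI, and the demo runs both ends on the loopback address. To exercise the real path, the listening half would have to run on cfd004 and the connecting half on cfd005, using the port the listener reports.

```python
import socket
import threading


def serve_once(server: socket.socket) -> None:
    """Accept a single connection and echo back whatever arrives."""
    conn, _addr = server.accept()
    with conn:
        data = conn.recv(1024)
        conn.sendall(data)


def check_tcp_roundtrip(host: str = "127.0.0.1",
                        payload: bytes = b"lam-probe") -> bool:
    """Listen on an OS-chosen port, connect to it, and verify the echo.

    Hypothetical helper: it only shows the shape of a random-port TCP check.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as server:
        server.bind((host, 0))  # port 0 asks the OS for any free port
        server.listen(1)
        port = server.getsockname()[1]

        worker = threading.Thread(target=serve_once, args=(server,), daemon=True)
        worker.start()

        # Connect to the port that was just opened and send a short message.
        with socket.create_connection((host, port), timeout=5) as client:
            client.sendall(payload)
            echoed = client.recv(1024)

        worker.join(timeout=5)
    return echoed == payload


if __name__ == "__main__":
    print("TCP round trip ok:", check_tcp_roundtrip())
```

A successful loopback run says nothing about the link between the two machines; it only confirms the script itself works before the two halves are split across cfd004 and cfd005.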
Regards
Kuriyan