Hi,
I am running two Red Hat 4 ES Linux machines (cfd004 and cfd005), and I
want to use LAM on them.
I have installed LAM on both. rsh works fine between the two machines,
and no firewall is active.
When I run lamboot from cfd005 to cfd004, it works fine.
But when I run lamboot from cfd004 to cfd005, it gives me the error below.
------------------------------------------------------------------------
lamboot -d hosts
n-1<23204> ssi:boot:open: opening
n-1<23204> ssi:boot:open: opening boot module globus
n-1<23204> ssi:boot:open: opened boot module globus
n-1<23204> ssi:boot:open: opening boot module rsh
n-1<23204> ssi:boot:open: opened boot module rsh
n-1<23204> ssi:boot:open: opening boot module slurm
n-1<23204> ssi:boot:open: opened boot module slurm
n-1<23204> ssi:boot:select: initializing boot module slurm
n-1<23204> ssi:boot:slurm: not running under SLURM
n-1<23204> ssi:boot:select: boot module not available: slurm
n-1<23204> ssi:boot:select: initializing boot module globus
n-1<23204> ssi:boot:globus: globus-job-run not found, globus boot will not run
n-1<23204> ssi:boot:select: boot module not available: globus
n-1<23204> ssi:boot:select: initializing boot module rsh
n-1<23204> ssi:boot:rsh: module initializing
n-1<23204> ssi:boot:rsh:agent: rsh
n-1<23204> ssi:boot:rsh:username: <same>
n-1<23204> ssi:boot:rsh:verbose: 1000
n-1<23204> ssi:boot:rsh:algorithm: linear
n-1<23204> ssi:boot:rsh:no_n: 0
n-1<23204> ssi:boot:rsh:no_profile: 0
n-1<23204> ssi:boot:rsh:fast: 0
n-1<23204> ssi:boot:rsh:ignore_stderr: 0
n-1<23204> ssi:boot:rsh:priority: 10
n-1<23204> ssi:boot:select: boot module available: rsh, priority: 10
n-1<23204> ssi:boot:select: finalizing boot module slurm
n-1<23204> ssi:boot:slurm: finalizing
n-1<23204> ssi:boot:select: closing boot module slurm
n-1<23204> ssi:boot:select: finalizing boot module globus
n-1<23204> ssi:boot:globus: finalizing
n-1<23204> ssi:boot:select: closing boot module globus
n-1<23204> ssi:boot:select: selected boot module rsh
LAM 7.1.1/MPI 2 C++/ROMIO - Indiana University
n-1<23204> ssi:boot:base: looking for boot schema in following directories:
n-1<23204> ssi:boot:base: <current directory>
n-1<23204> ssi:boot:base: $TROLLIUSHOME/etc
n-1<23204> ssi:boot:base: $LAMHOME/etc
n-1<23204> ssi:boot:base: /usr/local/etc
n-1<23204> ssi:boot:base: looking for boot schema file:
n-1<23204> ssi:boot:base: hosts
n-1<23204> ssi:boot:base: found boot schema: hosts
n-1<23204> ssi:boot:rsh: found the following hosts:
n-1<23204> ssi:boot:rsh: n0 cfd004 (cpu=1)
n-1<23204> ssi:boot:rsh: n1 cfd005 (cpu=1)
n-1<23204> ssi:boot:rsh: resolved hosts:
n-1<23204> ssi:boot:rsh: n0 cfd004 --> 10.1.0.15 (origin)
n-1<23204> ssi:boot:rsh: n1 cfd005 --> 10.1.0.16
n-1<23204> ssi:boot:rsh: starting RTE procs
n-1<23204> ssi:boot:base:linear: starting
n-1<23204> ssi:boot:base:server: opening server TCP socket
n-1<23204> ssi:boot:base:server: opened port 33462
n-1<23204> ssi:boot:base:linear: booting n0 (cfd004)
n-1<23204> ssi:boot:rsh: starting lamd on (cfd004)
n-1<23204> ssi:boot:rsh: starting on n0 (cfd004): hboot -t -c lam-conf.lamd -d -I -H 10.1.0.15 -P 33462 -n 0 -o 0
n-1<23204> ssi:boot:rsh: launching locally
hboot: performing tkill
hboot: tkill -d
tkill: setting prefix to (null)
tkill: setting suffix to (null)
tkill: got killname back: /tmp/lam-kur_at_cfd004/lam-killfile
tkill: removing socket file ...
tkill: socket file: /tmp/lam-kur_at_cfd004/lam-kernel-socketd
tkill: removing IO daemon socket file ...
tkill: IO daemon socket file: /tmp/lam-kur_at_cfd004/lam-io-sockettkill: f_kill = "/tmp/lam-kur_at_cfd004/lam-killfile"
tkill: nothing to kill: "/tmp/lam-kur_at_cfd004/lam-killfile"
hboot: booting...
hboot: fork /usr/local/bin/lamd
hboot: attempting to execute
[1] 23207 lamd -H 10.1.0.15 -P 33462 -n 0 -o 0 -d
n-1<23204> ssi:boot:rsh: successfully launched on n0 (cfd004)
n-1<23204> ssi:boot:base:server: expecting connection from finite list
n-1<23207> ssi:boot:open: opening
n-1<23207> ssi:boot:open: opening boot module globus
n-1<23207> ssi:boot:open: opened boot module globus
n-1<23207> ssi:boot:open: opening boot module rsh
n-1<23207> ssi:boot:open: opened boot module rsh
n-1<23207> ssi:boot:open: opening boot module slurm
n-1<23207> ssi:boot:open: opened boot module slurm
n-1<23207> ssi:boot:select: initializing boot module slurm
n-1<23207> ssi:boot:slurm: not running under SLURM
n-1<23207> ssi:boot:select: boot module not available: slurm
n-1<23207> ssi:boot:select: initializing boot module globus
n-1<23207> ssi:boot:globus: globus-job-run not found, globus boot will not run
n-1<23207> ssi:boot:select: boot module not available: globus
n-1<23207> ssi:boot:select: initializing boot module rsh
n-1<23207> ssi:boot:rsh: module initializing
n-1<23207> ssi:boot:rsh:agent: rsh
n-1<23207> ssi:boot:rsh:username: <same>
n-1<23207> ssi:boot:rsh:verbose: 1000
n-1<23207> ssi:boot:rsh:algorithm: linear
n-1<23207> ssi:boot:rsh:no_n: 0
n-1<23207> ssi:boot:rsh:no_profile: 0
n-1<23207> ssi:boot:rsh:fast: 0
n-1<23207> ssi:boot:rsh:ignore_stderr: 0
n-1<23207> ssi:boot:rsh:priority: 10
n-1<23207> ssi:boot:select: boot module available: rsh, priority: 10
n-1<23207> ssi:boot:select: finalizing boot module slurm
n-1<23207> ssi:boot:slurm: finalizing
n-1<23207> ssi:boot:select: closing boot module slurm
n-1<23207> ssi:boot:select: finalizing boot module globus
n-1<23207> ssi:boot:globus: finalizing
n-1<23207> ssi:boot:select: closing boot module globus
n-1<23207> ssi:boot:select: selected boot module rsh
n-1<23207> ssi:boot:send_lamd: getting node ID from command line
n-1<23207> ssi:boot:send_lamd: getting agent haddr from command line
n-1<23207> ssi:boot:send_lamd: getting agent port from command line
n-1<23207> ssi:boot:send_lamd: getting node ID from command line
n-1<23207> ssi:boot:send_lamd: connecting to 10.1.0.15:33462, node id 0
n-1<23207> ssi:boot:send_lamd: sending dli_port 33462
n-1<23204> ssi:boot:base:server: got connection from 10.1.0.15
n-1<23204> ssi:boot:base:server: this connection is expected (n0)
n-1<23204> ssi:boot:base:server: remote lamd is at 10.1.0.15:33462
n-1<23204> ssi:boot:base:linear: booting n1 (cfd005)
n-1<23204> ssi:boot:rsh: starting lamd on (cfd005)
n-1<23204> ssi:boot:rsh: starting on n1 (cfd005): hboot -t -c lam-conf.lamd -d -s -I "-H 10.1.0.15 -P 33462 -n 1 -o 0"
n-1<23204> ssi:boot:rsh: launching remotely
n-1<23204> ssi:boot:rsh: attempting to execute: rsh cfd005 -n 'echo $SHELL'
n-1<23204> ssi:boot:rsh: remote shell /bin/bash
n-1<23204> ssi:boot:rsh: attempting to execute: rsh cfd005 -n hboot -t -c lam-conf.lamd -d -s -I '"-H 10.1.0.15 -P 33462 -n 1 -o 0"'
tkill: setting prefix to (null)
tkill: setting suffix to (null)
tkill: got killname back: /tmp/lam-kur_at_cfd005/lam-killfile
tkill: removing socket file ...
tkill: socket file: /tmp/lam-kur_at_cfd005/lam-kernel-socketd
tkill: removing IO daemon socket file ...
tkill: IO daemon socket file: /tmp/lam-kur_at_cfd005/lam-io-sockettkill: f_kill = "/tmp/lam-kur_at_cfd005/lam-killfile"
tkill: nothing to kill: "/tmp/lam-kur_at_cfd005/lam-killfile"
hboot: performing tkill
hboot: tkill -d
hboot: booting...
hboot: fork /usr/bin/lamd
[1] 24059 lamd -H 10.1.0.15 -P 33462 -n 1 -o 0 -d
n-1<23204> ssi:boot:rsh: successfully launched on n1 (cfd005)
n-1<23204> ssi:boot:base:server: expecting connection from finite list
n-1<23204> ssi:boot:base:server: got connection from 10.1.0.16
n-1<23204> ssi:boot:base:server: this connection is expected (n1)
-----------------------------------------------------------------------------
The lamboot agent failed to read a message over a socket from the
newly-booted process. This should not happen (especially since TCP is
a guaranteed protocol).
*** PLEASE READ THIS ENTIRE MESSAGE, FOLLOW ITS SUGGESTIONS, AND
*** CONSULT THE "BOOTING LAM" SECTION OF THE LAM/MPI FAQ
*** (http://www.lam-mpi.org/faq/) BEFORE POSTING TO THE LAM/MPI USER'S
*** MAILING LIST.
You should probably check the following:
- Network connectivity: Ensure that messages can be passed reliably
over TCP using random ports.
- Environment / PATH settings: Ensure that you are running the same
version of LAM/MPI on all nodes. Sometimes premature disconnects
(and therefore this error message) may be caused if mismatched
versions of LAM are used on different nodes.
- Node health: Ensure that the host where the newly-booted process was
launched is healthy and still available on the network.
-----------------------------------------------------------------------------
n-1<23204> ssi:boot:base:server: failed to connect to remote lamd!
n-1<23204> ssi:boot:base:server: closing server socket
n-1<23204> ssi:boot:base:linear: aborted!
n-1<23212> ssi:boot:open: opening
n-1<23212> ssi:boot:open: opening boot module globus
n-1<23212> ssi:boot:open: opened boot module globus
n-1<23212> ssi:boot:open: opening boot module rsh
n-1<23212> ssi:boot:open: opened boot module rsh
n-1<23212> ssi:boot:open: opening boot module slurm
n-1<23212> ssi:boot:open: opened boot module slurm
n-1<23212> ssi:boot:select: initializing boot module globus
n-1<23212> ssi:boot:globus: globus-job-run not found, globus boot will not run
n-1<23212> ssi:boot:select: boot module not available: globus
n-1<23212> ssi:boot:select: initializing boot module rsh
n-1<23212> ssi:boot:rsh: module initializing
n-1<23212> ssi:boot:rsh:agent: rsh
n-1<23212> ssi:boot:rsh:username: <same>
n-1<23212> ssi:boot:rsh:verbose: 1000
n-1<23212> ssi:boot:rsh:algorithm: linear
n-1<23212> ssi:boot:rsh:no_n: 0
n-1<23212> ssi:boot:rsh:no_profile: 0
n-1<23212> ssi:boot:rsh:fast: 0
n-1<23212> ssi:boot:rsh:ignore_stderr: 0
n-1<23212> ssi:boot:rsh:priority: 10
n-1<23212> ssi:boot:select: boot module available: rsh, priority: 10
n-1<23212> ssi:boot:select: initializing boot module slurm
n-1<23212> ssi:boot:slurm: not running under SLURM
n-1<23212> ssi:boot:select: boot module not available: slurm
n-1<23212> ssi:boot:select: finalizing boot module globus
n-1<23212> ssi:boot:globus: finalizing
n-1<23212> ssi:boot:select: closing boot module globus
n-1<23212> ssi:boot:select: finalizing boot module slurm
n-1<23212> ssi:boot:slurm: finalizing
n-1<23212> ssi:boot:select: closing boot module slurm
n-1<23212> ssi:boot:select: selected boot module rsh
n-1<23212> ssi:boot:base: looking for boot schema in following directories:
n-1<23212> ssi:boot:base: <current directory>
n-1<23212> ssi:boot:base: $TROLLIUSHOME/etc
n-1<23212> ssi:boot:base: $LAMHOME/etc
n-1<23212> ssi:boot:base: /usr/local/etc
n-1<23212> ssi:boot:base: looking for boot schema file:
n-1<23212> ssi:boot:base: hosts
n-1<23212> ssi:boot:base: found boot schema: hosts
n-1<23212> ssi:boot:rsh: found the following hosts:
n-1<23212> ssi:boot:rsh: n0 cfd004 (cpu=1)
n-1<23212> ssi:boot:rsh: n1 cfd005 (cpu=1)
n-1<23212> ssi:boot:rsh: resolved hosts:
n-1<23212> ssi:boot:rsh: n0 cfd004 --> 10.1.0.15 (origin)
n-1<23212> ssi:boot:rsh: n1 cfd005 --> 10.1.0.16
n-1<23212> ssi:boot:rsh: starting RTE procs
n-1<23212> ssi:boot:base:linear: starting
n-1<23212> ssi:boot:base:linear: booting n0 (cfd004)
n-1<23212> ssi:boot:rsh: starting wipe on (cfd004)
n-1<23212> ssi:boot:rsh: starting on n0 (cfd004): tkill -d
n-1<23212> ssi:boot:rsh: launching locally
tkill: setting prefix to (null)
tkill: setting suffix to (null)
tkill: got killname back: /tmp/lam-kur_at_cfd004/lam-killfile
tkill: removing socket file ...
tkill: socket file: /tmp/lam-kur_at_cfd004/lam-kernel-socketd
tkill: removing IO daemon socket file ...
tkill: IO daemon socket file: /tmp/lam-kur_at_cfd004/lam-io-sockettkill: f_kill = "/tmp/lam-kur_at_cfd004/lam-killfile"
tkill: killing LAM...
tkill: killing PID (SIGHUP) 23207 ...
tkill: killed
tkill: all finished
n-1<23212> ssi:boot:rsh: successfully launched on n0 (cfd004)
n-1<23212> ssi:boot:base:linear: booting n1 (cfd005)
n-1<23212> ssi:boot:rsh: starting wipe on (cfd005)
n-1<23212> ssi:boot:rsh: starting on n1 (cfd005): tkill -d
n-1<23212> ssi:boot:rsh: launching remotely
n-1<23212> ssi:boot:rsh: attempting to execute: rsh cfd005 -n 'echo $SHELL'
n-1<23212> ssi:boot:rsh: remote shell /bin/bash
n-1<23212> ssi:boot:rsh: attempting to execute: rsh cfd005 -n tkill -d
tkill: setting prefix to (null)
tkill: setting suffix to (null)
tkill: got killname back: /tmp/lam-kur_at_cfd005/lam-killfile
tkill: removing socket file ...
tkill: socket file: /tmp/lam-kur_at_cfd005/lam-kernel-socketd
tkill: removing IO daemon socket file ...
tkill: IO daemon socket file: /tmp/lam-kur_at_cfd005/lam-io-sockettkill: f_kill = "/tmp/lam-kur_at_cfd005/lam-killfile"
tkill: killing LAM...
tkill: killing PID (SIGHUP) 24059 ...
tkill: killed
tkill: all finished
n-1<23212> ssi:boot:rsh: successfully launched on n1 (cfd005)
n-1<23212> ssi:boot:base:linear: finished
n-1<23212> ssi:boot:rsh: all RTE procs started
n-1<23212> ssi:boot:rsh: finalizing
n-1<23212> ssi:boot: Closing
lamboot did NOT complete successfully
[kur_at_cfd004 kur]$
------------------------------------------------------------------------
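The error text above lists mismatched LAM installations as one possible cause, and the hboot lines in this output show lamd being forked from /usr/local/bin/lamd on cfd004 but from /usr/bin/lamd on cfd005. Below is a minimal Python sketch for pulling those paths out of a `lamboot -d` log. It assumes the "starting lamd on (host)" and "hboot: fork <path>" lines appear in that order, as they do here. The function name `lamd_paths_by_host` is made up for illustration and is not part of LAM. A path difference alone does not prove a version mismatch; it only shows where to look.

```python
import re


def lamd_paths_by_host(log_text):
    """Map each host to the lamd path that hboot reported forking there.

    Assumption: each "starting lamd on (host)" line is followed, before the
    next such line, by the "hboot: fork <path>" line for that host.
    """
    paths = {}
    host = None
    for line in log_text.splitlines():
        started = re.search(r"ssi:boot:rsh: starting lamd on \((\S+)\)", line)
        if started:
            host = started.group(1)
            continue
        forked = re.match(r"hboot: fork (\S+)", line.strip())
        if forked and host is not None:
            paths[host] = forked.group(1)
    return paths


# Lines copied from the debug output above.
sample = """\
n-1<23204> ssi:boot:rsh: starting lamd on (cfd004)
hboot: fork /usr/local/bin/lamd
n-1<23204> ssi:boot:rsh: starting lamd on (cfd005)
hboot: fork /usr/bin/lamd
"""

paths = lamd_paths_by_host(sample)
for name, path in paths.items():
    print(name, path)
if len(set(paths.values())) > 1:
    print("lamd is installed in different locations; the versions may differ")
```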
Regards
Kuriyan