Did you try all the things suggested in the error message?
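
One detail in the debug output quoted below may be relevant to the "Environment / PATH settings" item: hboot forks /usr/local/bin/lamd on cfd004 but /usr/bin/lamd on cfd005. That could mean the two nodes run different LAM installations, although different paths alone do not prove a version mismatch. As a minimal sketch (the function name `lamd_paths` and the regular expression are illustrative assumptions, not part of LAM/MPI), saved `lamboot -d` output can be scanned for the lamd paths like this:

```python
import re

# Matches the "hboot: fork <path>" lines seen in lamboot -d output, with or
# without the "> " quoting prefix that mail clients add.
FORK_LINE = re.compile(r"^(?:>\s*)*hboot: fork (\S+)\s*$")


def lamd_paths(debug_output: str) -> list[str]:
    """Return the distinct lamd paths forked by hboot, in order of appearance."""
    paths = []
    for line in debug_output.splitlines():
        match = FORK_LINE.match(line.strip())
        if match and match.group(1) not in paths:
            paths.append(match.group(1))
    return paths


if __name__ == "__main__":
    # Two lines copied from the quoted output; the rest of the log is omitted.
    sample = "hboot: fork /usr/local/bin/lamd\nhboot: fork /usr/bin/lamd\n"
    found = lamd_paths(sample)
    print(found)
    if len(found) > 1:
        print("lamd starts from different locations; compare the installed versions")
```

If more than one path comes back, it is worth checking on each node which LAM version lives at that path before looking further.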
On Nov 3, 2007, at 6:50 AM, KURIYAN ARIMBOOR wrote:
> Hi
> I am running two Linux Red Hat 4ES machines (cfd004, cfd005). I want
> to use LAM on these machines.
> I have installed LAM on both. rsh also works fine between the two
> of them. There is no firewall active.
> When I lamboot from cfd005 to cfd004, it works fine.
> But when I lamboot from cfd004 to cfd005, it gives me an error.
>
> ---------------------------------------------------------------------------------------------------------------------------------------------
> lamboot -d hosts
> n-1<23204> ssi:boot:open: opening
> n-1<23204> ssi:boot:open: opening boot module globus
> n-1<23204> ssi:boot:open: opened boot module globus
> n-1<23204> ssi:boot:open: opening boot module rsh
> n-1<23204> ssi:boot:open: opened boot module rsh
> n-1<23204> ssi:boot:open: opening boot module slurm
> n-1<23204> ssi:boot:open: opened boot module slurm
> n-1<23204> ssi:boot:select: initializing boot module slurm
> n-1<23204> ssi:boot:slurm: not running under SLURM
> n-1<23204> ssi:boot:select: boot module not available: slurm
> n-1<23204> ssi:boot:select: initializing boot module globus
> n-1<23204> ssi:boot:globus: globus-job-run not found, globus boot
> will not run
> n-1<23204> ssi:boot:select: boot module not available: globus
> n-1<23204> ssi:boot:select: initializing boot module rsh
> n-1<23204> ssi:boot:rsh: module initializing
> n-1<23204> ssi:boot:rsh:agent: rsh
> n-1<23204> ssi:boot:rsh:username: <same>
> n-1<23204> ssi:boot:rsh:verbose: 1000
> n-1<23204> ssi:boot:rsh:algorithm: linear
> n-1<23204> ssi:boot:rsh:no_n: 0
> n-1<23204> ssi:boot:rsh:no_profile: 0
> n-1<23204> ssi:boot:rsh:fast: 0
> n-1<23204> ssi:boot:rsh:ignore_stderr: 0
> n-1<23204> ssi:boot:rsh:priority: 10
> n-1<23204> ssi:boot:select: boot module available: rsh, priority: 10
> n-1<23204> ssi:boot:select: finalizing boot module slurm
> n-1<23204> ssi:boot:slurm: finalizing
> n-1<23204> ssi:boot:select: closing boot module slurm
> n-1<23204> ssi:boot:select: finalizing boot module globus
> n-1<23204> ssi:boot:globus: finalizing
> n-1<23204> ssi:boot:select: closing boot module globus
> n-1<23204> ssi:boot:select: selected boot module rsh
>
> LAM 7.1.1/MPI 2 C++/ROMIO - Indiana University
>
> n-1<23204> ssi:boot:base: looking for boot schema in following
> directories:
> n-1<23204> ssi:boot:base: <current directory>
> n-1<23204> ssi:boot:base: $TROLLIUSHOME/etc
> n-1<23204> ssi:boot:base: $LAMHOME/etc
> n-1<23204> ssi:boot:base: /usr/local/etc
> n-1<23204> ssi:boot:base: looking for boot schema file:
> n-1<23204> ssi:boot:base: hosts
> n-1<23204> ssi:boot:base: found boot schema: hosts
> n-1<23204> ssi:boot:rsh: found the following hosts:
> n-1<23204> ssi:boot:rsh: n0 cfd004 (cpu=1)
> n-1<23204> ssi:boot:rsh: n1 cfd005 (cpu=1)
> n-1<23204> ssi:boot:rsh: resolved hosts:
> n-1<23204> ssi:boot:rsh: n0 cfd004 --> 10.1.0.15 (origin)
> n-1<23204> ssi:boot:rsh: n1 cfd005 --> 10.1.0.16
> n-1<23204> ssi:boot:rsh: starting RTE procs
> n-1<23204> ssi:boot:base:linear: starting
> n-1<23204> ssi:boot:base:server: opening server TCP socket
> n-1<23204> ssi:boot:base:server: opened port 33462
> n-1<23204> ssi:boot:base:linear: booting n0 (cfd004)
> n-1<23204> ssi:boot:rsh: starting lamd on (cfd004)
> n-1<23204> ssi:boot:rsh: starting on n0 (cfd004): hboot -t -c lam-conf.lamd -d -I -H 10.1.0.15 -P 33462 -n 0 -o 0
> n-1<23204> ssi:boot:rsh: launching locally
> hboot: performing tkill
> hboot: tkill -d
> tkill: setting prefix to (null)
> tkill: setting suffix to (null)
> tkill: got killname back: /tmp/lam-kur_at_cfd004/lam-killfile
> tkill: removing socket file ...
> tkill: socket file: /tmp/lam-kur_at_cfd004/lam-kernel-socketd
> tkill: removing IO daemon socket file ...
> tkill: IO daemon socket file: /tmp/lam-kur_at_cfd004/lam-io-
> sockettkill: f_kill = "
> /tmp/lam-kur_at_cfd004/lam-killfile"
> tkill: nothing to kill: "/tmp/lam-kur_at_cfd004/lam-killfile"
> hboot: booting...
> hboot: fork /usr/local/bin/lamd
> hboot: attempting to execute
> [1] 23207 lamd -H 10.1.0.15 -P 33462 -n 0 -o 0 -d
> n-1<23204> ssi:boot:rsh: successfully launched on n0 (cfd004)
> n-1<23204> ssi:boot:base:server: expecting connection from finite list
> n-1<23207> ssi:boot:open: opening
> n-1<23207> ssi:boot:open: opening boot module globus
> n-1<23207> ssi:boot:open: opened boot module globus
> n-1<23207> ssi:boot:open: opening boot module rsh
> n-1<23207> ssi:boot:open: opened boot module rsh
> n-1<23207> ssi:boot:open: opening boot module slurm
> n-1<23207> ssi:boot:open: opened boot module slurm
> n-1<23207> ssi:boot:select: initializing boot module slurm
> n-1<23207> ssi:boot:slurm: not running under SLURM
> n-1<23207> ssi:boot:select: boot module not available: slurm
> n-1<23207> ssi:boot:select: initializing boot module globus
> n-1<23207> ssi:boot:globus: globus-job-run not found, globus boot
> will not run
> n-1<23207> ssi:boot:select: boot module not available: globus
> n-1<23207> ssi:boot:select: initializing boot module rsh
> n-1<23207> ssi:boot:rsh: module initializing
> n-1<23207> ssi:boot:rsh:agent: rsh
> n-1<23207> ssi:boot:rsh:username: <same>
> n-1<23207> ssi:boot:rsh:verbose: 1000
> n-1<23207> ssi:boot:rsh:algorithm: linear
> n-1<23207> ssi:boot:rsh:no_n: 0
> n-1<23207> ssi:boot:rsh:no_profile: 0
> n-1<23207> ssi:boot:rsh:fast: 0
> n-1<23207> ssi:boot:rsh:ignore_stderr: 0
> n-1<23207> ssi:boot:rsh:priority: 10
> n-1<23207> ssi:boot:select: boot module available: rsh, priority: 10
> n-1<23207> ssi:boot:select: finalizing boot module slurm
> n-1<23207> ssi:boot:slurm: finalizing
> n-1<23207> ssi:boot:select: closing boot module slurm
> n-1<23207> ssi:boot:select: finalizing boot module globus
> n-1<23207> ssi:boot:globus: finalizing
> n-1<23207> ssi:boot:select: closing boot module globus
> n-1<23207> ssi:boot:select: selected boot module rsh
> n-1<23207> ssi:boot:send_lamd: getting node ID from command line
> n-1<23207> ssi:boot:send_lamd: getting agent haddr from command line
> n-1<23207> ssi:boot:send_lamd: getting agent port from command line
> n-1<23207> ssi:boot:send_lamd: getting node ID from command line
> n-1<23207> ssi:boot:send_lamd: connecting to 10.1.0.15:33462, node
> id 0
> n-1<23207> ssi:boot:send_lamd: sending dli_port 33462
> n-1<23204> ssi:boot:base:server: got connection from 10.1.0.15
> n-1<23204> ssi:boot:base:server: this connection is expected (n0)
> n-1<23204> ssi:boot:base:server: remote lamd is at 10.1.0.15:33462
> n-1<23204> ssi:boot:base:linear: booting n1 (cfd005)
> n-1<23204> ssi:boot:rsh: starting lamd on (cfd005)
> n-1<23204> ssi:boot:rsh: starting on n1 (cfd005): hboot -t -c lam-conf.lamd -d -s -I "-H 10.1.0.15 -P 33462 -n 1 -o 0"
> n-1<23204> ssi:boot:rsh: launching remotely
> n-1<23204> ssi:boot:rsh: attempting to execute: rsh cfd005 -n 'echo $SHELL'
> n-1<23204> ssi:boot:rsh: remote shell /bin/bash
> n-1<23204> ssi:boot:rsh: attempting to execute: rsh cfd005 -n hboot -t -c lam-conf.lamd -d -s -I '"-H 10.1.0.15 -P 33462 -n 1 -o 0"'
> tkill: setting prefix to (null)
> tkill: setting suffix to (null)
> tkill: got killname back: /tmp/lam-kur_at_cfd005/lam-killfile
> tkill: removing socket file ...
> tkill: socket file: /tmp/lam-kur_at_cfd005/lam-kernel-socketd
> tkill: removing IO daemon socket file ...
> tkill: IO daemon socket file: /tmp/lam-kur_at_cfd005/lam-io-
> sockettkill: f_kill = "
> /tmp/lam-kur_at_cfd005/lam-killfile"
> tkill: nothing to kill: "/tmp/lam-kur_at_cfd005/lam-killfile"
> hboot: performing tkill
> hboot: tkill -d
> hboot: booting...
> hboot: fork /usr/bin/lamd
> [1] 24059 lamd -H 10.1.0.15 -P 33462 -n 1 -o 0 -d
> n-1<23204> ssi:boot:rsh: successfully launched on n1 (cfd005)
> n-1<23204> ssi:boot:base:server: expecting connection from finite list
> n-1<23204> ssi:boot:base:server: got connection from 10.1.0.16
> n-1<23204> ssi:boot:base:server: this connection is expected (n1)
> -----------------------------------------------------------------------------
> The lamboot agent failed to read a message over a socket from the
> newly-booted process. This should not happen (especially since TCP is
> a guaranteed protocol).
>
> *** PLEASE READ THIS ENTIRE MESSAGE, FOLLOW ITS SUGGESTIONS, AND
> *** CONSULT THE "BOOTING LAM" SECTION OF THE LAM/MPI FAQ
> *** (http://www.lam-mpi.org/faq/) BEFORE POSTING TO THE LAM/MPI USER'S
> *** MAILING LIST.
>
> You should probably check the following:
>
> - Network connectivity: Ensure that messages can be passed reliably
> over TCP using random ports.
> - Environment / PATH settings: Ensure that you are running the same
> version of LAM/MPI on all nodes. Sometimes premature disconnects
> (and therefore this error message) may be caused if mismatched
> versions of LAM are used on different nodes.
> - Node health: Ensure that the host where the newly-booted process was
> launched is healthy and still available on the network.
> -----------------------------------------------------------------------------
> n-1<23204> ssi:boot:base:server: failed to connect to remote lamd!
> n-1<23204> ssi:boot:base:server: closing server socket
> n-1<23204> ssi:boot:base:linear: aborted!
> n-1<23212> ssi:boot:open: opening
> n-1<23212> ssi:boot:open: opening boot module globus
> n-1<23212> ssi:boot:open: opened boot module globus
> n-1<23212> ssi:boot:open: opening boot module rsh
> n-1<23212> ssi:boot:open: opened boot module rsh
> n-1<23212> ssi:boot:open: opening boot module slurm
> n-1<23212> ssi:boot:open: opened boot module slurm
> n-1<23212> ssi:boot:select: initializing boot module globus
> n-1<23212> ssi:boot:globus: globus-job-run not found, globus boot
> will not run
> n-1<23212> ssi:boot:select: boot module not available: globus
> n-1<23212> ssi:boot:select: initializing boot module rsh
> n-1<23212> ssi:boot:rsh: module initializing
> n-1<23212> ssi:boot:rsh:agent: rsh
> n-1<23212> ssi:boot:rsh:username: <same>
> n-1<23212> ssi:boot:rsh:verbose: 1000
> n-1<23212> ssi:boot:rsh:algorithm: linear
> n-1<23212> ssi:boot:rsh:no_n: 0
> n-1<23212> ssi:boot:rsh:no_profile: 0
> n-1<23212> ssi:boot:rsh:fast: 0
> n-1<23212> ssi:boot:rsh:ignore_stderr: 0
> n-1<23212> ssi:boot:rsh:priority: 10
> n-1<23212> ssi:boot:select: boot module available: rsh, priority: 10
> n-1<23212> ssi:boot:select: initializing boot module slurm
> n-1<23212> ssi:boot:slurm: not running under SLURM
> n-1<23212> ssi:boot:select: boot module not available: slurm
> n-1<23212> ssi:boot:select: finalizing boot module globus
> n-1<23212> ssi:boot:globus: finalizing
> n-1<23212> ssi:boot:select: closing boot module globus
> n-1<23212> ssi:boot:select: finalizing boot module slurm
> n-1<23212> ssi:boot:slurm: finalizing
> n-1<23212> ssi:boot:select: closing boot module slurm
> n-1<23212> ssi:boot:select: selected boot module rsh
> n-1<23212> ssi:boot:base: looking for boot schema in following
> directories:
> n-1<23212> ssi:boot:base: <current directory>
> n-1<23212> ssi:boot:base: $TROLLIUSHOME/etc
> n-1<23212> ssi:boot:base: $LAMHOME/etc
> n-1<23212> ssi:boot:base: /usr/local/etc
> n-1<23212> ssi:boot:base: looking for boot schema file:
> n-1<23212> ssi:boot:base: hosts
> n-1<23212> ssi:boot:base: found boot schema: hosts
> n-1<23212> ssi:boot:rsh: found the following hosts:
> n-1<23212> ssi:boot:rsh: n0 cfd004 (cpu=1)
> n-1<23212> ssi:boot:rsh: n1 cfd005 (cpu=1)
> n-1<23212> ssi:boot:rsh: resolved hosts:
> n-1<23212> ssi:boot:rsh: n0 cfd004 --> 10.1.0.15 (origin)
> n-1<23212> ssi:boot:rsh: n1 cfd005 --> 10.1.0.16
> n-1<23212> ssi:boot:rsh: starting RTE procs
> n-1<23212> ssi:boot:base:linear: starting
> n-1<23212> ssi:boot:base:linear: booting n0 (cfd004)
> n-1<23212> ssi:boot:rsh: starting wipe on (cfd004)
> n-1<23212> ssi:boot:rsh: starting on n0 (cfd004): tkill -d
> n-1<23212> ssi:boot:rsh: launching locally
> tkill: setting prefix to (null)
> tkill: setting suffix to (null)
> tkill: got killname back: /tmp/lam-kur_at_cfd004/lam-killfile
> tkill: removing socket file ...
> tkill: socket file: /tmp/lam-kur_at_cfd004/lam-kernel-socketd
> tkill: removing IO daemon socket file ...
> tkill: IO daemon socket file: /tmp/lam-kur_at_cfd004/lam-io-
> sockettkill: f_kill = "
> /tmp/lam-kur_at_cfd004/lam-killfile"
> tkill: killing LAM...
> tkill: killing PID (SIGHUP) 23207 ...
> tkill: killed
> tkill: all finished
> n-1<23212> ssi:boot:rsh: successfully launched on n0 (cfd004)
> n-1<23212> ssi:boot:base:linear: booting n1 (cfd005)
> n-1<23212> ssi:boot:rsh: starting wipe on (cfd005)
> n-1<23212> ssi:boot:rsh: starting on n1 (cfd005): tkill -d
> n-1<23212> ssi:boot:rsh: launching remotely
> n-1<23212> ssi:boot:rsh: attempting to execute: rsh cfd005 -n 'echo $SHELL'
> n-1<23212> ssi:boot:rsh: remote shell /bin/bash
> n-1<23212> ssi:boot:rsh: attempting to execute: rsh cfd005 -n tkill -d
> tkill: setting prefix to (null)
> tkill: setting suffix to (null)
> tkill: got killname back: /tmp/lam-kur_at_cfd005/lam-killfile
> tkill: removing socket file ...
> tkill: socket file: /tmp/lam-kur_at_cfd005/lam-kernel-socketd
> tkill: removing IO daemon socket file ...
> tkill: IO daemon socket file: /tmp/lam-kur_at_cfd005/lam-io-
> sockettkill: f_kill = "
> /tmp/lam-kur_at_cfd005/lam-killfile"
> tkill: killing LAM...
> tkill: killing PID (SIGHUP) 24059 ...
> tkill: killed
> tkill: all finished
> n-1<23212> ssi:boot:rsh: successfully launched on n1 (cfd005)
> n-1<23212> ssi:boot:base:linear: finished
> n-1<23212> ssi:boot:rsh: all RTE procs started
> n-1<23212> ssi:boot:rsh: finalizing
> n-1<23212> ssi:boot: Closing
> lamboot did NOT complete successfully
> [kur_at_cfd004 kur]$
> ---------------------------------------------------------------------------------------------------------------------------------------
>
> Regards
> Kuriyan
>
> This list is archived at http://www.lam-mpi.org/MailArchives/lam/
>
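
For the "Network connectivity" item in the error message above, a rough way to see what "messages can be passed reliably over TCP using random ports" means is a one-shot echo over an OS-assigned port. The sketch below is only an illustration: the names `echo_once` and `check_tcp_round_trip` are made up here, and it runs both ends on one machine over loopback. To exercise the path between cfd004 and cfd005, the listening half and the connecting half would have to run on different nodes.

```python
import socket
import threading


def echo_once(server: socket.socket) -> None:
    """Accept one connection and send back the first chunk that arrives."""
    conn, _ = server.accept()
    with conn:
        conn.sendall(conn.recv(1024))


def check_tcp_round_trip(host: str = "127.0.0.1", timeout: float = 5.0) -> bool:
    """Listen on an OS-assigned port and verify one echo round trip."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as server:
        server.bind((host, 0))  # port 0 asks the OS for any free port
        server.listen(1)
        port = server.getsockname()[1]
        worker = threading.Thread(target=echo_once, args=(server,), daemon=True)
        worker.start()
        with socket.create_connection((host, port), timeout=timeout) as client:
            client.sendall(b"ping")
            reply = client.recv(1024)
        worker.join(timeout)
        return reply == b"ping"


if __name__ == "__main__":
    print("round trip ok" if check_tcp_round_trip() else "round trip failed")
```

A successful loopback run says nothing about the link between the two nodes; it only shows the shape of the test.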
--
Jeff Squyres
Cisco Systems