[kur@cfd006 kur]$ lamboot -d hosts >> output
n-1<31520> ssi:boot:open: opening
n-1<31520> ssi:boot:open: opening boot module globus
n-1<31520> ssi:boot:open: opened boot module globus
n-1<31520> ssi:boot:open: opening boot module rsh
n-1<31520> ssi:boot:open: opened boot module rsh
n-1<31520> ssi:boot:open: opening boot module slurm
n-1<31520> ssi:boot:open: opened boot module slurm
n-1<31520> ssi:boot:select: initializing boot module slurm
n-1<31520> ssi:boot:slurm: not running under SLURM
n-1<31520> ssi:boot:select: boot module not available: slurm
n-1<31520> ssi:boot:select: initializing boot module globus
n-1<31520> ssi:boot:globus: globus-job-run not found, globus boot will not run
n-1<31520> ssi:boot:select: boot module not available: globus
n-1<31520> ssi:boot:select: initializing boot module rsh
n-1<31520> ssi:boot:rsh: module initializing
n-1<31520> ssi:boot:rsh:agent: rsh
n-1<31520> ssi:boot:rsh:username:
n-1<31520> ssi:boot:rsh:verbose: 1000
n-1<31520> ssi:boot:rsh:algorithm: linear
n-1<31520> ssi:boot:rsh:no_n: 0
n-1<31520> ssi:boot:rsh:no_profile: 0
n-1<31520> ssi:boot:rsh:fast: 0
n-1<31520> ssi:boot:rsh:ignore_stderr: 0
n-1<31520> ssi:boot:rsh:priority: 10
n-1<31520> ssi:boot:select: boot module available: rsh, priority: 10
n-1<31520> ssi:boot:select: finalizing boot module slurm
n-1<31520> ssi:boot:slurm: finalizing
n-1<31520> ssi:boot:select: closing boot module slurm
n-1<31520> ssi:boot:select: finalizing boot module globus
n-1<31520> ssi:boot:globus: finalizing
n-1<31520> ssi:boot:select: closing boot module globus
n-1<31520> ssi:boot:select: selected boot module rsh
n-1<31520> ssi:boot:base: looking for boot schema in following directories:
n-1<31520> ssi:boot:base:
n-1<31520> ssi:boot:base: $TROLLIUSHOME/etc
n-1<31520> ssi:boot:base: $LAMHOME/etc
n-1<31520> ssi:boot:base: /usr/local/etc
n-1<31520> ssi:boot:base: looking for boot schema file:
n-1<31520> ssi:boot:base: hosts
n-1<31520> ssi:boot:base: found boot schema: hosts
n-1<31520> ssi:boot:rsh: found the following hosts:
n-1<31520> ssi:boot:rsh: n0 cfd006 (cpu=1)
n-1<31520> ssi:boot:rsh: n1 cfd003 (cpu=1)
n-1<31520> ssi:boot:rsh: resolved hosts:
n-1<31520> ssi:boot:rsh: n0 cfd006 --> 10.1.0.17 (origin)
n-1<31520> ssi:boot:rsh: n1 cfd003 --> 10.1.0.13
n-1<31520> ssi:boot:rsh: starting RTE procs
n-1<31520> ssi:boot:base:linear: starting
n-1<31520> ssi:boot:base:server: opening server TCP socket
n-1<31520> ssi:boot:base:server: opened port 32952
n-1<31520> ssi:boot:base:linear: booting n0 (cfd006)
n-1<31520> ssi:boot:rsh: starting lamd on (cfd006)
n-1<31520> ssi:boot:rsh: starting on n0 (cfd006): hboot -t -c lam-conf.lamd -d -I -H 10.1.0.17 -P 32952 -n 0 -o 0
n-1<31520> ssi:boot:rsh: launching locally
n-1<31520> ssi:boot:rsh: successfully launched on n0 (cfd006)
n-1<31520> ssi:boot:base:server: expecting connection from finite list
n-1<31523> ssi:boot:open: opening
n-1<31523> ssi:boot:open: opening boot module globus
n-1<31523> ssi:boot:open: opened boot module globus
n-1<31523> ssi:boot:open: opening boot module rsh
n-1<31523> ssi:boot:open: opened boot module rsh
n-1<31523> ssi:boot:open: opening boot module slurm
n-1<31523> ssi:boot:open: opened boot module slurm
n-1<31523> ssi:boot:select: initializing boot module slurm
n-1<31523> ssi:boot:slurm: not running under SLURM
n-1<31523> ssi:boot:select: boot module not available: slurm
n-1<31523> ssi:boot:select: initializing boot module globus
n-1<31523> ssi:boot:globus: globus-job-run not found, globus boot will not run
n-1<31523> ssi:boot:select: boot module not available: globus
n-1<31523> ssi:boot:select: initializing boot module rsh
n-1<31523> ssi:boot:rsh: module initializing
n-1<31523> ssi:boot:rsh:agent: rsh
n-1<31523> ssi:boot:rsh:username:
n-1<31523> ssi:boot:rsh:verbose: 1000
n-1<31523> ssi:boot:rsh:algorithm: linear
n-1<31523> ssi:boot:rsh:no_n: 0
n-1<31523> ssi:boot:rsh:no_profile: 0
n-1<31523> ssi:boot:rsh:fast: 0
n-1<31523> ssi:boot:rsh:ignore_stderr: 0
n-1<31523> ssi:boot:rsh:priority: 10
n-1<31523> ssi:boot:select: boot module available: rsh, priority: 10
n-1<31523> ssi:boot:select: finalizing boot module slurm
n-1<31523> ssi:boot:slurm: finalizing
n-1<31523> ssi:boot:select: closing boot module slurm
n-1<31523> ssi:boot:select: finalizing boot module globus
n-1<31523> ssi:boot:globus: finalizing
n-1<31523> ssi:boot:select: closing boot module globus
n-1<31523> ssi:boot:select: selected boot module rsh
n-1<31523> ssi:boot:send_lamd: getting node ID from command line
n-1<31523> ssi:boot:send_lamd: getting agent haddr from command line
n-1<31523> ssi:boot:send_lamd: getting agent port from command line
n-1<31523> ssi:boot:send_lamd: getting node ID from command line
n-1<31523> ssi:boot:send_lamd: connecting to 10.1.0.17:32952, node id 0
n-1<31523> ssi:boot:send_lamd: sending dli_port 33257
n-1<31520> ssi:boot:base:server: got connection from 10.1.0.17
n-1<31520> ssi:boot:base:server: this connection is expected (n0)
n-1<31520> ssi:boot:base:server: remote lamd is at 10.1.0.17:33257
n-1<31520> ssi:boot:base:linear: booting n1 (cfd003)
n-1<31520> ssi:boot:rsh: starting lamd on (cfd003)
n-1<31520> ssi:boot:rsh: starting on n1 (cfd003): hboot -t -c lam-conf.lamd -d -s -I "-H 10.1.0.17 -P 32952 -n 1 -o 0"
n-1<31520> ssi:boot:rsh: launching remotely
n-1<31520> ssi:boot:rsh: attempting to execute: rsh cfd003 -n 'echo $SHELL'
n-1<31520> ssi:boot:rsh: remote shell /bin/bash
n-1<31520> ssi:boot:rsh: attempting to execute: rsh cfd003 -n hboot -t -c lam-conf.lamd -d -s -I '"-H 10.1.0.17 -P 32952 -n 1 -o 0"'
n-1<31520> ssi:boot:rsh: successfully launched on n1 (cfd003)
n-1<31520> ssi:boot:base:server: expecting connection from finite list
n-1<31520> ssi:boot:base:server: got connection from 10.1.0.13
n-1<31520> ssi:boot:base:server: this connection is expected (n1)
-----------------------------------------------------------------------------
The lamboot agent failed to read a message over a socket from the
newly-booted process. This should not happen (especially since TCP is
a guaranteed protocol).

*** PLEASE READ THIS ENTIRE MESSAGE, FOLLOW ITS SUGGESTIONS, AND
*** CONSULT THE "BOOTING LAM" SECTION OF THE LAM/MPI FAQ
*** (http://www.lam-mpi.org/faq/) BEFORE POSTING TO THE LAM/MPI USER'S
*** MAILING LIST.

You should probably check the following:

- Network connectivity: Ensure that messages can be passed reliably
  over TCP using random ports.

- Environment / PATH settings: Ensure that you are running the same
  version of LAM/MPI on all nodes. Sometimes premature disconnects
  (and therefore this error message) may be caused if mismatched
  versions of LAM are used on different nodes.

- Node health: Ensure that the host where the newly-booted process was
  launched is healthy and still available on the network.
-----------------------------------------------------------------------------
n-1<31520> ssi:boot:base:server: failed to connect to remote lamd!
n-1<31520> ssi:boot:base:server: closing server socket
n-1<31520> ssi:boot:base:linear: aborted!
n-1<31528> ssi:boot:open: opening
n-1<31528> ssi:boot:open: opening boot module globus
n-1<31528> ssi:boot:open: opened boot module globus
n-1<31528> ssi:boot:open: opening boot module rsh
n-1<31528> ssi:boot:open: opened boot module rsh
n-1<31528> ssi:boot:open: opening boot module slurm
n-1<31528> ssi:boot:open: opened boot module slurm
n-1<31528> ssi:boot:select: initializing boot module globus
n-1<31528> ssi:boot:globus: globus-job-run not found, globus boot will not run
n-1<31528> ssi:boot:select: boot module not available: globus
n-1<31528> ssi:boot:select: initializing boot module rsh
n-1<31528> ssi:boot:rsh: module initializing
n-1<31528> ssi:boot:rsh:agent: rsh
n-1<31528> ssi:boot:rsh:username:
n-1<31528> ssi:boot:rsh:verbose: 1000
n-1<31528> ssi:boot:rsh:algorithm: linear
n-1<31528> ssi:boot:rsh:no_n: 0
n-1<31528> ssi:boot:rsh:no_profile: 0
n-1<31528> ssi:boot:rsh:fast: 0
n-1<31528> ssi:boot:rsh:ignore_stderr: 0
n-1<31528> ssi:boot:rsh:priority: 10
n-1<31528> ssi:boot:select: boot module available: rsh, priority: 10
n-1<31528> ssi:boot:select: initializing boot module slurm
n-1<31528> ssi:boot:slurm: not running under SLURM
n-1<31528> ssi:boot:select: boot module not available: slurm
n-1<31528> ssi:boot:select: finalizing boot module globus
n-1<31528> ssi:boot:globus: finalizing
n-1<31528> ssi:boot:select: closing boot module globus
n-1<31528> ssi:boot:select: finalizing boot module slurm
n-1<31528> ssi:boot:slurm: finalizing
n-1<31528> ssi:boot:select: closing boot module slurm
n-1<31528> ssi:boot:select: selected boot module rsh
n-1<31528> ssi:boot:base: looking for boot schema in following directories:
n-1<31528> ssi:boot:base:
n-1<31528> ssi:boot:base: $TROLLIUSHOME/etc
n-1<31528> ssi:boot:base: $LAMHOME/etc
n-1<31528> ssi:boot:base: /usr/local/etc
n-1<31528> ssi:boot:base: looking for boot schema file:
n-1<31528> ssi:boot:base: hosts
n-1<31528> ssi:boot:base: found boot schema: hosts
n-1<31528> ssi:boot:rsh: found the following hosts:
n-1<31528> ssi:boot:rsh: n0 cfd006 (cpu=1)
n-1<31528> ssi:boot:rsh: n1 cfd003 (cpu=1)
n-1<31528> ssi:boot:rsh: resolved hosts:
n-1<31528> ssi:boot:rsh: n0 cfd006 --> 10.1.0.17 (origin)
n-1<31528> ssi:boot:rsh: n1 cfd003 --> 10.1.0.13
n-1<31528> ssi:boot:rsh: starting RTE procs
n-1<31528> ssi:boot:base:linear: starting
n-1<31528> ssi:boot:base:linear: booting n0 (cfd006)
n-1<31528> ssi:boot:rsh: starting wipe on (cfd006)
n-1<31528> ssi:boot:rsh: starting on n0 (cfd006): tkill -d
n-1<31528> ssi:boot:rsh: launching locally
n-1<31528> ssi:boot:rsh: successfully launched on n0 (cfd006)
n-1<31528> ssi:boot:base:linear: booting n1 (cfd003)
n-1<31528> ssi:boot:rsh: starting wipe on (cfd003)
n-1<31528> ssi:boot:rsh: starting on n1 (cfd003): tkill -d
n-1<31528> ssi:boot:rsh: launching remotely
n-1<31528> ssi:boot:rsh: attempting to execute: rsh cfd003 -n 'echo $SHELL'
n-1<31528> ssi:boot:rsh: remote shell /bin/bash
n-1<31528> ssi:boot:rsh: attempting to execute: rsh cfd003 -n tkill -d
n-1<31528> ssi:boot:rsh: successfully launched on n1 (cfd003)
n-1<31528> ssi:boot:base:linear: finished
n-1<31528> ssi:boot:rsh: all RTE procs started
n-1<31528> ssi:boot:rsh: finalizing
n-1<31528> ssi:boot: Closing
lamboot did NOT complete successfully
[kur@cfd006 kur]$ ll