n-1<31166> ssi:boot:open: opening
n-1<31166> ssi:boot:open: opening boot module globus
n-1<31166> ssi:boot:open: opened boot module globus
n-1<31166> ssi:boot:open: opening boot module rsh
n-1<31166> ssi:boot:open: opened boot module rsh
n-1<31166> ssi:boot:open: opening boot module slurm
n-1<31166> ssi:boot:open: opened boot module slurm
n-1<31166> ssi:boot:open: opening boot module tm
n-1<31166> ssi:boot:open: opened boot module tm
n-1<31166> ssi:boot:select: initializing boot module tm
n-1<31166> ssi:boot:tm: not running under PBS
n-1<31166> ssi:boot:select: boot module not available: tm
n-1<31166> ssi:boot:select: initializing boot module slurm
n-1<31166> ssi:boot:slurm: not running under SLURM
n-1<31166> ssi:boot:select: boot module not available: slurm
n-1<31166> ssi:boot:select: initializing boot module rsh
n-1<31166> ssi:boot:rsh: module initializing
n-1<31166> ssi:boot:rsh:agent: /usr/bin/rsh
n-1<31166> ssi:boot:rsh:username:
n-1<31166> ssi:boot:rsh:verbose: 1000
n-1<31166> ssi:boot:rsh:algorithm: linear
n-1<31166> ssi:boot:rsh:no_n: 0
n-1<31166> ssi:boot:rsh:no_profile: 0
n-1<31166> ssi:boot:rsh:fast: 0
n-1<31166> ssi:boot:rsh:ignore_stderr: 0
n-1<31166> ssi:boot:rsh:priority: 10
n-1<31166> ssi:boot:select: boot module available: rsh, priority: 10
n-1<31166> ssi:boot:select: initializing boot module globus
n-1<31166> ssi:boot:globus: globus-job-run not found, globus boot will not run
n-1<31166> ssi:boot:select: boot module not available: globus
n-1<31166> ssi:boot:select: finalizing boot module tm
n-1<31166> ssi:boot:tm: finalizing
n-1<31166> ssi:boot:select: closing boot module tm
n-1<31166> ssi:boot:select: finalizing boot module slurm
n-1<31166> ssi:boot:slurm: finalizing
n-1<31166> ssi:boot:select: closing boot module slurm
n-1<31166> ssi:boot:select: finalizing boot module globus
n-1<31166> ssi:boot:globus: finalizing
n-1<31166> ssi:boot:select: closing boot module globus
n-1<31166> ssi:boot:select: selected boot module rsh
n-1<31166> ssi:boot:base: looking for boot schema in following directories:
n-1<31166> ssi:boot:base:
n-1<31166> ssi:boot:base: $TROLLIUSHOME/etc
n-1<31166> ssi:boot:base: $LAMHOME/etc
n-1<31166> ssi:boot:base: /usr/local/lam-7.1.4/etc
n-1<31166> ssi:boot:base: looking for boot schema file:
n-1<31166> ssi:boot:base: hostfile
n-1<31166> ssi:boot:base: found boot schema: hostfile
n-1<31166> ssi:boot:rsh: found the following hosts:
n-1<31166> ssi:boot:rsh: n0 reconcluster.bme.columbia.edu (cpu=1)
n-1<31166> ssi:boot:rsh: n1 10.0.0.2 (cpu=1)
n-1<31166> ssi:boot:rsh: resolved hosts:
n-1<31166> ssi:boot:rsh: n0 reconcluster.bme.columbia.edu --> 128.59.145.83 (origin)
n-1<31166> ssi:boot:rsh: n1 10.0.0.2 --> 10.0.0.2
n-1<31166> ssi:boot:rsh: starting RTE procs
n-1<31166> ssi:boot:base:linear: starting
n-1<31166> ssi:boot:base:server: opening server TCP socket
n-1<31166> ssi:boot:base:server: opened port 46722
n-1<31166> ssi:boot:base:linear: booting n0 (reconcluster.bme.columbia.edu)
n-1<31166> ssi:boot:rsh: starting lamd on (reconcluster.bme.columbia.edu)
n-1<31166> ssi:boot:rsh: starting on n0 (reconcluster.bme.columbia.edu): hboot -t -c lam-conf.lamd -d -I -H 128.59.145.83 -P 46722 -n 0 -o 0
n-1<31166> ssi:boot:rsh: launching locally
tkill: setting prefix to (null)
tkill: setting suffix to (null)
tkill: got killname back: /tmp/lam-xuejun@reconcluster/lam-killfile
tkill: f_kill = "/tmp/lam-xuejun@reconcluster/lam-killfile"
tkill: nothing to kill: "/tmp/lam-xuejun@reconcluster/lam-killfile"
hboot: performing tkill
hboot: tkill -d
hboot: booting...
hboot: fork /usr/local/mpi/bin/lamd
[1] 31169 lamd -H 128.59.145.83 -P 46722 -n 0 -o 0 -d
n-1<31166> ssi:boot:rsh: successfully launched on n0 (reconcluster.bme.columbia.edu)
n-1<31166> ssi:boot:base:server: expecting connection from finite list
n-1<31169> ssi:boot:open: opening
n-1<31169> ssi:boot:open: opening boot module globus
n-1<31169> ssi:boot:open: opened boot module globus
n-1<31169> ssi:boot:open: opening boot module rsh
n-1<31169> ssi:boot:open: opened boot module rsh
n-1<31169> ssi:boot:open: opening boot module slurm
n-1<31169> ssi:boot:open: opened boot module slurm
n-1<31169> ssi:boot:open: opening boot module tm
n-1<31169> ssi:boot:open: opened boot module tm
n-1<31169> ssi:boot:select: initializing boot module tm
n-1<31169> ssi:boot:tm: not running under PBS
n-1<31169> ssi:boot:select: boot module not available: tm
n-1<31169> ssi:boot:select: initializing boot module slurm
n-1<31169> ssi:boot:slurm: not running under SLURM
n-1<31169> ssi:boot:select: boot module not available: slurm
n-1<31169> ssi:boot:select: initializing boot module rsh
n-1<31169> ssi:boot:rsh: module initializing
n-1<31169> ssi:boot:rsh:agent: /usr/bin/rsh
n-1<31169> ssi:boot:rsh:username:
n-1<31169> ssi:boot:rsh:verbose: 1000
n-1<31169> ssi:boot:rsh:algorithm: linear
n-1<31169> ssi:boot:rsh:no_n: 0
n-1<31169> ssi:boot:rsh:no_profile: 0
n-1<31169> ssi:boot:rsh:fast: 0
n-1<31169> ssi:boot:rsh:ignore_stderr: 0
n-1<31169> ssi:boot:rsh:priority: 10
n-1<31169> ssi:boot:select: boot module available: rsh, priority: 10
n-1<31169> ssi:boot:select: initializing boot module globus
n-1<31169> ssi:boot:globus: globus-job-run not found, globus boot will not run
n-1<31169> ssi:boot:select: boot module not available: globus
n-1<31169> ssi:boot:select: finalizing boot module tm
n-1<31169> ssi:boot:tm: finalizing
n-1<31169> ssi:boot:select: closing boot module tm
n-1<31169> ssi:boot:select: finalizing boot module slurm
n-1<31169> ssi:boot:slurm: finalizing
n-1<31169> ssi:boot:select: closing boot module slurm
n-1<31169> ssi:boot:select: finalizing boot module globus
n-1<31169> ssi:boot:globus: finalizing
n-1<31169> ssi:boot:select: closing boot module globus
n-1<31169> ssi:boot:select: selected boot module rsh
n-1<31169> ssi:boot:send_lamd: getting node ID from command line
n-1<31169> ssi:boot:send_lamd: getting agent haddr from command line
n-1<31169> ssi:boot:send_lamd: getting agent port from command line
n-1<31169> ssi:boot:send_lamd: getting node ID from command line
n-1<31169> ssi:boot:send_lamd: connecting to 128.59.145.83:46722, node id 0
n-1<31169> ssi:boot:send_lamd: sending dli_port 47668
n-1<31166> ssi:boot:base:server: got connection from 128.59.145.83
n-1<31166> ssi:boot:base:server: this connection is expected (n0)
n-1<31166> ssi:boot:base:server: remote lamd is at 128.59.145.83:47668
n-1<31166> ssi:boot:base:linear: booting n1 (10.0.0.2)
n-1<31166> ssi:boot:rsh: starting lamd on (10.0.0.2)
n-1<31166> ssi:boot:rsh: starting on n1 (10.0.0.2): hboot -t -c lam-conf.lamd -d -s -I "-H 128.59.145.83 -P 46722 -n 1 -o 0"
n-1<31166> ssi:boot:rsh: launching remotely
n-1<31166> ssi:boot:rsh: attempting to execute: /usr/bin/rsh 10.0.0.2 -n 'echo $SHELL'
n-1<31166> ssi:boot:rsh: remote shell /bin/bash
n-1<31166> ssi:boot:rsh: attempting to execute: /usr/bin/rsh 10.0.0.2 -n hboot -t -c lam-conf.lamd -d -s -I '"-H 128.59.145.83 -P 46722 -n 1 -o 0"'
tkill: setting prefix to (null)
LAM 7.1.4/MPI 2 C++/ROMIO - Indiana University
tkill: setting suffix to (null)
tkill: got killname back: /tmp/lam-xuejun@node2/lam-killfile
tkill: f_kill = "/tmp/lam-xuejun@node2/lam-killfile"
tkill: nothing to kill: "/tmp/lam-xuejun@node2/lam-killfile"
hboot: performing tkill
hboot: tkill -d
hboot: booting...
hboot: fork /usr/local/mpi/bin/lamd
[1] 6364 lamd -H 128.59.145.83 -P 46722 -n 1 -o 0 -d
n-1<31166> ssi:boot:rsh: successfully launched on n1 (10.0.0.2)
n-1<31166> ssi:boot:base:server: expecting connection from finite list
-----------------------------------------------------------------------------
The lamboot agent timed out while waiting for the newly-booted process
to call back and indicated that it had successfully booted.

*** PLEASE READ THIS ENTIRE MESSAGE, FOLLOW ITS SUGGESTIONS, AND
*** CONSULT THE "BOOTING LAM" SECTION OF THE LAM/MPI FAQ
*** (http://www.lam-mpi.org/faq/) BEFORE POSTING TO THE LAM/MPI USER'S
*** MAILING LIST.

As far as LAM could tell, the remote process started properly, but
then never called back. Possible reasons that this may happen:

    - There are network filters between the lamboot agent host and
      the remote host such that communication on random TCP ports
      is blocked
    - Network routing from the remote host to the local host isn't
      properly configured (this is uncommon)

You can check these things by watching the output from "lamboot -d".

1. On the command line for hboot, there are two important parameters:
   one is the IP address of where the lamboot agent was invoked, the
   other is the port number that the lamboot agent is expecting the
   newly-booted process to call back on (this will be a random
   integer).

2. Manually login to the remote machine and try to telnet to the port
   indicated on the hboot command line. For example,
       telnet
   If all goes well, you should get a "Connection refused" error. If
   you get any other kind of error, it could indicate either of the
   two conditions above. Consult with your system/network
   administrator.
-----------------------------------------------------------------------------
n-1<31166> ssi:boot:base:server: failed to connect to remote lamd!
n-1<31166> ssi:boot:base:server: closing server socket
n-1<31166> ssi:boot:base:linear: aborted!
n-1<31179> ssi:boot:open: opening
n-1<31179> ssi:boot:open: opening boot module globus
n-1<31179> ssi:boot:open: opened boot module globus
n-1<31179> ssi:boot:open: opening boot module rsh
n-1<31179> ssi:boot:open: opened boot module rsh
n-1<31179> ssi:boot:open: opening boot module slurm
n-1<31179> ssi:boot:open: opened boot module slurm
n-1<31179> ssi:boot:open: opening boot module tm
n-1<31179> ssi:boot:open: opened boot module tm
n-1<31179> ssi:boot:select: initializing boot module tm
n-1<31179> ssi:boot:tm: not running under PBS
n-1<31179> ssi:boot:select: boot module not available: tm
n-1<31179> ssi:boot:select: initializing boot module slurm
n-1<31179> ssi:boot:slurm: not running under SLURM
n-1<31179> ssi:boot:select: boot module not available: slurm
n-1<31179> ssi:boot:select: initializing boot module rsh
n-1<31179> ssi:boot:rsh: module initializing
n-1<31179> ssi:boot:rsh:agent: /usr/bin/rsh
n-1<31179> ssi:boot:rsh:username:
n-1<31179> ssi:boot:rsh:verbose: 1000
n-1<31179> ssi:boot:rsh:algorithm: linear
n-1<31179> ssi:boot:rsh:no_n: 0
n-1<31179> ssi:boot:rsh:no_profile: 0
n-1<31179> ssi:boot:rsh:fast: 0
n-1<31179> ssi:boot:rsh:ignore_stderr: 0
n-1<31179> ssi:boot:rsh:priority: 10
n-1<31179> ssi:boot:select: boot module available: rsh, priority: 10
n-1<31179> ssi:boot:select: initializing boot module globus
n-1<31179> ssi:boot:globus: globus-job-run not found, globus boot will not run
n-1<31179> ssi:boot:select: boot module not available: globus
n-1<31179> ssi:boot:select: finalizing boot module tm
n-1<31179> ssi:boot:tm: finalizing
n-1<31179> ssi:boot:select: closing boot module tm
n-1<31179> ssi:boot:select: finalizing boot module slurm
n-1<31179> ssi:boot:slurm: finalizing
n-1<31179> ssi:boot:select: closing boot module slurm
n-1<31179> ssi:boot:select: finalizing boot module globus
n-1<31179> ssi:boot:globus: finalizing
n-1<31179> ssi:boot:select: closing boot module globus
n-1<31179> ssi:boot:select: selected boot module rsh
n-1<31179> ssi:boot:base: looking for boot schema in following directories:
n-1<31179> ssi:boot:base:
n-1<31179> ssi:boot:base: $TROLLIUSHOME/etc
n-1<31179> ssi:boot:base: $LAMHOME/etc
n-1<31179> ssi:boot:base: /usr/local/lam-7.1.4/etc
n-1<31179> ssi:boot:base: looking for boot schema file:
n-1<31179> ssi:boot:base: hostfile
n-1<31179> ssi:boot:base: found boot schema: hostfile
n-1<31179> ssi:boot:rsh: found the following hosts:
n-1<31179> ssi:boot:rsh: n0 reconcluster.bme.columbia.edu (cpu=1)
n-1<31179> ssi:boot:rsh: n1 10.0.0.2 (cpu=1)
n-1<31179> ssi:boot:rsh: resolved hosts:
n-1<31179> ssi:boot:rsh: n0 reconcluster.bme.columbia.edu --> 128.59.145.83 (origin)
n-1<31179> ssi:boot:rsh: n1 10.0.0.2 --> 10.0.0.2
n-1<31179> ssi:boot:rsh: starting RTE procs
n-1<31179> ssi:boot:base:linear: starting
n-1<31179> ssi:boot:base:linear: booting n0 (reconcluster.bme.columbia.edu)
n-1<31179> ssi:boot:rsh: starting wipe on (reconcluster.bme.columbia.edu)
n-1<31179> ssi:boot:rsh: starting on n0 (reconcluster.bme.columbia.edu): tkill -d
n-1<31179> ssi:boot:rsh: launching locally
tkill: setting prefix to (null)
tkill: setting suffix to (null)
tkill: got killname back: /tmp/lam-xuejun@reconcluster/lam-killfile
tkill: f_kill = "/tmp/lam-xuejun@reconcluster/lam-killfile"
tkill: killing LAM...
tkill: killing PID (SIGHUP) 31169 ...
tkill: killed
tkill: removing socket file ...
tkill: socket file: /tmp/lam-xuejun@reconcluster/lam-kernel-socketd
tkill: removing IO daemon socket file ...
tkill: IO daemon socket file: /tmp/lam-xuejun@reconcluster/lam-io-socket
tkill: all finished
n-1<31179> ssi:boot:rsh: successfully launched on n0 (reconcluster.bme.columbia.edu)
n-1<31179> ssi:boot:base:linear: booting n1 (10.0.0.2)
n-1<31179> ssi:boot:rsh: starting wipe on (10.0.0.2)
n-1<31179> ssi:boot:rsh: starting on n1 (10.0.0.2): tkill -d
n-1<31179> ssi:boot:rsh: launching remotely
n-1<31179> ssi:boot:rsh: attempting to execute: /usr/bin/rsh 10.0.0.2 -n 'echo $SHELL'
n-1<31179> ssi:boot:rsh: remote shell /bin/bash
n-1<31179> ssi:boot:rsh: attempting to execute: /usr/bin/rsh 10.0.0.2 -n tkill -d
tkill: setting prefix to (null)
tkill: setting suffix to (null)
tkill: got killname back: /tmp/lam-xuejun@node2/lam-killfile
tkill: f_kill = "/tmp/lam-xuejun@node2/lam-killfile"
tkill: killing LAM...
tkill: killing PID (SIGHUP) 6364 ...
tkill: already dead
tkill: removing socket file ...
tkill: socket file: /tmp/lam-xuejun@node2/lam-kernel-socketd
tkill: removing IO daemon socket file ...
tkill: IO daemon socket file: /tmp/lam-xuejun@node2/lam-io-socket
tkill: all finished
n-1<31179> ssi:boot:rsh: successfully launched on n1 (10.0.0.2)
n-1<31179> ssi:boot:base:linear: finished
n-1<31179> ssi:boot:rsh: all RTE procs started
n-1<31179> ssi:boot:rsh: finalizing
n-1<31179> ssi:boot: Closing
lamboot did NOT complete successfully
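
The timeout message in the log suggests logging in to the remote machine and trying to reach the lamboot agent on the address and port given on the hboot command line (here `-H 128.59.145.83 -P 46722`, run from 10.0.0.2). Below is a rough Python sketch of that same check, for when telnet is not installed on the node. The helper names `parse_callback_target` and `probe` are made up for this illustration and are not part of LAM/MPI. It assumes Python 3 is available on the remote node.

```python
import re
import socket
import sys


def parse_callback_target(hboot_cmd):
    """Pull the -H address and -P port out of an hboot command line.

    Hypothetical helper: it only pattern-matches the text that
    "lamboot -d" prints, it does not call into LAM in any way.
    """
    host = re.search(r"-H\s+(\S+)", hboot_cmd)
    port = re.search(r"-P\s+(\d+)", hboot_cmd)
    if not host or not port:
        raise ValueError("no -H/-P pair found in command line")
    return host.group(1), int(port.group(1))


def probe(host, port, timeout=5.0):
    """Try one TCP connection and report what happened as a short string."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "connected"
    except ConnectionRefusedError:
        return "refused"
    except socket.timeout:
        return "timed out"
    except OSError as exc:
        # e.g. "No route to host" or "Network is unreachable"
        return "error: {}".format(exc)


if __name__ == "__main__":
    # Usage (on the remote node): python probe.py '<hboot command line>'
    target_host, target_port = parse_callback_target(sys.argv[1])
    print(target_host, target_port, probe(target_host, target_port))
```

Going by the message above, "refused" is the good outcome once lamboot has exited (the host is reachable and nothing is listening any more), and "connected" while lamboot is still waiting is fine as well. A "timed out" or a routing error would point at the filtering or routing problems the message lists.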