
LAM/MPI General User's Mailing List Archives

From: Yu-Cheng Chou (cycchou_at_[hidden])
Date: 2004-12-16 22:19:48


Hi there,
Below is the error output I get when I run the lamboot command.
Does anyone have an idea how to fix this booting problem?

$ lamboot -d machines
n-1<1856> ssi:boot: Opening
n-1<1856> ssi:boot: opening module globus
n-1<1856> ssi:boot: initializing module globus
n-1<1856> ssi:boot:globus: globus-job-run not found, globus boot will not run
n-1<1856> ssi:boot: module not available: globus
n-1<1856> ssi:boot: opening module rsh
n-1<1856> ssi:boot: initializing module rsh
n-1<1856> ssi:boot:rsh: module initializing
n-1<1856> ssi:boot:rsh:agent: ssh -x
n-1<1856> ssi:boot:rsh:username: <same>
n-1<1856> ssi:boot:rsh:verbose: 1000
n-1<1856> ssi:boot:rsh:algorithm: linear
n-1<1856> ssi:boot:rsh:priority: 10
n-1<1856> ssi:boot: module available: rsh, priority: 10
n-1<1856> ssi:boot: finalizing module globus
n-1<1856> ssi:boot:globus: finalizing
n-1<1856> ssi:boot: closing module globus
n-1<1856> ssi:boot: Selected boot module rsh

LAM 7.0.6/MPI 2 C++ - Indiana University

n-1<1856> ssi:boot:base: looking for boot schema in following directories:
n-1<1856> ssi:boot:base: <current directory>
n-1<1856> ssi:boot:base: $TROLLIUSHOME/etc
n-1<1856> ssi:boot:base: $LAMHOME/etc
n-1<1856> ssi:boot:base: /usr/local/lam-7.0.6/etc
n-1<1856> ssi:boot:base: looking for boot schema file:
n-1<1856> ssi:boot:base: machines
n-1<1856> ssi:boot:base: found boot schema: machines
n-1<1856> ssi:boot:rsh: found the following hosts:
n-1<1856> ssi:boot:rsh: n0 snake.engr.ucdavis.edu (cpu=1)
n-1<1856> ssi:boot:rsh: n1 bird1.engr.ucdavis.edu (cpu=1)
n-1<1856> ssi:boot:rsh: resolved hosts:
n-1<1856> ssi:boot:rsh: n0 snake.engr.ucdavis.edu --> 169.237.108.56 (origin)
n-1<1856> ssi:boot:rsh: n1 bird1.engr.ucdavis.edu --> 169.237.108.59
n-1<1856> ssi:boot:rsh: starting RTE procs
n-1<1856> ssi:boot:base:linear: starting
n-1<1856> ssi:boot:base:server: opening server TCP socket
n-1<1856> ssi:boot:base:server: opened port 3119
n-1<1856> ssi:boot:base:linear: booting n0 (snake.engr.ucdavis.edu)
n-1<1856> ssi:boot:rsh: starting lamd on (snake.engr.ucdavis.edu)
n-1<1856> ssi:boot:rsh: starting on n0 (snake.engr.ucdavis.edu): hboot -t -c lam-conf.lamd -d -I -H 169.237.108.56 -P 3119 -n 0 -o 0
n-1<1856> ssi:boot:rsh: launching locally
hboot: performing tkill
hboot: tkill -d
tkill: setting prefix to (null)
tkill: setting suffix to (null)
tkill: got killname back: /tmp/lam-ycchou_at_snake/lam-killfile
tkill: removing socket file ...
tkill: socket file: /tmp/lam-ycchou_at_snake/lam-kernel-socketd
tkill: removing IO daemon socket file ...
tkill: IO daemon socket file: /tmp/lam-ycchou_at_snake/lam-io-socket
tkill: f_kill = "/tmp/lam-ycchou_at_snake/lam-killfile"
tkill: nothing to kill: "/tmp/lam-ycchou_at_snake/lam-killfile"
hboot: booting...
hboot: fork /usr/local/lam-7.0.6/bin/lamd
[1] 2188 lamd -H 169.237.108.56 -P 3119 -n 0 -o 0 -d
hboot: attempting to execute
n-1<1856> ssi:boot:rsh: successfully launched on n0 (snake.engr.ucdavis.edu)
n-1<1856> ssi:boot:base:server: expecting connection from finite list
n-1<2188> ssi:boot: Opening
n-1<2188> ssi:boot: opening module globus
n-1<2188> ssi:boot: initializing module globus
n-1<2188> ssi:boot:globus: globus-job-run not found, globus boot will not run
n-1<2188> ssi:boot: module not available: globus
n-1<2188> ssi:boot: opening module rsh
n-1<2188> ssi:boot: initializing module rsh
n-1<2188> ssi:boot:rsh: module initializing
n-1<2188> ssi:boot:rsh:agent: ssh -x
n-1<2188> ssi:boot:rsh:username: <same>
n-1<2188> ssi:boot:rsh:verbose: 1000
n-1<2188> ssi:boot:rsh:algorithm: linear
n-1<2188> ssi:boot:rsh:priority: 10
n-1<2188> ssi:boot: module available: rsh, priority: 10
n-1<2188> ssi:boot: finalizing module globus
n-1<2188> ssi:boot:globus: finalizing
n-1<2188> ssi:boot: closing module globus
n-1<2188> ssi:boot: Selected boot module rsh
n-1<1856> ssi:boot:base:server: got connection from 169.237.108.56
n-1<1856> ssi:boot:base:server: this connection is expected (n0)
n-1<1856> ssi:boot:base:server: remote lamd is at 169.237.108.56:3148
n-1<1856> ssi:boot:base:linear: booting n1 (bird1.engr.ucdavis.edu)
n-1<1856> ssi:boot:rsh: starting lamd on (bird1.engr.ucdavis.edu)
n-1<1856> ssi:boot:rsh: starting on n1 (bird1.engr.ucdavis.edu): hboot -t -c lam-conf.lamd -d -s -I "-H 169.237.108.56 -P 3119 -n 1 -o 0"
n-1<1856> ssi:boot:rsh: launching remotely
n-1<1856> ssi:boot:rsh: attempting to execute "ssh -x bird1.engr.ucdavis.edu -n echo $SHELL"
n-1<1856> ssi:boot:rsh: remote shell /bin/ch
n-1<1856> ssi:boot:rsh: attempting to execute "ssh -x bird1.engr.ucdavis.edu -n hboot -t -c lam-conf.lamd -d -s -I "-H 169.237.108.56 -P 3119 -n 1 -o 0""
tkill: setting prefix to (null)
tkill: setting suffix to (null)
tkill: got killname back: /tmp/lam-ycchou_at_bird1/lam-killfile
tkill: removing socket file ...
tkill: socket file: /tmp/lam-ycchou_at_bird1/lam-kernel-socketd
tkill: removing IO daemon socket file ...
tkill: IO daemon socket file: /tmp/lam-ycchou_at_bird1/lam-io-socket
tkill: f_kill = "/tmp/lam-ycchou_at_bird1/lam-killfile"
tkill: nothing to kill: "/tmp/lam-ycchou_at_bird1/lam-killfile"
hboot: performing tkill
hboot: tkill -d
hboot: booting...
hboot: fork /home/ycchou/lam-7.0.6/bin/lamd
[1] 30933 lamd -H 169.237.108.56 -P 3119 -n 1 -o 0 -d
n-1<1856> ssi:boot:rsh: successfully launched on n1 (bird1.engr.ucdavis.edu)
n-1<1856> ssi:boot:base:server: expecting connection from finite list
n-1<1856> ssi:boot:base:server: got connection from 169.237.108.59
n-1<1856> ssi:boot:base:server: this connection is expected (n1)
n-1<1856> ssi:boot:base:server: remote lamd is at 169.237.108.59:32851
n-1<1856> ssi:boot:base:server: closing server socket
n-1<1856> ssi:boot:base:server: connecting to lamd at 169.237.108.56:3149
n-1<1856> ssi:boot:base:server: connected
n-1<1856> ssi:boot:base:server: sending number of links (2)
n-1<1856> ssi:boot:base:server: sending info: n0 (snake.engr.ucdavis.edu)
n-1<1856> ssi:boot:base:server: sending info: n1 (bird1.engr.ucdavis.edu)
n-1<2188> ssi:boot:rsh: finalizing
n-1<2188> ssi:boot: Closing
n-1<1856> ssi:boot:base:server: finished sending
n-1<1856> ssi:boot:base:server: disconnected from 169.237.108.56:3149
n-1<1856> ssi:boot:base:server: connecting to lamd at 169.237.108.59:32974
-----------------------------------------------------------------------------
The lamboot agent failed to open a client socket to the newly-booted
process at IP address 169.237.108.59, port 32974.
Although the newly-booted process has already communicated
successfully with the lamboot agent over other TCP sockets, this is
the first time that the lamboot agent tried to initiate a connection
to the newly-booted process.  As such, this may indicate:
        1. 169.237.108.59 is not the correct IP address for the machine
           where the newly-booted machine was launched
        2. There are network filters between the lamboot agent host and
           the remote host such that communication on random TCP ports
           is blocked
        3. Network routing from the the local host to the remote isn't
           properly configured (this is unlikely)
For number 1, check to ensure that 169.237.108.59 is the correct IP
address for that machine.  If it is not, check the host mapping on that
machine (e.g., /etc/hosts) to ensure that 169.237.108.59 is both
reachable and is the by the host where the lamboot agent is running,
and is the correct host.
For numbers 2 and 4, try to telnet to 169.237.108.59, port 32974.  You
should get a "connection refused" error, which will indicate that you
successfully connected to some machine at that IP address, and no
process was listening on that port.  If you get any other kind of
error, check with your system/network administrator -- it may indicate
network / routing issues between the two hosts.
-----------------------------------------------------------------------------
n-1<1856> ssi:boot:base:linear: aborted!
-----------------------------------------------------------------------------
lamboot encountered some error (see above) during the boot process,
and will now attempt to kill all nodes that it was previously able to
boot (if any).
Please wait for LAM to finish; if you interrupt this process, you may
have LAM daemons still running on remote nodes.
-----------------------------------------------------------------------------
n-1<1740> ssi:boot: Opening
n-1<1740> ssi:boot: opening module globus
n-1<1740> ssi:boot: initializing module globus
n-1<1740> ssi:boot:globus: globus-job-run not found, globus boot will not run
n-1<1740> ssi:boot: module not available: globus
n-1<1740> ssi:boot: opening module rsh
n-1<1740> ssi:boot: initializing module rsh
n-1<1740> ssi:boot:rsh: module initializing
n-1<1740> ssi:boot:rsh:agent: ssh -x
n-1<1740> ssi:boot:rsh:username: <same>
n-1<1740> ssi:boot:rsh:verbose: 1000
n-1<1740> ssi:boot:rsh:algorithm: linear
n-1<1740> ssi:boot:rsh:priority: 10
n-1<1740> ssi:boot: module available: rsh, priority: 10
n-1<1740> ssi:boot: finalizing module globus
n-1<1740> ssi:boot:globus: finalizing
n-1<1740> ssi:boot: closing module globus
n-1<1740> ssi:boot: Selected boot module rsh
n-1<1740> ssi:boot:base: looking for boot schema in following directories:
n-1<1740> ssi:boot:base:   <current directory>
n-1<1740> ssi:boot:base:   $TROLLIUSHOME/etc
n-1<1740> ssi:boot:base:   $LAMHOME/etc
n-1<1740> ssi:boot:base:   /usr/local/lam-7.0.6/etc
n-1<1740> ssi:boot:base: looking for boot schema file:
n-1<1740> ssi:boot:base:   machines
n-1<1740> ssi:boot:base: found boot schema: machines
n-1<1740> ssi:boot:rsh: found the following hosts:
n-1<1740> ssi:boot:rsh:   n0 snake.engr.ucdavis.edu (cpu=1)
n-1<1740> ssi:boot:rsh:   n1 bird1.engr.ucdavis.edu (cpu=1)
n-1<1740> ssi:boot:rsh: resolved hosts:
n-1<1740> ssi:boot:rsh:   n0 snake.engr.ucdavis.edu --> 169.237.108.56 (origin)
n-1<1740> ssi:boot:rsh:   n1 bird1.engr.ucdavis.edu --> 169.237.108.59
n-1<1740> ssi:boot:rsh: starting RTE procs
n-1<1740> ssi:boot:base:linear: starting
n-1<1740> ssi:boot:base:linear: booting n0 (snake.engr.ucdavis.edu)
n-1<1740> ssi:boot:rsh: starting wipe on (snake.engr.ucdavis.edu)
n-1<1740> ssi:boot:rsh: starting on n0 (snake.engr.ucdavis.edu): tkill -d
n-1<1740> ssi:boot:rsh: launching locally
tkill: setting prefix to (null)
tkill: setting suffix to (null)
tkill: got killname back: /tmp/lam-ycchou_at_snake/lam-killfile
tkill: removing socket file ...
tkill: socket file: /tmp/lam-ycchou_at_snake/lam-kernel-socketd
tkill: removing IO daemon socket file ...
tkill: IO daemon socket file: /tmp/lam-ycchou_at_snake/lam-io-socket
tkill: f_kill = "/tmp/lam-ycchou_at_snake/lam-killfile"
tkill: killing LAM...
tkill: killing PID (SIGHUP) 2188 ...
tkill: killed
tkill: all finished
n-1<1740> ssi:boot:rsh: successfully launched on n0 (snake.engr.ucdavis.edu)
n-1<1740> ssi:boot:base:linear: booting n1 (bird1.engr.ucdavis.edu)
n-1<1740> ssi:boot:rsh: starting wipe on (bird1.engr.ucdavis.edu)
n-1<1740> ssi:boot:rsh: starting on n1 (bird1.engr.ucdavis.edu): tkill -d
n-1<1740> ssi:boot:rsh: launching remotely
n-1<1740> ssi:boot:rsh: attempting to execute "ssh -x bird1.engr.ucdavis.edu -n echo $SHELL"
n-1<1740> ssi:boot:rsh: remote shell /bin/ch
n-1<1740> ssi:boot:rsh: attempting to execute "ssh -x bird1.engr.ucdavis.edu -n tkill -d"
tkill: setting prefix to (null)
tkill: setting suffix to (null)
tkill: got killname back: /tmp/lam-ycchou_at_bird1/lam-killfile
tkill: removing socket file ...
tkill: socket file: /tmp/lam-ycchou_at_bird1/lam-kernel-socketd
tkill: removing IO daemon socket file ...
tkill: IO daemon socket file: /tmp/lam-ycchou_at_bird1/lam-io-socket
tkill: f_kill = "/tmp/lam-ycchou_at_bird1/lam-killfile"
tkill: killing LAM...
tkill: killing PID (SIGHUP) 30933 ...
tkill: killed
tkill: all finished
n-1<1740> ssi:boot:rsh: successfully launched on n1 (bird1.engr.ucdavis.edu)
n-1<1740> ssi:boot:base:linear: finished
n-1<1740> ssi:boot:rsh: all RTE procs started
n-1<1740> ssi:boot:rsh: finalizing
n-1<1740> ssi:boot: Closing
lamboot did NOT complete successfully
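
The lamboot error text above suggests trying to telnet to the remote address and port, and says that a "connection refused" reply means the machine is reachable and simply has no process listening on that port. If telnet is not handy, the same check can be sketched in a few lines of Python. This is only an illustration: the function name probe_tcp and the 5-second timeout are my own choices, not part of LAM. Note that lamd picks a new random port on every boot, so the port from an old log will not be in use later.

```python
import errno
import socket


def probe_tcp(host, port, timeout=5.0):
    """Attempt one TCP connection to host:port and classify the outcome.

    Hypothetical helper (not part of LAM). Returns one of:
      "connected"  - something accepted the connection
      "refused"    - host reachable, nothing listening on that port
      "timeout"    - no answer within `timeout` seconds (often a filter
                     silently dropping packets)
      "error: ..." - any other failure (routing, name resolution, ...)
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "connected"
    except ConnectionRefusedError:
        return "refused"
    except socket.timeout:
        return "timeout"
    except OSError as exc:
        # Map the errno to its symbolic name when one is available.
        name = errno.errorcode.get(exc.errno) if exc.errno else None
        return "error: " + (name or str(exc))
```

Run from snake, probe_tcp("169.237.108.59", 32974) plays the role of the telnet test. Going by the lamboot message, "refused" is the expected result once the daemons have been killed, while "timeout" or another error would point toward filtering or routing between the two hosts.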