LAM/MPI General User's Mailing List Archives

From: Marcelo Barreiro (barreiro_at_[hidden])
Date: 2005-07-21 13:47:55


Hello,

I have LAM 7.0.6 installed and I am trying to run lamboot from 'labrador' with the following hostfile:

labrador.princeton.edu
storm.princeton.edu cpu=2

I have set LAMRSH to "ssh -x".

I ran 'lamboot -d -v hostfile' and got the output listed below, which shows that lamboot did not complete.
There is an email on the list about a similar problem, but I don't know how it was resolved.

I followed the suggestion in the error message: I logged in manually to the remote machine (storm) and ran:
> telnet 128.112.176.29 33559
Trying 128.112.176.29...
telnet: Unable to connect to remote host: Connection refused
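For anyone who wants to repeat this check without telnet, here is a minimal Python sketch of the same test. The helper name probe_port is hypothetical and not part of LAM; the address and port are the ones from the hboot command line in this report. It only distinguishes the outcomes of a single TCP connection attempt. Per the lamboot help text, "Connection refused" is the expected result when nothing is listening on the port any more, while a timeout or another error is more suggestive of filtering or routing trouble.

```python
import errno
import socket


def probe_port(host, port, timeout=5.0):
    """Classify the result of one TCP connection attempt to host:port."""
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        # The host answered, but nothing is listening on that port.
        return "refused"
    except socket.timeout:
        # No answer at all within the timeout.
        return "timed out"
    except OSError as exc:
        # Anything else, for example "no route to host".
        return "error: " + errno.errorcode.get(exc.errno, str(exc))


if __name__ == "__main__":
    # Run this on the remote machine (storm) while lamboot is still waiting.
    print(probe_port("128.112.176.29", 33559))
```

The result is only meaningful while lamboot is still waiting for the callback; once it has timed out and closed its server socket, "refused" is what any healthy network would report.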

So it seems to be a firewall problem, with the ports closed. Is there a range of ports that can be opened for LAM/MPI to work?
Thank you,

Marcelo

Lamboot message:

n-1<26923> ssi:boot: Opening
n-1<26923> ssi:boot: opening module globus
n-1<26923> ssi:boot: initializing module globus
n-1<26923> ssi:boot:globus: globus-job-run not found, globus boot will not run
n-1<26923> ssi:boot: module not available: globus
n-1<26923> ssi:boot: opening module rsh
n-1<26923> ssi:boot: initializing module rsh
n-1<26923> ssi:boot:rsh: module initializing
n-1<26923> ssi:boot:rsh:agent: ssh -x
n-1<26923> ssi:boot:rsh:username: <same>
n-1<26923> ssi:boot:rsh:verbose: 1000
n-1<26923> ssi:boot:rsh:algorithm: linear
n-1<26923> ssi:boot:rsh:priority: 10
n-1<26923> ssi:boot: module available: rsh, priority: 10
n-1<26923> ssi:boot: opening module tm
n-1<26923> ssi:boot: initializing module tm
n-1<26923> ssi:boot:tm: not running under PBS
n-1<26923> ssi:boot: module not available: tm
n-1<26923> ssi:boot: finalizing module globus
n-1<26923> ssi:boot:globus: finalizing
n-1<26923> ssi:boot: closing module globus
n-1<26923> ssi:boot: finalizing module tm
n-1<26923> ssi:boot:tm: finalizing
n-1<26923> ssi:boot: closing module tm
n-1<26923> ssi:boot: Selected boot module rsh
 
LAM 7.0.6/MPI 2 C++/ROMIO - Indiana University
 
n-1<26923> ssi:boot:base: looking for boot schema in following directories:
n-1<26923> ssi:boot:base: <current directory>
n-1<26923> ssi:boot:base: $TROLLIUSHOME/etc
n-1<26923> ssi:boot:base: $LAMHOME/etc
n-1<26923> ssi:boot:base: /home/splash/marcelo/bin/lam7_bin_pgi/etc
n-1<26923> ssi:boot:base: looking for boot schema file:
n-1<26923> ssi:boot:base: hostfile
n-1<26923> ssi:boot:base: found boot schema: hostfile
n-1<26923> ssi:boot:rsh: found the following hosts:
n-1<26923> ssi:boot:rsh: n0 labrador.princeton.edu (cpu=1)
n-1<26923> ssi:boot:rsh: n1 storm.princeton.edu (cpu=2)
n-1<26923> ssi:boot:rsh: resolved hosts:
n-1<26923> ssi:boot:rsh: n0 labrador.princeton.edu --> 128.112.176.29 (origin)
n-1<26923> ssi:boot:rsh: n1 storm.princeton.edu --> 128.112.177.153
n-1<26923> ssi:boot:rsh: starting RTE procs
n-1<26923> ssi:boot:base:linear: starting
n-1<26923> ssi:boot:base:server: opening server TCP socket
n-1<26923> ssi:boot:base:server: opened port 33559
n-1<26923> ssi:boot:base:linear: booting n0 (labrador.princeton.edu)
n-1<26923> ssi:boot:rsh: starting lamd on (labrador.princeton.edu)
n-1<26923> ssi:boot:rsh: starting on n0 (labrador.princeton.edu): hboot -t -c lam-conf.lamd -d -v -I -H 128.112.176.29 -P 33559 -n 0 -o 0
n-1<26923> ssi:boot:rsh: launching locally
hboot: performing tkill
hboot: tkill -d
tkill: setting prefix to (null)
tkill: setting suffix to (null)
tkill: got killname back: /tmp/lam-marcelo_at_[hidden]/lam-killfile
tkill: removing socket file ...
tkill: socket file: /tmp/lam-marcelo_at_[hidden]/lam-kernel-socketd
tkill: removing IO daemon socket file ...
tkill: IO daemon socket file: /tmp/lam-marcelo_at_[hidden]/lam-io-socket
tkill: f_kill = "/tmp/lam-marcelo_at_[hidden]/lam-killfile"
tkill: nothing to kill: "/tmp/lam-marcelo_at_[hidden]/lam-killfile"
hboot: booting...
hboot: fork /home/splash/marcelo/bin/lam7_bin_pgi/bin/lamd
hboot: attempting to execute
[1] 26926 lamd -H 128.112.176.29 -P 33559 -n 0 -o 0 -d
n-1<26923> ssi:boot:rsh: successfully launched on n0 (labrador.princeton.edu)
n-1<26923> ssi:boot:base:server: expecting connection from finite list
n-1<26926> ssi:boot: Opening
n-1<26926> ssi:boot: opening module globus
n-1<26926> ssi:boot: initializing module globus
n-1<26926> ssi:boot:globus: globus-job-run not found, globus boot will not run
n-1<26926> ssi:boot: module not available: globus
n-1<26926> ssi:boot: opening module rsh
n-1<26926> ssi:boot: initializing module rsh
n-1<26926> ssi:boot:rsh: module initializing
n-1<26926> ssi:boot:rsh:agent: ssh -x
n-1<26926> ssi:boot:rsh:username: <same>
n-1<26926> ssi:boot:rsh:verbose: 1000
n-1<26926> ssi:boot:rsh:algorithm: linear
n-1<26926> ssi:boot:rsh:priority: 10
n-1<26926> ssi:boot: module available: rsh, priority: 10
n-1<26926> ssi:boot: opening module tm
n-1<26926> ssi:boot: initializing module tm
n-1<26926> ssi:boot:tm: not running under PBS
n-1<26926> ssi:boot: module not available: tm
n-1<26926> ssi:boot: finalizing module globus
n-1<26926> ssi:boot:globus: finalizing
n-1<26926> ssi:boot: closing module globus
n-1<26926> ssi:boot: finalizing module tm
n-1<26926> ssi:boot:tm: finalizing
n-1<26926> ssi:boot: closing module tm
n-1<26926> ssi:boot: Selected boot module rsh
n-1<26923> ssi:boot:base:server: got connection from 128.112.176.29
n-1<26923> ssi:boot:base:server: this connection is expected (n0)
n-1<26923> ssi:boot:base:server: remote lamd is at 128.112.176.29:33247
n-1<26923> ssi:boot:base:linear: booting n1 (storm.princeton.edu)
n-1<26923> ssi:boot:rsh: starting lamd on (storm.princeton.edu)
n-1<26923> ssi:boot:rsh: starting on n1 (storm.princeton.edu): hboot -t -c lam-conf.lamd -d -v -s -I "-H 128.112.176.29 -P 33559 -n 1 -o 0"
n-1<26923> ssi:boot:rsh: launching remotely
n-1<26923> ssi:boot:rsh: attempting to execute "ssh -x storm.princeton.edu -n echo $SHELL"
n-1<26923> ssi:boot:rsh: remote shell appending ferret path
/bin/tcsh
n-1<26923> ssi:boot:rsh: attempting to execute "ssh -x storm.princeton.edu -n hboot -t -c lam-conf.lamd -d -v -s -I "-H 128.112.176.29 -P 33559 -n 1 -o 0""
appending ferret path
tkill: setting prefix to (null)
tkill: setting suffix to (null)
tkill: got killname back: /tmp/lam-marcelo_at_[hidden]/lam-killfile
tkill: removing socket file ...
tkill: socket file: /tmp/lam-marcelo_at_[hidden]/lam-kernel-socketd
tkill: removing IO daemon socket file ...
tkill: IO daemon socket file: /tmp/lam-marcelo_at_[hidden]/lam-io-socket
tkill: f_kill = "/tmp/lam-marcelo_at_[hidden]/lam-killfile"
tkill: nothing to kill: "/tmp/lam-marcelo_at_[hidden]/lam-killfile"
hboot: performing tkill
hboot: tkill -d
hboot: booting...
hboot: fork /home/splash/marcelo/bin/lam7_bin_pgi/bin/lamd
[1] 31378 lamd -H 128.112.176.29 -P 33559 -n 1 -o 0 -d
n-1<26923> ssi:boot:rsh: successfully launched on n1 (storm.princeton.edu)
n-1<26923> ssi:boot:base:server: expecting connection from finite list
-----------------------------------------------------------------------------
The lamboot agent timed out while waiting for the newly-booted process
to call back and indicated that it had successfully booted.
 
As far as LAM could tell, the remote process started properly, but
then never called back. Possible reasons that this may happen:
 
        - There are network filters between the lamboot agent host and
          the remote host such that communication on random TCP ports
          is blocked
        - Network routing from the remote host to the local host isn't
          properly configured (this is uncommon)
 
You can check these things by watching the output from "lamboot -d".
 
1. On the command line for hboot, there are two important parameters:
   one is the IP address of where the lamboot agent was invoked, the
   other is the port number that the lamboot agent is expecting the
   newly-booted process to call back on (this will be a random
   integer).
 
2. Manually login to the remote machine and try to telnet to the port
   indicated on the hboot command line. For example,
       telnet <ipnumber> <portnumber>
   If all goes well, you should get a "Connection refused" error. If
   you get any other kind of error, it could indicate either of the
   two conditions above. Consult with your system/network
   administrator.
-----------------------------------------------------------------------------
n-1<26923> ssi:boot:base:server: failed to connect to remote lamd!
n-1<26923> ssi:boot:base:server: closing server socket
n-1<26923> ssi:boot:base:linear: aborted!
-----------------------------------------------------------------------------
lamboot encountered some error (see above) during the boot process,
and will now attempt to kill all nodes that it was previously able to
boot (if any).
 
Please wait for LAM to finish; if you interrupt this process, you may
have LAM daemons still running on remote nodes.
-----------------------------------------------------------------------------
n-1<26935> ssi:boot: Opening
n-1<26935> ssi:boot: opening module globus
n-1<26935> ssi:boot: initializing module globus
n-1<26935> ssi:boot:globus: globus-job-run not found, globus boot will not run
n-1<26935> ssi:boot: module not available: globus
n-1<26935> ssi:boot: opening module rsh
n-1<26935> ssi:boot: initializing module rsh
n-1<26935> ssi:boot:rsh: module initializing
n-1<26935> ssi:boot:rsh:agent: ssh -x
n-1<26935> ssi:boot:rsh:username: <same>
n-1<26935> ssi:boot:rsh:verbose: 1000
n-1<26935> ssi:boot:rsh:algorithm: linear
n-1<26935> ssi:boot:rsh:priority: 10
n-1<26935> ssi:boot: module available: rsh, priority: 10
n-1<26935> ssi:boot: opening module tm
n-1<26935> ssi:boot: initializing module tm
n-1<26935> ssi:boot:tm: not running under PBS
n-1<26935> ssi:boot: module not available: tm
n-1<26935> ssi:boot: finalizing module globus
n-1<26935> ssi:boot:globus: finalizing
n-1<26935> ssi:boot: closing module globus
n-1<26935> ssi:boot: finalizing module tm
n-1<26935> ssi:boot:tm: finalizing
n-1<26935> ssi:boot: closing module tm
n-1<26935> ssi:boot: Selected boot module rsh
n-1<26935> ssi:boot:base: looking for boot schema in following directories:
n-1<26935> ssi:boot:base: <current directory>
n-1<26935> ssi:boot:base: $TROLLIUSHOME/etc
n-1<26935> ssi:boot:base: $LAMHOME/etc
n-1<26935> ssi:boot:base: /home/splash/marcelo/bin/lam7_bin_pgi/etc
n-1<26935> ssi:boot:base: looking for boot schema file:
n-1<26935> ssi:boot:base: hostfile
n-1<26935> ssi:boot:base: found boot schema: hostfile
n-1<26935> ssi:boot:rsh: found the following hosts:
n-1<26935> ssi:boot:rsh: n0 labrador.princeton.edu (cpu=1)
n-1<26935> ssi:boot:rsh: n1 storm.princeton.edu (cpu=2)
n-1<26935> ssi:boot:rsh: resolved hosts:
n-1<26935> ssi:boot:rsh: n0 labrador.princeton.edu --> 128.112.176.29 (origin)
n-1<26935> ssi:boot:rsh: n1 storm.princeton.edu --> 128.112.177.153
n-1<26935> ssi:boot:rsh: starting RTE procs
n-1<26935> ssi:boot:base:linear: starting
n-1<26935> ssi:boot:base:linear: booting n0 (labrador.princeton.edu)
n-1<26935> ssi:boot:rsh: starting wipe on (labrador.princeton.edu)
n-1<26935> ssi:boot:rsh: starting on n0 (labrador.princeton.edu): tkill -d -v
n-1<26935> ssi:boot:rsh: launching locally
tkill: setting prefix to (null)
tkill: setting suffix to (null)
tkill: got killname back: /tmp/lam-marcelo_at_[hidden]/lam-killfile
tkill: removing socket file ...
tkill: socket file: /tmp/lam-marcelo_at_[hidden]/lam-kernel-socketd
tkill: removing IO daemon socket file ...
tkill: IO daemon socket file: /tmp/lam-marcelo_at_[hidden]/lam-io-socket
tkill: f_kill = "/tmp/lam-marcelo_at_[hidden]/lam-killfile"
tkill: killing LAM...
tkill: killing PID (SIGHUP) 26926 ...
tkill: killed
tkill: all finished
n-1<26935> ssi:boot:rsh: successfully launched on n0 (labrador.princeton.edu)
n-1<26935> ssi:boot:base:linear: booting n1 (storm.princeton.edu)
n-1<26935> ssi:boot:rsh: starting wipe on (storm.princeton.edu)
n-1<26935> ssi:boot:rsh: starting on n1 (storm.princeton.edu): tkill -d -v
n-1<26935> ssi:boot:rsh: launching remotely
n-1<26935> ssi:boot:rsh: attempting to execute "ssh -x storm.princeton.edu -n echo $SHELL"
n-1<26935> ssi:boot:rsh: remote shell appending ferret path
/bin/tcsh
n-1<26935> ssi:boot:rsh: attempting to execute "ssh -x storm.princeton.edu -n tkill -d -v"
appending ferret path
tkill: setting prefix to (null)
tkill: setting suffix to (null)
tkill: got killname back: /tmp/lam-marcelo_at_[hidden]/lam-killfile
tkill: removing socket file ...
tkill: socket file: /tmp/lam-marcelo_at_[hidden]/lam-kernel-socketd
tkill: removing IO daemon socket file ...
tkill: IO daemon socket file: /tmp/lam-marcelo_at_[hidden]/lam-io-socket
tkill: f_kill = "/tmp/lam-marcelo_at_[hidden]/lam-killfile"
tkill: killing LAM...
tkill: killing PID (SIGHUP) 31378 ...
tkill: already dead
tkill: all finished
n-1<26935> ssi:boot:rsh: successfully launched on n1 (storm.princeton.edu)
n-1<26935> ssi:boot:base:linear: finished
n-1<26935> ssi:boot:rsh: all RTE procs started
n-1<26935> ssi:boot:rsh: finalizing
n-1<26935> ssi:boot: Closing
lamboot did NOT complete successfully
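
The help text in the output above says that the hboot command line carries the address of the lamboot agent and the port it expects the callback on. As a rough sketch, assuming the "lamboot -d" output has been saved to a string, those two values could be pulled out like this. The function name callback_endpoints and the regular expression are my own illustration, not anything shipped with LAM, and the pattern only covers the "-H <ip> -P <port>" form seen in this log.

```python
import re

# Matches "hboot ... -H <dotted IPv4> -P <port>" as it appears in this log.
HBOOT_RE = re.compile(r"hboot\b.*?-H\s+(\d{1,3}(?:\.\d{1,3}){3})\s+-P\s+(\d+)")


def callback_endpoints(log_text):
    """Return the distinct (ip, port) pairs found on hboot command lines."""
    found = []
    for line in log_text.splitlines():
        match = HBOOT_RE.search(line)
        if match:
            pair = (match.group(1), int(match.group(2)))
            if pair not in found:
                found.append(pair)
    return found


if __name__ == "__main__":
    sample = (
        'n-1<26923> ssi:boot:rsh: starting on n1 (storm.princeton.edu): '
        'hboot -t -c lam-conf.lamd -d -v -s -I '
        '"-H 128.112.176.29 -P 33559 -n 1 -o 0"'
    )
    print(callback_endpoints(sample))
```

The port is chosen afresh on each lamboot run, so the values have to be read from the current run's output.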

--
Program in Atmospheric and Oceanic Sciences
205 Sayre Hall, Forrestal Campus
Princeton University, Princeton, NJ 08544-0710 
Tel: (609) 258-1319 / Fax: (609) 258-2850