LAM/MPI General User's Mailing List Archives

From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2005-07-23 06:55:33


Yes, a firewall is *usually* (but not always) the problem in situations
like this.

Do you have a friendly sysadmin around that you can ask about the setup
between these two machines?
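
If telnet is not available on the remote machine, the same check can be sketched in a few lines of Python. This is only a rough illustration, not anything that ships with LAM: the names callback_target and probe are made up here, and the regular expression assumes the "-H <ip> -P <port>" form that hboot shows in the "lamboot -d" output quoted below. Run it on the remote host (storm) against the address and port taken from the hboot command line. Note that "refused" generally means the origin host answered and nothing is listening on that port any more, while "timeout" is more typical of packets being silently dropped; a firewall that actively rejects connections can also produce "refused", so the result is a hint rather than proof.

```python
import re
import socket


def callback_target(hboot_line):
    """Extract (ip, port) from a '-H <ip> -P <port>' fragment.

    Assumes the hboot command-line form printed by 'lamboot -d'.
    Returns None if the fragment is not present.
    """
    match = re.search(r"-H\s+(\S+)\s+-P\s+(\d+)", hboot_line)
    if match is None:
        return None
    return match.group(1), int(match.group(2))


def probe(host, port, timeout=5.0):
    """Attempt one TCP connection and classify the outcome.

    Roughly what 'telnet <ipnumber> <portnumber>' tells you.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        # The host replied, but nothing accepts connections on this port.
        return "refused"
    except socket.timeout:
        # No reply at all within the timeout; often a packet filter.
        return "timeout"
    except OSError as exc:
        # E.g. "No route to host" or a name resolution failure.
        return "error: %s" % exc


# Example (run on the remote host):
#   target = callback_target("hboot -t -c lam-conf.lamd -d -v -s -I "
#                            "\"-H 128.112.176.29 -P 33559 -n 1 -o 0\"")
#   print(probe(*target))
```

The timeout of 5 seconds is an arbitrary choice; a slow network may need a longer one.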

On Jul 21, 2005, at 2:47 PM, Marcelo Barreiro (barreiro_at_[hidden])
wrote:

> Hello,
>
> I have LAM 7.0.6 installed and I am trying to lamboot from 'labrador'
> with the following hostfile:
>
> labrador.princeton.edu
> storm.princeton.edu cpu=2
>
> I have defined LAMRSH=ssh -x
>
> I did 'lamboot -d -v hostfile' and got the message listed below
> showing that lamboot did not complete.
> There is an email in the list with a similar problem but I don't know
> how it was resolved.
>
> I tried the suggestion in that message: I logged in manually to the
> remote machine (storm) and then ran:
> 'telnet 128.112.176.29 33559'
> Trying 128.112.176.29...
> telnet: Unable to connect to remote host: Connection refused
>
> So it seems that this is a firewall problem and the ports are
> closed. Is there a range of ports that can be opened for LAM/MPI to
> work?
> Thank you,
>
> Marcelo
>
> Lamboot message:
>
> n-1<26923> ssi:boot: Opening
> n-1<26923> ssi:boot: opening module globus
> n-1<26923> ssi:boot: initializing module globus
> n-1<26923> ssi:boot:globus: globus-job-run not found, globus boot will
> not run
> n-1<26923> ssi:boot: module not available: globus
> n-1<26923> ssi:boot: opening module rsh
> n-1<26923> ssi:boot: initializing module rsh
> n-1<26923> ssi:boot:rsh: module initializing
> n-1<26923> ssi:boot:rsh:agent: ssh -x
> n-1<26923> ssi:boot:rsh:username: <same>
> n-1<26923> ssi:boot:rsh:verbose: 1000
> n-1<26923> ssi:boot:rsh:algorithm: linear
> n-1<26923> ssi:boot:rsh:priority: 10
> n-1<26923> ssi:boot: module available: rsh, priority: 10
> n-1<26923> ssi:boot: opening module tm
> n-1<26923> ssi:boot: initializing module tm
> n-1<26923> ssi:boot:tm: not running under PBS
> n-1<26923> ssi:boot: module not available: tm
> n-1<26923> ssi:boot: finalizing module globus
> n-1<26923> ssi:boot:globus: finalizing
> n-1<26923> ssi:boot: closing module globus
> n-1<26923> ssi:boot: finalizing module tm
> n-1<26923> ssi:boot:tm: finalizing
> n-1<26923> ssi:boot: closing module tm
> n-1<26923> ssi:boot: Selected boot module rsh
>
> LAM 7.0.6/MPI 2 C++/ROMIO - Indiana University
>
> n-1<26923> ssi:boot:base: looking for boot schema in following
> directories:
> n-1<26923> ssi:boot:base: <current directory>
> n-1<26923> ssi:boot:base: $TROLLIUSHOME/etc
> n-1<26923> ssi:boot:base: $LAMHOME/etc
> n-1<26923> ssi:boot:base: /home/splash/marcelo/bin/lam7_bin_pgi/etc
> n-1<26923> ssi:boot:base: looking for boot schema file:
> n-1<26923> ssi:boot:base: hostfile
> n-1<26923> ssi:boot:base: found boot schema: hostfile
> n-1<26923> ssi:boot:rsh: found the following hosts:
> n-1<26923> ssi:boot:rsh: n0 labrador.princeton.edu (cpu=1)
> n-1<26923> ssi:boot:rsh: n1 storm.princeton.edu (cpu=2)
> n-1<26923> ssi:boot:rsh: resolved hosts:
> n-1<26923> ssi:boot:rsh: n0 labrador.princeton.edu -->
> 128.112.176.29 (origin)
> n-1<26923> ssi:boot:rsh: n1 storm.princeton.edu --> 128.112.177.153
> n-1<26923> ssi:boot:rsh: starting RTE procs
> n-1<26923> ssi:boot:base:linear: starting
> n-1<26923> ssi:boot:base:server: opening server TCP socket
> n-1<26923> ssi:boot:base:server: opened port 33559
> n-1<26923> ssi:boot:base:linear: booting n0 (labrador.princeton.edu)
> n-1<26923> ssi:boot:rsh: starting lamd on (labrador.princeton.edu)
> n-1<26923> ssi:boot:rsh: starting on n0 (labrador.princeton.edu):
> hboot -t -c lam-conf.lamd -d -v -I -H 128.112.176.29 -P 33559 -n 0 -o
> 0
> n-1<26923> ssi:boot:rsh: launching locally
> hboot: performing tkill
> hboot: tkill -d
> tkill: setting prefix to (null)
> tkill: setting suffix to (null)
> tkill: got killname back:
> /tmp/lam-marcelo_at_[hidden]/lam-killfile
> tkill: removing socket file ...
> tkill: socket file:
> /tmp/lam-marcelo_at_[hidden]/lam-kernel-socketd
> tkill: removing IO daemon socket file ...
> tkill: IO daemon socket file:
> /tmp/lam-marcelo_at_[hidden]/lam-io-socket
> tkill: f_kill = "/tmp/lam-marcelo_at_[hidden]/lam-killfile"
> tkill: nothing to kill:
> "/tmp/lam-marcelo_at_[hidden]/lam-killfile"
> hboot: booting...
> hboot: fork /home/splash/marcelo/bin/lam7_bin_pgi/bin/lamd
> hboot: attempting to execute
> [1] 26926 lamd -H 128.112.176.29 -P 33559 -n 0 -o 0 -d
> n-1<26923> ssi:boot:rsh: successfully launched on n0
> (labrador.princeton.edu)
> n-1<26923> ssi:boot:base:server: expecting connection from finite list
> n-1<26926> ssi:boot: Opening
> n-1<26926> ssi:boot: opening module globus
> n-1<26926> ssi:boot: initializing module globus
> n-1<26926> ssi:boot:globus: globus-job-run not found, globus boot will
> not run
> n-1<26926> ssi:boot: module not available: globus
> n-1<26926> ssi:boot: opening module rsh
> n-1<26926> ssi:boot: initializing module rsh
> n-1<26926> ssi:boot:rsh: module initializing
> n-1<26926> ssi:boot:rsh:agent: ssh -x
> n-1<26926> ssi:boot:rsh:username: <same>
> n-1<26926> ssi:boot:rsh:verbose: 1000
> n-1<26926> ssi:boot:rsh:algorithm: linear
> n-1<26926> ssi:boot:rsh:priority: 10
> n-1<26926> ssi:boot: module available: rsh, priority: 10
> n-1<26926> ssi:boot: opening module tm
> n-1<26926> ssi:boot: initializing module tm
> n-1<26926> ssi:boot:tm: not running under PBS
> n-1<26926> ssi:boot: module not available: tm
> n-1<26926> ssi:boot: finalizing module globus
> n-1<26926> ssi:boot:globus: finalizing
> n-1<26926> ssi:boot: closing module globus
> n-1<26926> ssi:boot: finalizing module tm
> n-1<26926> ssi:boot:tm: finalizing
> n-1<26926> ssi:boot: closing module tm
> n-1<26926> ssi:boot: Selected boot module rsh
> n-1<26923> ssi:boot:base:server: got connection from 128.112.176.29
> n-1<26923> ssi:boot:base:server: this connection is expected (n0)
> n-1<26923> ssi:boot:base:server: remote lamd is at 128.112.176.29:33247
> n-1<26923> ssi:boot:base:linear: booting n1 (storm.princeton.edu)
> n-1<26923> ssi:boot:rsh: starting lamd on (storm.princeton.edu)
> n-1<26923> ssi:boot:rsh: starting on n1 (storm.princeton.edu): hboot
> -t -c lam-conf.lamd -d -v -s -I "-H 128.112.176.29 -P 33559 -n 1 -o 0"
> n-1<26923> ssi:boot:rsh: launching remotely
> n-1<26923> ssi:boot:rsh: attempting to execute "ssh -x
> storm.princeton.edu -n echo $SHELL"
> n-1<26923> ssi:boot:rsh: remote shell appending ferret path
> /bin/tcsh
> n-1<26923> ssi:boot:rsh: attempting to execute "ssh -x
> storm.princeton.edu -n hboot -t -c lam-conf.lamd -d -v -s -I "-H
> 128.112.176.29 -P 33559 -n 1 -o 0""
> appending ferret path
> tkill: setting prefix to (null)
> tkill: setting suffix to (null)
> tkill: got killname back:
> /tmp/lam-marcelo_at_[hidden]/lam-killfile
> tkill: removing socket file ...
> tkill: socket file:
> /tmp/lam-marcelo_at_[hidden]/lam-kernel-socketd
> tkill: removing IO daemon socket file ...
> tkill: IO daemon socket file:
> /tmp/lam-marcelo_at_[hidden]/lam-io-socket
> tkill: f_kill = "/tmp/lam-marcelo_at_[hidden]/lam-killfile"
> tkill: nothing to kill:
> "/tmp/lam-marcelo_at_[hidden]/lam-killfile"
> hboot: performing tkill
> hboot: tkill -d
> hboot: booting...
> hboot: fork /home/splash/marcelo/bin/lam7_bin_pgi/bin/lamd
> [1] 31378 lamd -H 128.112.176.29 -P 33559 -n 1 -o 0 -d
> n-1<26923> ssi:boot:rsh: successfully launched on n1
> (storm.princeton.edu)
> n-1<26923> ssi:boot:base:server: expecting connection from finite list
> -----------------------------------------------------------------------
> ------
> The lamboot agent timed out while waiting for the newly-booted process
> to call back and indicated that it had successfully booted.
>
> As far as LAM could tell, the remote process started properly, but
> then never called back. Possible reasons that this may happen:
>
> - There are network filters between the lamboot agent host and
> the remote host such that communication on random TCP ports
> is blocked
> - Network routing from the remote host to the local host isn't
> properly configured (this is uncommon)
>
> You can check these things by watching the output from "lamboot -d".
>
> 1. On the command line for hboot, there are two important parameters:
> one is the IP address of where the lamboot agent was invoked, the
> other is the port number that the lamboot agent is expecting the
> newly-booted process to call back on (this will be a random
> integer).
>
> 2. Manually login to the remote machine and try to telnet to the port
> indicated on the hboot command line. For example,
> telnet <ipnumber> <portnumber>
> If all goes well, you should get a "Connection refused" error. If
> you get any other kind of error, it could indicate either of the
> two conditions above. Consult with your system/network
> administrator.
> -----------------------------------------------------------------------
> ------
> n-1<26923> ssi:boot:base:server: failed to connect to remote lamd!
> n-1<26923> ssi:boot:base:server: closing server socket
> n-1<26923> ssi:boot:base:linear: aborted!
> -----------------------------------------------------------------------
> ------
> lamboot encountered some error (see above) during the boot process,
> and will now attempt to kill all nodes that it was previously able to
> boot (if any).
>
> Please wait for LAM to finish; if you interrupt this process, you may
> have LAM daemons still running on remote nodes.
> -----------------------------------------------------------------------
> ------
> n-1<26935> ssi:boot: Opening
> n-1<26935> ssi:boot: opening module globus
> n-1<26935> ssi:boot: initializing module globus
> n-1<26935> ssi:boot:globus: globus-job-run not found, globus boot will
> not run
> n-1<26935> ssi:boot: module not available: globus
> n-1<26935> ssi:boot: opening module rsh
> n-1<26935> ssi:boot: initializing module rsh
> n-1<26935> ssi:boot:rsh: module initializing
> n-1<26935> ssi:boot:rsh:agent: ssh -x
> n-1<26935> ssi:boot:rsh:username: <same>
> n-1<26935> ssi:boot:rsh:verbose: 1000
> n-1<26935> ssi:boot:rsh:algorithm: linear
> n-1<26935> ssi:boot:rsh:priority: 10
> n-1<26935> ssi:boot: module available: rsh, priority: 10
> n-1<26935> ssi:boot: opening module tm
> n-1<26935> ssi:boot: initializing module tm
> n-1<26935> ssi:boot:tm: not running under PBS
> n-1<26935> ssi:boot: module not available: tm
> n-1<26935> ssi:boot: finalizing module globus
> n-1<26935> ssi:boot:globus: finalizing
> n-1<26935> ssi:boot: closing module globus
> n-1<26935> ssi:boot: finalizing module tm
> n-1<26935> ssi:boot:tm: finalizing
> n-1<26935> ssi:boot: closing module tm
> n-1<26935> ssi:boot: Selected boot module rsh
> n-1<26935> ssi:boot:base: looking for boot schema in following
> directories:
> n-1<26935> ssi:boot:base: <current directory>
> n-1<26935> ssi:boot:base: $TROLLIUSHOME/etc
> n-1<26935> ssi:boot:base: $LAMHOME/etc
> n-1<26935> ssi:boot:base: /home/splash/marcelo/bin/lam7_bin_pgi/etc
> n-1<26935> ssi:boot:base: looking for boot schema file:
> n-1<26935> ssi:boot:base: hostfile
> n-1<26935> ssi:boot:base: found boot schema: hostfile
> n-1<26935> ssi:boot:rsh: found the following hosts:
> n-1<26935> ssi:boot:rsh: n0 labrador.princeton.edu (cpu=1)
> n-1<26935> ssi:boot:rsh: n1 storm.princeton.edu (cpu=2)
> n-1<26935> ssi:boot:rsh: resolved hosts:
> n-1<26935> ssi:boot:rsh: n0 labrador.princeton.edu -->
> 128.112.176.29 (origin)
> n-1<26935> ssi:boot:rsh: n1 storm.princeton.edu --> 128.112.177.153
> n-1<26935> ssi:boot:rsh: starting RTE procs
> n-1<26935> ssi:boot:base:linear: starting
> n-1<26935> ssi:boot:base:linear: booting n0 (labrador.princeton.edu)
> n-1<26935> ssi:boot:rsh: starting wipe on (labrador.princeton.edu)
> n-1<26935> ssi:boot:rsh: starting on n0 (labrador.princeton.edu):
> tkill -d -v
> n-1<26935> ssi:boot:rsh: launching locally
> tkill: setting prefix to (null)
> tkill: setting suffix to (null)
> tkill: got killname back:
> /tmp/lam-marcelo_at_[hidden]/lam-killfile
> tkill: removing socket file ...
> tkill: socket file:
> /tmp/lam-marcelo_at_[hidden]/lam-kernel-socketd
> tkill: removing IO daemon socket file ...
> tkill: IO daemon socket file:
> /tmp/lam-marcelo_at_[hidden]/lam-io-socket
> tkill: f_kill = "/tmp/lam-marcelo_at_[hidden]/lam-killfile"
> tkill: killing LAM...
> tkill: killing PID (SIGHUP) 26926 ...
> tkill: killed
> tkill: all finished
> n-1<26935> ssi:boot:rsh: successfully launched on n0
> (labrador.princeton.edu)
> n-1<26935> ssi:boot:base:linear: booting n1 (storm.princeton.edu)
> n-1<26935> ssi:boot:rsh: starting wipe on (storm.princeton.edu)
> n-1<26935> ssi:boot:rsh: starting on n1 (storm.princeton.edu): tkill
> -d -v
> n-1<26935> ssi:boot:rsh: launching remotely
> n-1<26935> ssi:boot:rsh: attempting to execute "ssh -x
> storm.princeton.edu -n echo $SHELL"
> n-1<26935> ssi:boot:rsh: remote shell appending ferret path
> /bin/tcsh
> n-1<26935> ssi:boot:rsh: attempting to execute "ssh -x
> storm.princeton.edu -n tkill -d -v"
> appending ferret path
> tkill: setting prefix to (null)
> tkill: setting suffix to (null)
> tkill: got killname back:
> /tmp/lam-marcelo_at_[hidden]/lam-killfile
> tkill: removing socket file ...
> tkill: socket file:
> /tmp/lam-marcelo_at_[hidden]/lam-kernel-socketd
> tkill: removing IO daemon socket file ...
> tkill: IO daemon socket file:
> /tmp/lam-marcelo_at_[hidden]/lam-io-socket
> tkill: f_kill = "/tmp/lam-marcelo_at_[hidden]/lam-killfile"
> tkill: killing LAM...
> tkill: killing PID (SIGHUP) 31378 ...
> tkill: already dead
> tkill: all finished
> n-1<26935> ssi:boot:rsh: successfully launched on n1
> (storm.princeton.edu)
> n-1<26935> ssi:boot:base:linear: finished
> n-1<26935> ssi:boot:rsh: all RTE procs started
> n-1<26935> ssi:boot:rsh: finalizing
> n-1<26935> ssi:boot: Closing
> lamboot did NOT complete successfully
>
> --
> Program in Atmospheric and Oceanic Sciences
> 205 Sayre Hall, Forrestal Campus
> Princeton University, Princeton, NJ 08544-0710
> Tel: (609) 258-1319 / Fax: (609) 258-2850
>
> _______________________________________________
> This list is archived at http://www.lam-mpi.org/MailArchives/lam/
>

-- 
{+} Jeff Squyres
{+} jsquyres_at_[hidden]
{+} http://www.lam-mpi.org/