LAM/MPI General User's Mailing List Archives

From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2005-02-12 08:54:37


This is quite an odd (but not unheard of) error. What is happening is:

Time  Memory                        Coral
0     ssh coral
1                                   hboot executes
2                                   lamd executes
3                                   lamd opens socket to lamboot
4     lamboot and lamd exchange information, close socket
5     lamboot tries to open
      socket to lamd

It is #5 that fails, so it's odd that #4 succeeds (i.e., we can open a
socket one way but can't open a socket the other way).
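
To make the asymmetry concrete, here is a minimal sketch of steps 3-5 run
entirely on the loopback interface. It is not LAM's code: the function names,
the message contents, and the idea of sending the port as plain text are all
assumptions made for illustration. The "daemon" thread stands in for lamd and
the "agent" function stands in for lamboot.

```python
import socket
import threading


def read_all(conn):
    """Read from a connected socket until the peer closes it."""
    chunks = []
    while True:
        chunk = conn.recv(1024)
        if not chunk:
            break
        chunks.append(chunk)
    return b"".join(chunks)


def daemon(agent_addr, received):
    """Hypothetical stand-in for lamd."""
    # Listen on an ephemeral port first, so the agent can connect back later.
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))
    listener.listen(1)
    listener.settimeout(5)
    port = listener.getsockname()[1]
    # Steps 3-4: daemon connects to the agent, reports its port, closes.
    with socket.create_connection(agent_addr, timeout=5) as s:
        s.sendall(str(port).encode())
    # Step 5 (daemon side): wait for the agent to connect back.
    conn, _ = listener.accept()
    with conn:
        received.append(read_all(conn).decode())
    listener.close()


def agent():
    """Hypothetical stand-in for lamboot; returns (daemon_port, received)."""
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind(("127.0.0.1", 0))
    server.listen(1)
    server.settimeout(5)
    received = []
    t = threading.Thread(target=daemon, args=(server.getsockname(), received))
    t.start()
    # Steps 3-4 (agent side): accept the daemon's connection, read its port.
    conn, peer = server.accept()
    with conn:
        daemon_port = int(read_all(conn).decode())
    server.close()
    # Step 5: the agent now initiates a connection in the other direction.
    # This is the direction that would fail if inbound traffic to the
    # daemon's host were filtered.
    with socket.create_connection((peer[0], daemon_port), timeout=5) as s:
        s.sendall(b"links=2")
    t.join()
    return daemon_port, received


if __name__ == "__main__":
    port, received = agent()
    print("agent connected back to daemon port", port, "and sent", received)
```

On one machine both directions succeed. Between two machines, the second
create_connection() call is the only one that needs inbound access to a
random port on the remote host.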

- Did you confirm that LAM was getting the right IP address for coral?
- You might want to check with local system administrators to see if
any firewalls are in place between the machines (e.g., at the router or
switch level)
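
If it helps, both checks can be scripted. The sketch below is only an
illustration (the probe() function and its return values are made up here,
not part of LAM): it resolves a host name, tries one TCP connection, and
reports whether the port was open, refused, or timed out. A refusal means
some machine at that address answered and nothing was listening on the port;
a timeout is more typical of packets being silently dropped along the way,
although a filter that actively rejects connections can also look like a
refusal.

```python
import socket


def probe(host, port, timeout=5.0):
    """Return (status, detail) for one TCP connection attempt.

    status is one of: "unresolved", "open", "refused", "timeout", "error".
    """
    try:
        # Check what address the name actually maps to on this machine.
        addr = socket.gethostbyname(host)
    except socket.gaierror as exc:
        return ("unresolved", str(exc))
    try:
        with socket.create_connection((addr, port), timeout=timeout):
            return ("open", addr)
    except ConnectionRefusedError:
        # The host was reached, but no process is listening on that port.
        return ("refused", addr)
    except socket.timeout:
        # No answer at all within the timeout.
        return ("timeout", addr)
    except OSError as exc:
        # Anything else, e.g. "no route to host" or "network unreachable".
        return ("error", f"{addr}: {exc}")
```

For example, probe("coral.syr.edu", 33229) run on memory would show the
address that memory resolves for coral, along with the result of the
connection attempt.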

On Feb 11, 2005, at 12:51 PM, Aditya Datey wrote:

> Hi,
>
> This is a "lamboot failed to open a socket at the newly booted process"
> error. Hope someone can help!
>
> Machine configs:
> RH 8, p4s.
> LAM 7.0.6 on all nodes.
>
> recon finished with the woo hoo! msg.
>
> The lamboot output is below.
> Among the causes suggested in its error message:
> 1 is not the problem.
> 2 and 3 are probably not the problem, since I got the "connection
> refused" error as described.
>
> The two machines in question are not physically located in the same
> room. Lamboot successfully booted machines in the same room (connected
> via standard ethernet hub). This & some previous posts lead me to
> believe it might be a firewall problem.
> But there's no firewall on the machines themselves (no iptables running).
>
> * How does one check if any other firewall exists?
>
> * What could be the problem here?
>
>
>
> lamhosts file
> =============
> memory.syr.edu (I'm using this as the source node)
> coral.syr.edu
>
>
>
> Telnet (26=coral)
> ==============
> telnet 128.230.37.26 33229
> Trying 128.230.37.26...
> telnet: Unable to connect to remote host: Connection refused
>
> I can ssh to the remote node (coral) without a password.
>
> I also manually ran the hboot command on the remote node. Here's the
> output:
> Manually running hboot
> =======================
>
> hboot -t -c lam-conf.lamd -d -s -I "-H 128.230.37.237 -P 40976 -n 1 -o
> 0"
> hboot: performing tkill
> hboot: tkill -d
> tkill: setting prefix to (null)
> tkill: setting suffix to (null)
> tkill: got killname back: /tmp/lam-avdatey_at_[hidden]/lam-killfile
> tkill: removing socket file ...
> tkill: socket file: /tmp/lam-avdatey_at_[hidden]/lam-kernel-socketd
> tkill: removing IO daemon socket file ...
> tkill: IO daemon socket file:
> /tmp/lam-avdatey_at_[hidden]/lam-io-socket
> tkill: f_kill = "/tmp/lam-avdatey_at_[hidden]/lam-killfile"
> tkill: nothing to kill: "/tmp/lam-avdatey_at_[hidden]/lam-killfile"
> hboot: booting...
> hboot: fork /usr/bin/lamd
> [1] 8658 lamd -H 128.230.37.237 -P 40976 -n 1 -o 0 -d
>
>
> But lamd does not show as running even after this,
> i.e., "ps aux | grep lam" shows no output.
>
>
> lamboot output
> ===============
> lamboot -d lamhosts
> n-1<16688> ssi:boot: Opening
> n-1<16688> ssi:boot: opening module globus
> n-1<16688> ssi:boot: initializing module globus
> n-1<16688> ssi:boot:globus: globus-job-run not found, globus boot will
> not run
> n-1<16688> ssi:boot: module not available: globus
> n-1<16688> ssi:boot: opening module rsh
> n-1<16688> ssi:boot: initializing module rsh
> n-1<16688> ssi:boot:rsh: module initializing
> n-1<16688> ssi:boot:rsh:agent: ssh -x
> n-1<16688> ssi:boot:rsh:username: <same>
> n-1<16688> ssi:boot:rsh:verbose: 1000
> n-1<16688> ssi:boot:rsh:algorithm: linear
> n-1<16688> ssi:boot:rsh:priority: 10
> n-1<16688> ssi:boot: module available: rsh, priority: 10
> n-1<16688> ssi:boot: finalizing module globus
> n-1<16688> ssi:boot:globus: finalizing
> n-1<16688> ssi:boot: closing module globus
> n-1<16688> ssi:boot: Selected boot module rsh
>
> LAM 7.0.6/MPI 2 C++/ROMIO - Indiana University
>
> n-1<16688> ssi:boot:base: looking for boot schema in following
> directories:
> n-1<16688> ssi:boot:base: <current directory>
> n-1<16688> ssi:boot:base: $TROLLIUSHOME/etc
> n-1<16688> ssi:boot:base: $LAMHOME/etc
> n-1<16688> ssi:boot:base: /usr/etc
> n-1<16688> ssi:boot:base: looking for boot schema file:
> n-1<16688> ssi:boot:base: lamhosts
> n-1<16688> ssi:boot:base: found boot schema: lamhosts
> n-1<16688> ssi:boot:rsh: found the following hosts:
> n-1<16688> ssi:boot:rsh: n0 memory.syr.edu (cpu=1)
> n-1<16688> ssi:boot:rsh: n1 coral.syr.edu (cpu=1)
> n-1<16688> ssi:boot:rsh: resolved hosts:
> n-1<16688> ssi:boot:rsh: n0 memory.syr.edu --> 128.230.37.237
> (origin)
> n-1<16688> ssi:boot:rsh: n1 coral.syr.edu --> 128.230.37.26
> n-1<16688> ssi:boot:rsh: starting RTE procs
> n-1<16688> ssi:boot:base:linear: starting
> n-1<16688> ssi:boot:base:server: opening server TCP socket
> n-1<16688> ssi:boot:base:server: opened port 40976
> n-1<16688> ssi:boot:base:linear: booting n0 (memory.syr.edu)
> n-1<16688> ssi:boot:rsh: starting lamd on (memory.syr.edu)
> n-1<16688> ssi:boot:rsh: starting on n0 (memory.syr.edu): hboot -t -c
> lam-conf.lamd -d -I -H 128.230.37.237 -P 40976 -n 0 -o 0
> n-1<16688> ssi:boot:rsh: launching locally
> hboot: performing tkill
> hboot: tkill -d
> tkill: setting prefix to (null)
> tkill: setting suffix to (null)
> tkill: got killname back: /tmp/lam-avdatey_at_[hidden]/lam-killfile
> tkill: removing socket file ...
> tkill: socket file: /tmp/lam-avdatey_at_[hidden]/lam-kernel-socketd
> tkill: removing IO daemon socket file ...
> tkill: IO daemon socket file:
> /tmp/lam-avdatey_at_[hidden]/lam-io-socket
> tkill: f_kill = "/tmp/lam-avdatey_at_[hidden]/lam-killfile"
> tkill: nothing to kill: "/tmp/lam-avdatey_at_[hidden]/lam-killfile"
> hboot: booting...
> hboot: fork /usr/bin/lamd
> [1] 16691 lamd -H 128.230.37.237 -P 40976 -n 0 -o 0 -d
> hboot: attempting to execute
> n-1<16691> ssi:boot: Opening
> n-1<16691> ssi:boot: opening module globus
> n-1<16691> ssi:boot: initializing module globus
> n-1<16691> ssi:boot:globus: globus-job-run not found, globus boot will
> not run
> n-1<16691> ssi:boot: module not available: globus
> n-1<16691> ssi:boot: opening module rsh
> n-1<16691> ssi:boot: initializing module rsh
> n-1<16691> ssi:boot:rsh: module initializing
> n-1<16691> ssi:boot:rsh:agent: ssh -x
> n-1<16691> ssi:boot:rsh:username: <same>
> n-1<16691> ssi:boot:rsh:verbose: 1000
> n-1<16691> ssi:boot:rsh:algorithm: linear
> n-1<16691> ssi:boot:rsh:priority: 10
> n-1<16691> ssi:boot: module available: rsh, priority: 10
> n-1<16691> ssi:boot: finalizing module globus
> n-1<16691> ssi:boot:globus: finalizing
> n-1<16691> ssi:boot: closing module globus
> n-1<16691> ssi:boot: Selected boot module rsh
> n-1<16688> ssi:boot:rsh: successfully launched on n0 (memory.syr.edu)
> n-1<16688> ssi:boot:base:server: expecting connection from finite list
> n-1<16688> ssi:boot:base:server: got connection from 128.230.37.237
> n-1<16688> ssi:boot:base:server: this connection is expected (n0)
> n-1<16688> ssi:boot:base:server: remote lamd is at 128.230.37.237:33050
> n-1<16688> ssi:boot:base:linear: booting n1 (coral.syr.edu)
> n-1<16688> ssi:boot:rsh: starting lamd on (coral.syr.edu)
> n-1<16688> ssi:boot:rsh: starting on n1 (coral.syr.edu): hboot -t -c
> lam-conf.lamd -d -s -I "-H 128.230.37.237 -P 40976 -n 1 -o 0"
> n-1<16688> ssi:boot:rsh: launching remotely
> n-1<16688> ssi:boot:rsh: attempting to execute "ssh -x coral.syr.edu -n
> echo $SHELL"
> n-1<16688> ssi:boot:rsh: remote shell /bin/bash
> n-1<16688> ssi:boot:rsh: attempting to execute "ssh -x coral.syr.edu -n
> hboot -t -c lam-conf.lamd -d -s -I "-H 128.230.37.237 -P 40976 -n 1 -o
> 0""
> tkill: setting prefix to (null)
> tkill: setting suffix to (null)
> tkill: got killname back: /tmp/lam-avdatey_at_[hidden]/lam-killfile
> tkill: removing socket file ...
> tkill: socket file: /tmp/lam-avdatey_at_[hidden]/lam-kernel-socketd
> tkill: removing IO daemon socket file ...
> tkill: IO daemon socket file:
> /tmp/lam-avdatey_at_[hidden]/lam-io-socket
> tkill: f_kill = "/tmp/lam-avdatey_at_[hidden]/lam-killfile"
> tkill: nothing to kill: "/tmp/lam-avdatey_at_[hidden]/lam-killfile"
> hboot: performing tkill
> hboot: tkill -d
> hboot: booting...
> hboot: fork /usr/bin/lamd
> [1] 8375 lamd -H 128.230.37.237 -P 40976 -n 1 -o 0 -d
> n-1<16688> ssi:boot:rsh: successfully launched on n1 (coral.syr.edu)
> n-1<16688> ssi:boot:base:server: expecting connection from finite list
> n-1<16688> ssi:boot:base:server: got connection from 128.230.37.26
> n-1<16688> ssi:boot:base:server: this connection is expected (n1)
> n-1<16688> ssi:boot:base:server: remote lamd is at 128.230.37.26:32793
> n-1<16688> ssi:boot:base:server: closing server socket
> n-1<16688> ssi:boot:base:server: connecting to lamd at
> 128.230.37.237:40977
> n-1<16688> ssi:boot:base:server: connected
> n-1<16688> ssi:boot:base:server: sending number of links (2)
> n-1<16688> ssi:boot:base:server: sending info: n0 (memory.syr.edu)
> n-1<16688> ssi:boot:base:server: sending info: n1 (coral.syr.edu)
> n-1<16688> ssi:boot:base:server: finished sending
> n-1<16688> ssi:boot:base:server: disconnected from 128.230.37.237:40977
> n-1<16688> ssi:boot:base:server: connecting to lamd at
> 128.230.37.26:33229
> n-1<16691> ssi:boot:rsh: finalizing
> n-1<16691> ssi:boot: Closing
> -----------------------------------------------------------------------
> ------
> The lamboot agent failed to open a client socket to the newly-booted
> process at IP address 128.230.37.26, port 33229.
>
> Although the newly-booted process has already communicated
> successfully with the lamboot agent over other TCP sockets, this is
> the first time that the lamboot agent tried to initiate a connection
> to the newly-booted process. As such, this may indicate:
>
> 1. 128.230.37.26 is not the correct IP address for the machine
> where the newly-booted machine was launched
> 2. There are network filters between the lamboot agent host and
> the remote host such that communication on random TCP ports
> is blocked
> 3. Network routing from the the local host to the remote isn't
> properly configured (this is unlikely)
>
> For number 1, check to ensure that 128.230.37.26 is the correct IP
> address for
> that machine. If it is not, check the host mapping on that machine
> (e.g., /etc/hosts) to ensure that 128.230.37.26 is both reachable and
> is
> the by
> the host where the lamboot agent is running, and is the correct host.
>
> For numbers 2 and 4, try to telnet to 128.230.37.26, port 33229. You
> should get a
> "connection refused" error, which will indicate that you successfully
> connected to some machine at that IP address, and no process was
> listening on that port. If you get any other kind of error, check
> with your system/network administrator -- it may indicate network /
> routing issues between the two hosts.
> -----------------------------------------------------------------------
> ------
> n-1<16688> ssi:boot:base:linear: aborted!
> -----------------------------------------------------------------------
> ------
> lamboot encountered some error (see above) during the boot process,
> and will now attempt to kill all nodes that it was previously able to
> boot (if any).
>
> Please wait for LAM to finish; if you interrupt this process, you may
> have LAM daemons still running on remote nodes.
> -----------------------------------------------------------------------
> ------
> n-1<16694> ssi:boot: Opening
> n-1<16694> ssi:boot: opening module globus
> n-1<16694> ssi:boot: initializing module globus
> n-1<16694> ssi:boot:globus: globus-job-run not found, globus boot will
> not run
> n-1<16694> ssi:boot: module not available: globus
> n-1<16694> ssi:boot: opening module rsh
> n-1<16694> ssi:boot: initializing module rsh
> n-1<16694> ssi:boot:rsh: module initializing
> n-1<16694> ssi:boot:rsh:agent: ssh -x
> n-1<16694> ssi:boot:rsh:username: <same>
> n-1<16694> ssi:boot:rsh:verbose: 1000
> n-1<16694> ssi:boot:rsh:algorithm: linear
> n-1<16694> ssi:boot:rsh:priority: 10
> n-1<16694> ssi:boot: module available: rsh, priority: 10
> n-1<16694> ssi:boot: finalizing module globus
> n-1<16694> ssi:boot:globus: finalizing
> n-1<16694> ssi:boot: closing module globus
> n-1<16694> ssi:boot: Selected boot module rsh
> n-1<16694> ssi:boot:base: looking for boot schema in following
> directories:
> n-1<16694> ssi:boot:base: <current directory>
> n-1<16694> ssi:boot:base: $TROLLIUSHOME/etc
> n-1<16694> ssi:boot:base: $LAMHOME/etc
> n-1<16694> ssi:boot:base: /usr/etc
> n-1<16694> ssi:boot:base: looking for boot schema file:
> n-1<16694> ssi:boot:base: lamhosts
> n-1<16694> ssi:boot:base: found boot schema: lamhosts
> n-1<16694> ssi:boot:rsh: found the following hosts:
> n-1<16694> ssi:boot:rsh: n0 memory.syr.edu (cpu=1)
> n-1<16694> ssi:boot:rsh: n1 coral.syr.edu (cpu=1)
> n-1<16694> ssi:boot:rsh: resolved hosts:
> n-1<16694> ssi:boot:rsh: n0 memory.syr.edu --> 128.230.37.237
> (origin)
> n-1<16694> ssi:boot:rsh: n1 coral.syr.edu --> 128.230.37.26
> n-1<16694> ssi:boot:rsh: starting RTE procs
> n-1<16694> ssi:boot:base:linear: starting
> n-1<16694> ssi:boot:base:linear: booting n0 (memory.syr.edu)
> n-1<16694> ssi:boot:rsh: starting wipe on (memory.syr.edu)
> n-1<16694> ssi:boot:rsh: starting on n0 (memory.syr.edu): tkill -d
> n-1<16694> ssi:boot:rsh: launching locally
> tkill: setting prefix to (null)
> tkill: setting suffix to (null)
> tkill: got killname back: /tmp/lam-avdatey_at_[hidden]/lam-killfile
> tkill: removing socket file ...
> tkill: socket file: /tmp/lam-avdatey_at_[hidden]/lam-kernel-socketd
> tkill: removing IO daemon socket file ...
> tkill: IO daemon socket file:
> /tmp/lam-avdatey_at_[hidden]/lam-io-socket
> tkill: f_kill = "/tmp/lam-avdatey_at_[hidden]/lam-killfile"
> tkill: killing LAM...
> tkill: killing PID (SIGHUP) 16691 ...
> tkill: killed
> tkill: all finished
> n-1<16694> ssi:boot:rsh: successfully launched on n0 (memory.syr.edu)
> n-1<16694> ssi:boot:base:linear: booting n1 (coral.syr.edu)
> n-1<16694> ssi:boot:rsh: starting wipe on (coral.syr.edu)
> n-1<16694> ssi:boot:rsh: starting on n1 (coral.syr.edu): tkill -d
> n-1<16694> ssi:boot:rsh: launching remotely
> n-1<16694> ssi:boot:rsh: attempting to execute "ssh -x coral.syr.edu -n
> echo $SHELL"
> n-1<16694> ssi:boot:rsh: remote shell /bin/bash
> n-1<16694> ssi:boot:rsh: attempting to execute "ssh -x coral.syr.edu -n
> tkill -d"
> tkill: setting prefix to (null)
> tkill: setting suffix to (null)
> tkill: got killname back: /tmp/lam-avdatey_at_[hidden]/lam-killfile
> tkill: removing socket file ...
> tkill: socket file: /tmp/lam-avdatey_at_[hidden]/lam-kernel-socketd
> tkill: removing IO daemon socket file ...
> tkill: IO daemon socket file:
> /tmp/lam-avdatey_at_[hidden]/lam-io-socket
> tkill: f_kill = "/tmp/lam-avdatey_at_[hidden]/lam-killfile"
> tkill: killing LAM...
> tkill: killing PID (SIGHUP) 8375 ...
> tkill: killed
> tkill: all finished
> n-1<16694> ssi:boot:rsh: successfully launched on n1 (coral.syr.edu)
> n-1<16694> ssi:boot:base:linear: finished
> n-1<16694> ssi:boot:rsh: all RTE procs started
> n-1<16694> ssi:boot:rsh: finalizing
> n-1<16694> ssi:boot: Closing
> lamboot did NOT complete successfully
>
>
> Thanks,
> Aditya
>

-- 
{+} Jeff Squyres
{+} jsquyres_at_[hidden]
{+} http://www.lam-mpi.org/