LAM/MPI General User's Mailing List Archives

From: Aditya Datey (avdatey_at_[hidden])
Date: 2005-02-11 12:51:19


Hi,

lamboot fails with the "failed to open a client socket to the newly-booted
process" error (full output below). I hope someone can help!

Machine configs:
Red Hat 8, Pentium 4s.
LAM 7.0.6 on all nodes.

recon finished with the "Woo hoo!" message.

The lamboot output is below.
Of the possible causes it suggests:
1 is not the problem.
2 and 3 are probably not the problem either, since telnet gave me the
"connection refused" error, as the message says it should.

The two machines in question are not physically located in the same
room. lamboot successfully booted machines in the same room (connected
via a standard Ethernet hub). This and some previous posts lead me to
believe it might be a firewall problem.
But there's no firewall between the machines (no iptables running).

* How does one check if any other firewall exists?

* What could be the problem here?

lamhosts file
=============
memory.syr.edu (I'm using this as the source node)
coral.syr.edu

Telnet (26=coral)
==============
telnet 128.230.37.26 33229
Trying 128.230.37.26...
telnet: Unable to connect to remote host: Connection refused
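
The lamboot message says a "connection refused" reply means that some machine at that address was reached and that no process was listening on the port. As a rough sketch of the same check without telnet, assuming Python 3 is available on the source node, the function below classifies a single TCP connection attempt. The name probe_tcp and the result strings are hypothetical and are not part of LAM.

```python
import errno
import socket


def probe_tcp(host, port, timeout=5.0):
    """Try one TCP connection to host:port and classify the outcome."""
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return "open"          # something accepted the connection
    except ConnectionRefusedError:
        return "refused"           # host answered, nothing listening on the port
    except socket.timeout:
        return "timeout"           # no answer at all within the timeout
    except OSError as exc:
        if exc.errno in (errno.EHOSTUNREACH, errno.ENETUNREACH):
            return "unreachable"   # no route to the host or network
        return "error: %s" % exc   # anything else, reported as-is
```

For the failing connection in this post the call would be probe_tcp("128.230.37.26", 33229). A "refused" result matches what telnet reported here; a "timeout" would be more consistent with packets being silently dropped somewhere between the two hosts.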

I can ssh to the remote node (coral) without a password.

I also manually ran the hboot command on the remote node. Here's the output:
Manually running hboot
=======================

hboot -t -c lam-conf.lamd -d -s -I "-H 128.230.37.237 -P 40976 -n 1 -o 0"
hboot: performing tkill
hboot: tkill -d
tkill: setting prefix to (null)
tkill: setting suffix to (null)
tkill: got killname back: /tmp/lam-avdatey_at_[hidden]/lam-killfile
tkill: removing socket file ...
tkill: socket file: /tmp/lam-avdatey_at_[hidden]/lam-kernel-socketd
tkill: removing IO daemon socket file ...
tkill: IO daemon socket file:
/tmp/lam-avdatey_at_[hidden]/lam-io-socket
tkill: f_kill = "/tmp/lam-avdatey_at_[hidden]/lam-killfile"
tkill: nothing to kill: "/tmp/lam-avdatey_at_[hidden]/lam-killfile"
hboot: booting...
hboot: fork /usr/bin/lamd
[1] 8658 lamd -H 128.230.37.237 -P 40976 -n 1 -o 0 -d

But LAM does not show as running even after this;
i.e., ps aux | grep lam shows no output.

lamboot output
===============
lamboot -d lamhosts
n-1<16688> ssi:boot: Opening
n-1<16688> ssi:boot: opening module globus
n-1<16688> ssi:boot: initializing module globus
n-1<16688> ssi:boot:globus: globus-job-run not found, globus boot will
not run
n-1<16688> ssi:boot: module not available: globus
n-1<16688> ssi:boot: opening module rsh
n-1<16688> ssi:boot: initializing module rsh
n-1<16688> ssi:boot:rsh: module initializing
n-1<16688> ssi:boot:rsh:agent: ssh -x
n-1<16688> ssi:boot:rsh:username: <same>
n-1<16688> ssi:boot:rsh:verbose: 1000
n-1<16688> ssi:boot:rsh:algorithm: linear
n-1<16688> ssi:boot:rsh:priority: 10
n-1<16688> ssi:boot: module available: rsh, priority: 10
n-1<16688> ssi:boot: finalizing module globus
n-1<16688> ssi:boot:globus: finalizing
n-1<16688> ssi:boot: closing module globus
n-1<16688> ssi:boot: Selected boot module rsh

LAM 7.0.6/MPI 2 C++/ROMIO - Indiana University

n-1<16688> ssi:boot:base: looking for boot schema in following
directories:
n-1<16688> ssi:boot:base: <current directory>
n-1<16688> ssi:boot:base: $TROLLIUSHOME/etc
n-1<16688> ssi:boot:base: $LAMHOME/etc
n-1<16688> ssi:boot:base: /usr/etc
n-1<16688> ssi:boot:base: looking for boot schema file:
n-1<16688> ssi:boot:base: lamhosts
n-1<16688> ssi:boot:base: found boot schema: lamhosts
n-1<16688> ssi:boot:rsh: found the following hosts:
n-1<16688> ssi:boot:rsh: n0 memory.syr.edu (cpu=1)
n-1<16688> ssi:boot:rsh: n1 coral.syr.edu (cpu=1)
n-1<16688> ssi:boot:rsh: resolved hosts:
n-1<16688> ssi:boot:rsh: n0 memory.syr.edu --> 128.230.37.237 (origin)
n-1<16688> ssi:boot:rsh: n1 coral.syr.edu --> 128.230.37.26
n-1<16688> ssi:boot:rsh: starting RTE procs
n-1<16688> ssi:boot:base:linear: starting
n-1<16688> ssi:boot:base:server: opening server TCP socket
n-1<16688> ssi:boot:base:server: opened port 40976
n-1<16688> ssi:boot:base:linear: booting n0 (memory.syr.edu)
n-1<16688> ssi:boot:rsh: starting lamd on (memory.syr.edu)
n-1<16688> ssi:boot:rsh: starting on n0 (memory.syr.edu): hboot -t -c
lam-conf.lamd -d -I -H 128.230.37.237 -P 40976 -n 0 -o 0
n-1<16688> ssi:boot:rsh: launching locally
hboot: performing tkill
hboot: tkill -d
tkill: setting prefix to (null)
tkill: setting suffix to (null)
tkill: got killname back: /tmp/lam-avdatey_at_[hidden]/lam-killfile
tkill: removing socket file ...
tkill: socket file: /tmp/lam-avdatey_at_[hidden]/lam-kernel-socketd
tkill: removing IO daemon socket file ...
tkill: IO daemon socket file:
/tmp/lam-avdatey_at_[hidden]/lam-io-socket
tkill: f_kill = "/tmp/lam-avdatey_at_[hidden]/lam-killfile"
tkill: nothing to kill: "/tmp/lam-avdatey_at_[hidden]/lam-killfile"
hboot: booting...
hboot: fork /usr/bin/lamd
[1] 16691 lamd -H 128.230.37.237 -P 40976 -n 0 -o 0 -d
hboot: attempting to execute
n-1<16691> ssi:boot: Opening
n-1<16691> ssi:boot: opening module globus
n-1<16691> ssi:boot: initializing module globus
n-1<16691> ssi:boot:globus: globus-job-run not found, globus boot will
not run
n-1<16691> ssi:boot: module not available: globus
n-1<16691> ssi:boot: opening module rsh
n-1<16691> ssi:boot: initializing module rsh
n-1<16691> ssi:boot:rsh: module initializing
n-1<16691> ssi:boot:rsh:agent: ssh -x
n-1<16691> ssi:boot:rsh:username: <same>
n-1<16691> ssi:boot:rsh:verbose: 1000
n-1<16691> ssi:boot:rsh:algorithm: linear
n-1<16691> ssi:boot:rsh:priority: 10
n-1<16691> ssi:boot: module available: rsh, priority: 10
n-1<16691> ssi:boot: finalizing module globus
n-1<16691> ssi:boot:globus: finalizing
n-1<16691> ssi:boot: closing module globus
n-1<16691> ssi:boot: Selected boot module rsh
n-1<16688> ssi:boot:rsh: successfully launched on n0 (memory.syr.edu)
n-1<16688> ssi:boot:base:server: expecting connection from finite list
n-1<16688> ssi:boot:base:server: got connection from 128.230.37.237
n-1<16688> ssi:boot:base:server: this connection is expected (n0)
n-1<16688> ssi:boot:base:server: remote lamd is at 128.230.37.237:33050
n-1<16688> ssi:boot:base:linear: booting n1 (coral.syr.edu)
n-1<16688> ssi:boot:rsh: starting lamd on (coral.syr.edu)
n-1<16688> ssi:boot:rsh: starting on n1 (coral.syr.edu): hboot -t -c
lam-conf.lamd -d -s -I "-H 128.230.37.237 -P 40976 -n 1 -o 0"
n-1<16688> ssi:boot:rsh: launching remotely
n-1<16688> ssi:boot:rsh: attempting to execute "ssh -x coral.syr.edu -n
echo $SHELL"
n-1<16688> ssi:boot:rsh: remote shell /bin/bash
n-1<16688> ssi:boot:rsh: attempting to execute "ssh -x coral.syr.edu -n
hboot -t -c lam-conf.lamd -d -s -I "-H 128.230.37.237 -P 40976 -n 1 -o
0""
tkill: setting prefix to (null)
tkill: setting suffix to (null)
tkill: got killname back: /tmp/lam-avdatey_at_[hidden]/lam-killfile
tkill: removing socket file ...
tkill: socket file: /tmp/lam-avdatey_at_[hidden]/lam-kernel-socketd
tkill: removing IO daemon socket file ...
tkill: IO daemon socket file:
/tmp/lam-avdatey_at_[hidden]/lam-io-socket
tkill: f_kill = "/tmp/lam-avdatey_at_[hidden]/lam-killfile"
tkill: nothing to kill: "/tmp/lam-avdatey_at_[hidden]/lam-killfile"
hboot: performing tkill
hboot: tkill -d
hboot: booting...
hboot: fork /usr/bin/lamd
[1] 8375 lamd -H 128.230.37.237 -P 40976 -n 1 -o 0 -d
n-1<16688> ssi:boot:rsh: successfully launched on n1 (coral.syr.edu)
n-1<16688> ssi:boot:base:server: expecting connection from finite list
n-1<16688> ssi:boot:base:server: got connection from 128.230.37.26
n-1<16688> ssi:boot:base:server: this connection is expected (n1)
n-1<16688> ssi:boot:base:server: remote lamd is at 128.230.37.26:32793
n-1<16688> ssi:boot:base:server: closing server socket
n-1<16688> ssi:boot:base:server: connecting to lamd at
128.230.37.237:40977
n-1<16688> ssi:boot:base:server: connected
n-1<16688> ssi:boot:base:server: sending number of links (2)
n-1<16688> ssi:boot:base:server: sending info: n0 (memory.syr.edu)
n-1<16688> ssi:boot:base:server: sending info: n1 (coral.syr.edu)
n-1<16688> ssi:boot:base:server: finished sending
n-1<16688> ssi:boot:base:server: disconnected from 128.230.37.237:40977
n-1<16688> ssi:boot:base:server: connecting to lamd at
128.230.37.26:33229
n-1<16691> ssi:boot:rsh: finalizing
n-1<16691> ssi:boot: Closing
-----------------------------------------------------------------------------
The lamboot agent failed to open a client socket to the newly-booted
process at IP address 128.230.37.26, port 33229.

Although the newly-booted process has already communicated
successfully with the lamboot agent over other TCP sockets, this is
the first time that the lamboot agent tried to initiate a connection
to the newly-booted process. As such, this may indicate:

        1. 128.230.37.26 is not the correct IP address for the machine
where the newly-booted machine was launched
        2. There are network filters between the lamboot agent host and
           the remote host such that communication on random TCP ports
           is blocked
        3. Network routing from the the local host to the remote isn't
           properly configured (this is unlikely)

For number 1, check to ensure that 128.230.37.26 is the correct IP
address for
that machine. If it is not, check the host mapping on that machine
(e.g., /etc/hosts) to ensure that 128.230.37.26 is both reachable and is
the by
the host where the lamboot agent is running, and is the correct host.

For numbers 2 and 4, try to telnet to 128.230.37.26, port 33229. You
should get a
"connection refused" error, which will indicate that you successfully
connected to some machine at that IP address, and no process was
listening on that port. If you get any other kind of error, check
with your system/network administrator -- it may indicate network /
routing issues between the two hosts.
-----------------------------------------------------------------------------
n-1<16688> ssi:boot:base:linear: aborted!
-----------------------------------------------------------------------------
lamboot encountered some error (see above) during the boot process,
and will now attempt to kill all nodes that it was previously able to
boot (if any).

Please wait for LAM to finish; if you interrupt this process, you may
have LAM daemons still running on remote nodes.
-----------------------------------------------------------------------------
n-1<16694> ssi:boot: Opening
n-1<16694> ssi:boot: opening module globus
n-1<16694> ssi:boot: initializing module globus
n-1<16694> ssi:boot:globus: globus-job-run not found, globus boot will
not run
n-1<16694> ssi:boot: module not available: globus
n-1<16694> ssi:boot: opening module rsh
n-1<16694> ssi:boot: initializing module rsh
n-1<16694> ssi:boot:rsh: module initializing
n-1<16694> ssi:boot:rsh:agent: ssh -x
n-1<16694> ssi:boot:rsh:username: <same>
n-1<16694> ssi:boot:rsh:verbose: 1000
n-1<16694> ssi:boot:rsh:algorithm: linear
n-1<16694> ssi:boot:rsh:priority: 10
n-1<16694> ssi:boot: module available: rsh, priority: 10
n-1<16694> ssi:boot: finalizing module globus
n-1<16694> ssi:boot:globus: finalizing
n-1<16694> ssi:boot: closing module globus
n-1<16694> ssi:boot: Selected boot module rsh
n-1<16694> ssi:boot:base: looking for boot schema in following
directories:
n-1<16694> ssi:boot:base: <current directory>
n-1<16694> ssi:boot:base: $TROLLIUSHOME/etc
n-1<16694> ssi:boot:base: $LAMHOME/etc
n-1<16694> ssi:boot:base: /usr/etc
n-1<16694> ssi:boot:base: looking for boot schema file:
n-1<16694> ssi:boot:base: lamhosts
n-1<16694> ssi:boot:base: found boot schema: lamhosts
n-1<16694> ssi:boot:rsh: found the following hosts:
n-1<16694> ssi:boot:rsh: n0 memory.syr.edu (cpu=1)
n-1<16694> ssi:boot:rsh: n1 coral.syr.edu (cpu=1)
n-1<16694> ssi:boot:rsh: resolved hosts:
n-1<16694> ssi:boot:rsh: n0 memory.syr.edu --> 128.230.37.237 (origin)
n-1<16694> ssi:boot:rsh: n1 coral.syr.edu --> 128.230.37.26
n-1<16694> ssi:boot:rsh: starting RTE procs
n-1<16694> ssi:boot:base:linear: starting
n-1<16694> ssi:boot:base:linear: booting n0 (memory.syr.edu)
n-1<16694> ssi:boot:rsh: starting wipe on (memory.syr.edu)
n-1<16694> ssi:boot:rsh: starting on n0 (memory.syr.edu): tkill -d
n-1<16694> ssi:boot:rsh: launching locally
tkill: setting prefix to (null)
tkill: setting suffix to (null)
tkill: got killname back: /tmp/lam-avdatey_at_[hidden]/lam-killfile
tkill: removing socket file ...
tkill: socket file: /tmp/lam-avdatey_at_[hidden]/lam-kernel-socketd
tkill: removing IO daemon socket file ...
tkill: IO daemon socket file:
/tmp/lam-avdatey_at_[hidden]/lam-io-socket
tkill: f_kill = "/tmp/lam-avdatey_at_[hidden]/lam-killfile"
tkill: killing LAM...
tkill: killing PID (SIGHUP) 16691 ...
tkill: killed
tkill: all finished
n-1<16694> ssi:boot:rsh: successfully launched on n0 (memory.syr.edu)
n-1<16694> ssi:boot:base:linear: booting n1 (coral.syr.edu)
n-1<16694> ssi:boot:rsh: starting wipe on (coral.syr.edu)
n-1<16694> ssi:boot:rsh: starting on n1 (coral.syr.edu): tkill -d
n-1<16694> ssi:boot:rsh: launching remotely
n-1<16694> ssi:boot:rsh: attempting to execute "ssh -x coral.syr.edu -n
echo $SHELL"
n-1<16694> ssi:boot:rsh: remote shell /bin/bash
n-1<16694> ssi:boot:rsh: attempting to execute "ssh -x coral.syr.edu -n
tkill -d"
tkill: setting prefix to (null)
tkill: setting suffix to (null)
tkill: got killname back: /tmp/lam-avdatey_at_[hidden]/lam-killfile
tkill: removing socket file ...
tkill: socket file: /tmp/lam-avdatey_at_[hidden]/lam-kernel-socketd
tkill: removing IO daemon socket file ...
tkill: IO daemon socket file:
/tmp/lam-avdatey_at_[hidden]/lam-io-socket
tkill: f_kill = "/tmp/lam-avdatey_at_[hidden]/lam-killfile"
tkill: killing LAM...
tkill: killing PID (SIGHUP) 8375 ...
tkill: killed
tkill: all finished
n-1<16694> ssi:boot:rsh: successfully launched on n1 (coral.syr.edu)
n-1<16694> ssi:boot:base:linear: finished
n-1<16694> ssi:boot:rsh: all RTE procs started
n-1<16694> ssi:boot:rsh: finalizing
n-1<16694> ssi:boot: Closing
lamboot did NOT complete successfully
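
In the debug output above, lamboot first accepts a connection from each new lamd and then opens its own connection back to each one; the second of those outgoing connections is the step that fails. As a minimal sketch, assuming the log has been saved as text, the Python below lists every outgoing attempt and whether a "connected" line followed it. The function name outgoing_attempts is hypothetical, the message strings are copied from the output above, and the pattern tolerates the line wrap between "at" and the address.

```python
import re

# "connecting to lamd at <ip>:<port>", allowing a newline before the address.
CONNECT_RE = re.compile(
    r"connecting to lamd at\s+(\d+\.\d+\.\d+\.\d+):(\d+)"
)
# Matches "server: connected" but not "server: disconnected from ...".
CONNECTED_RE = re.compile(r"server: connected\b")


def outgoing_attempts(log_text):
    """Return (ip, port, connected) for each outgoing lamboot connection."""
    matches = list(CONNECT_RE.finditer(log_text))
    results = []
    for i, m in enumerate(matches):
        # Only look at the log text up to the next connection attempt.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(log_text)
        segment = log_text[m.end():end]
        connected = CONNECTED_RE.search(segment) is not None
        results.append((m.group(1), int(m.group(2)), connected))
    return results
```

Run over the lamboot output in this post, it would report the attempt to 128.230.37.237:40977 as connected and the attempt to 128.230.37.26:33229 as not connected.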

Thanks,
Aditya