LAM/MPI General User's Mailing List Archives

From: Aditya Datey (avdatey_at_[hidden])
Date: 2005-02-13 13:05:27


Hi,

> - Did you confirm that LAM was getting the right IP address for coral?
Yes, in the sense that none of the messages show something like 127.0.0.1,
which I found was a common error in the list archives. All the messages
show the correct IP for that machine.
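
The same check can be done outside of LAM. The following is only a minimal sketch, assuming Python is available on the node; the helper names is_loopback and check_host are made up for illustration and are not part of LAM. It resolves a hostname with the system resolver and flags a loopback result, which is the 127.0.0.1 symptom mentioned above.

```python
import ipaddress
import socket


def is_loopback(ip: str) -> bool:
    """Return True if ip lies in a loopback range such as 127.0.0.0/8."""
    return ipaddress.ip_address(ip).is_loopback


def check_host(hostname: str) -> str:
    """Resolve hostname (IPv4) and report whether it maps to loopback."""
    ip = socket.gethostbyname(hostname)  # uses /etc/hosts, DNS, etc.
    status = "LOOPBACK (suspicious)" if is_loopback(ip) else "ok"
    return f"{hostname} --> {ip} [{status}]"
```

Running check_host("coral.syr.edu") on each node, including on coral itself, would show whether every machine agrees on the address; a node that resolves its own name to a loopback address is the case to look for.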

> - You might want to check with local system administrators to see if
> any firewalls are in place between the machines (e.g., at the router or
> switch level)
I checked with the friendly neighbourhood sysadmin, and was told that
there is nothing that would prevent opening of ssh on random sockets.

This is verified by the fact that I can boot LAM successfully on 4 of
the 10 machines I'm trying to get it working on.
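
To repeat the telnet test from the earlier message against each of the 10 machines, a small probe along these lines might help. It is a sketch under assumptions: the function name probe and its result labels are invented here, and the 3-second timeout is arbitrary. It attempts one TCP connection and reports whether it was accepted, refused, or never answered.

```python
import socket


def probe(host: str, port: int, timeout: float = 3.0) -> str:
    """Classify one TCP connection attempt to host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"       # a process is listening and accepted us
    except ConnectionRefusedError:
        return "refused"        # host answered, nothing listening on port
    except socket.timeout:
        return "timeout"        # no answer; packets may be dropped en route
    except OSError as exc:
        return f"error: {exc}"  # e.g. no route to host, lookup failure
```

For example, probe("128.230.37.26", 33229) corresponds to the telnet attempt quoted below. A "refused" result only shows that something at that address rejected the connection; a filter that actively rejects connections can produce the same result, so it does not by itself rule out a firewall.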
Now the 4 working machines are a heterogeneous mix in terms of kernel and
RH version, but all run LAM 7.0.6. None of the machines are older than RH8,
and most of them have the 2.4.22 Linux kernel.
When I compiled the kernels for the machines, it is possible that I
selected different options (to get the sound card working etc.) on
different machines.

** So one reason I can think of for why it works on some but not
all machines is that LAM needs something in the kernel that I did not
put in?
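
One way to test this idea is to compare the kernel build configuration of a working machine with that of a failing one. The sketch below assumes the .config file saved from each kernel build is still available; the function names are hypothetical, and nothing here implies that LAM depends on any particular option. It parses the CONFIG_ lines and lists the options that differ.

```python
def parse_kernel_config(text: str) -> dict:
    """Map option name -> value for every CONFIG_ line set in a .config."""
    options = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blanks and comments, including "# CONFIG_FOO is not set".
        if not line or line.startswith("#"):
            continue
        name, sep, value = line.partition("=")
        if sep and name.startswith("CONFIG_"):
            options[name] = value
    return options


def diff_configs(working: dict, failing: dict) -> dict:
    """Return differing options as name -> (working value, failing value)."""
    names = sorted(set(working) | set(failing))
    return {
        n: (working.get(n), failing.get(n))
        for n in names
        if working.get(n) != failing.get(n)
    }
```

Feeding it the contents of the two .config files gives a short list to go through; networking and packet-filtering options that differ would be the first candidates to look at.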

Thanks,
Aditya

On Sat, 2005-02-12 at 08:54, Jeff Squyres wrote:
> This is quite an odd (but not unheard of) error. What is happening is:
>
> Time  Memory                          Coral
> 0     ssh coral
> 1                                     hboot executes
> 2                                     lamd executes
> 3                                     lamd opens socket to lamboot
> 4     lamboot and lamd exchange information, close socket
> 5     lamboot tries to open socket to lamd
>
> It is #5 that fails, so it's odd that #4 succeeds (i.e., we can open a
> socket one way but can't open a socket the other way).
>
> - Did you confirm that LAM was getting the right IP address for coral?
> - You might want to check with local system administrators to see if
> any firewalls are in place between the machines (e.g., at the router or
> switch level)
>
>
> On Feb 11, 2005, at 12:51 PM, Aditya Datey wrote:
>
> > Hi,
> >
> > This is a "lamboot failed to open a socket at the newly booted process"
> > error. Hope someone can help!
> >
> > Machine configs:
> > RH 8, p4s.
> > LAM 7.0.6 on all nodes.
> >
> > recon finished with the woo hoo! msg.
> >
> > Lamboot o/p below.
> > Among the things suggested:
> > 1. is not the problem.
> > 2,3 are probably not the problems, since I got the connection refused
> > error as directed.
> >
> > The two machines in question are not physically located in the same
> > room. Lamboot successfully booted machines in the same room (connected
> > via standard ethernet hub). This & some previous posts lead me to
> > believe it might be a firewall problem.
> > But there's no firewall between the machines (no iptables running).
> >
> > * How does one check if any other firewall exists?
> >
> > * What could be the problem here?
> >
> >
> >
> > lamhosts file
> > =============
> > memory.syr.edu (I'm using this as the source node)
> > coral.syr.edu
> >
> >
> >
> > Telnet (26=coral)
> > ==============
> > telnet 128.230.37.26 33229
> > Trying 128.230.37.26...
> > telnet: Unable to connect to remote host: Connection refused
> >
> > I can ssh to the remote node (coral) without a password.
> >
> > I also manually ran the hboot command on the remote node. Here's the
> > o/p:
> > Manually running hboot
> > =======================
> >
> > hboot -t -c lam-conf.lamd -d -s -I "-H 128.230.37.237 -P 40976 -n 1 -o
> > 0"
> > hboot: performing tkill
> > hboot: tkill -d
> > tkill: setting prefix to (null)
> > tkill: setting suffix to (null)
> > tkill: got killname back: /tmp/lam-avdatey_at_[hidden]/lam-killfile
> > tkill: removing socket file ...
> > tkill: socket file: /tmp/lam-avdatey_at_[hidden]/lam-kernel-socketd
> > tkill: removing IO daemon socket file ...
> > tkill: IO daemon socket file:
> > /tmp/lam-avdatey_at_[hidden]/lam-io-socket
> > tkill: f_kill = "/tmp/lam-avdatey_at_[hidden]/lam-killfile"
> > tkill: nothing to kill: "/tmp/lam-avdatey_at_[hidden]/lam-killfile"
> > hboot: booting...
> > hboot: fork /usr/bin/lamd
> > [1] 8658 lamd -H 128.230.37.237 -P 40976 -n 1 -o 0 -d
> >
> >
> > But LAM does not show as running even after this,
> > i.e., ps aux | grep lam shows no o/p.
> >
> >
> > lamboot output
> > ===============
> > lamboot -d lamhosts
> > n-1<16688> ssi:boot: Opening
> > n-1<16688> ssi:boot: opening module globus
> > n-1<16688> ssi:boot: initializing module globus
> > n-1<16688> ssi:boot:globus: globus-job-run not found, globus boot will
> > not run
> > n-1<16688> ssi:boot: module not available: globus
> > n-1<16688> ssi:boot: opening module rsh
> > n-1<16688> ssi:boot: initializing module rsh
> > n-1<16688> ssi:boot:rsh: module initializing
> > n-1<16688> ssi:boot:rsh:agent: ssh -x
> > n-1<16688> ssi:boot:rsh:username: <same>
> > n-1<16688> ssi:boot:rsh:verbose: 1000
> > n-1<16688> ssi:boot:rsh:algorithm: linear
> > n-1<16688> ssi:boot:rsh:priority: 10
> > n-1<16688> ssi:boot: module available: rsh, priority: 10
> > n-1<16688> ssi:boot: finalizing module globus
> > n-1<16688> ssi:boot:globus: finalizing
> > n-1<16688> ssi:boot: closing module globus
> > n-1<16688> ssi:boot: Selected boot module rsh
> >
> > LAM 7.0.6/MPI 2 C++/ROMIO - Indiana University
> >
> > n-1<16688> ssi:boot:base: looking for boot schema in following
> > directories:
> > n-1<16688> ssi:boot:base: <current directory>
> > n-1<16688> ssi:boot:base: $TROLLIUSHOME/etc
> > n-1<16688> ssi:boot:base: $LAMHOME/etc
> > n-1<16688> ssi:boot:base: /usr/etc
> > n-1<16688> ssi:boot:base: looking for boot schema file:
> > n-1<16688> ssi:boot:base: lamhosts
> > n-1<16688> ssi:boot:base: found boot schema: lamhosts
> > n-1<16688> ssi:boot:rsh: found the following hosts:
> > n-1<16688> ssi:boot:rsh: n0 memory.syr.edu (cpu=1)
> > n-1<16688> ssi:boot:rsh: n1 coral.syr.edu (cpu=1)
> > n-1<16688> ssi:boot:rsh: resolved hosts:
> > n-1<16688> ssi:boot:rsh: n0 memory.syr.edu --> 128.230.37.237
> > (origin)
> > n-1<16688> ssi:boot:rsh: n1 coral.syr.edu --> 128.230.37.26
> > n-1<16688> ssi:boot:rsh: starting RTE procs
> > n-1<16688> ssi:boot:base:linear: starting
> > n-1<16688> ssi:boot:base:server: opening server TCP socket
> > n-1<16688> ssi:boot:base:server: opened port 40976
> > n-1<16688> ssi:boot:base:linear: booting n0 (memory.syr.edu)
> > n-1<16688> ssi:boot:rsh: starting lamd on (memory.syr.edu)
> > n-1<16688> ssi:boot:rsh: starting on n0 (memory.syr.edu): hboot -t -c
> > lam-conf.lamd -d -I -H 128.230.37.237 -P 40976 -n 0 -o 0
> > n-1<16688> ssi:boot:rsh: launching locally
> > hboot: performing tkill
> > hboot: tkill -d
> > tkill: setting prefix to (null)
> > tkill: setting suffix to (null)
> > tkill: got killname back: /tmp/lam-avdatey_at_[hidden]/lam-killfile
> > tkill: removing socket file ...
> > tkill: socket file: /tmp/lam-avdatey_at_[hidden]/lam-kernel-socketd
> > tkill: removing IO daemon socket file ...
> > tkill: IO daemon socket file:
> > /tmp/lam-avdatey_at_[hidden]/lam-io-socket
> > tkill: f_kill = "/tmp/lam-avdatey_at_[hidden]/lam-killfile"
> > tkill: nothing to kill: "/tmp/lam-avdatey_at_[hidden]/lam-killfile"
> > hboot: booting...
> > hboot: fork /usr/bin/lamd
> > [1] 16691 lamd -H 128.230.37.237 -P 40976 -n 0 -o 0 -d
> > hboot: attempting to execute
> > n-1<16691> ssi:boot: Opening
> > n-1<16691> ssi:boot: opening module globus
> > n-1<16691> ssi:boot: initializing module globus
> > n-1<16691> ssi:boot:globus: globus-job-run not found, globus boot will
> > not run
> > n-1<16691> ssi:boot: module not available: globus
> > n-1<16691> ssi:boot: opening module rsh
> > n-1<16691> ssi:boot: initializing module rsh
> > n-1<16691> ssi:boot:rsh: module initializing
> > n-1<16691> ssi:boot:rsh:agent: ssh -x
> > n-1<16691> ssi:boot:rsh:username: <same>
> > n-1<16691> ssi:boot:rsh:verbose: 1000
> > n-1<16691> ssi:boot:rsh:algorithm: linear
> > n-1<16691> ssi:boot:rsh:priority: 10
> > n-1<16691> ssi:boot: module available: rsh, priority: 10
> > n-1<16691> ssi:boot: finalizing module globus
> > n-1<16691> ssi:boot:globus: finalizing
> > n-1<16691> ssi:boot: closing module globus
> > n-1<16691> ssi:boot: Selected boot module rsh
> > n-1<16688> ssi:boot:rsh: successfully launched on n0 (memory.syr.edu)
> > n-1<16688> ssi:boot:base:server: expecting connection from finite list
> > n-1<16688> ssi:boot:base:server: got connection from 128.230.37.237
> > n-1<16688> ssi:boot:base:server: this connection is expected (n0)
> > n-1<16688> ssi:boot:base:server: remote lamd is at 128.230.37.237:33050
> > n-1<16688> ssi:boot:base:linear: booting n1 (coral.syr.edu)
> > n-1<16688> ssi:boot:rsh: starting lamd on (coral.syr.edu)
> > n-1<16688> ssi:boot:rsh: starting on n1 (coral.syr.edu): hboot -t -c
> > lam-conf.lamd -d -s -I "-H 128.230.37.237 -P 40976 -n 1 -o 0"
> > n-1<16688> ssi:boot:rsh: launching remotely
> > n-1<16688> ssi:boot:rsh: attempting to execute "ssh -x coral.syr.edu -n
> > echo $SHELL"
> > n-1<16688> ssi:boot:rsh: remote shell /bin/bash
> > n-1<16688> ssi:boot:rsh: attempting to execute "ssh -x coral.syr.edu -n
> > hboot -t -c lam-conf.lamd -d -s -I "-H 128.230.37.237 -P 40976 -n 1 -o
> > 0""
> > tkill: setting prefix to (null)
> > tkill: setting suffix to (null)
> > tkill: got killname back: /tmp/lam-avdatey_at_[hidden]/lam-killfile
> > tkill: removing socket file ...
> > tkill: socket file: /tmp/lam-avdatey_at_[hidden]/lam-kernel-socketd
> > tkill: removing IO daemon socket file ...
> > tkill: IO daemon socket file:
> > /tmp/lam-avdatey_at_[hidden]/lam-io-socket
> > tkill: f_kill = "/tmp/lam-avdatey_at_[hidden]/lam-killfile"
> > tkill: nothing to kill: "/tmp/lam-avdatey_at_[hidden]/lam-killfile"
> > hboot: performing tkill
> > hboot: tkill -d
> > hboot: booting...
> > hboot: fork /usr/bin/lamd
> > [1] 8375 lamd -H 128.230.37.237 -P 40976 -n 1 -o 0 -d
> > n-1<16688> ssi:boot:rsh: successfully launched on n1 (coral.syr.edu)
> > n-1<16688> ssi:boot:base:server: expecting connection from finite list
> > n-1<16688> ssi:boot:base:server: got connection from 128.230.37.26
> > n-1<16688> ssi:boot:base:server: this connection is expected (n1)
> > n-1<16688> ssi:boot:base:server: remote lamd is at 128.230.37.26:32793
> > n-1<16688> ssi:boot:base:server: closing server socket
> > n-1<16688> ssi:boot:base:server: connecting to lamd at
> > 128.230.37.237:40977
> > n-1<16688> ssi:boot:base:server: connected
> > n-1<16688> ssi:boot:base:server: sending number of links (2)
> > n-1<16688> ssi:boot:base:server: sending info: n0 (memory.syr.edu)
> > n-1<16688> ssi:boot:base:server: sending info: n1 (coral.syr.edu)
> > n-1<16688> ssi:boot:base:server: finished sending
> > n-1<16688> ssi:boot:base:server: disconnected from 128.230.37.237:40977
> > n-1<16688> ssi:boot:base:server: connecting to lamd at
> > 128.230.37.26:33229
> > n-1<16691> ssi:boot:rsh: finalizing
> > n-1<16691> ssi:boot: Closing
> > -----------------------------------------------------------------------
> > ------
> > The lamboot agent failed to open a client socket to the newly-booted
> > process at IP address 128.230.37.26, port 33229.
> >
> > Although the newly-booted process has already communicated
> > successfully with the lamboot agent over other TCP sockets, this is
> > the first time that the lamboot agent tried to initiate a connection
> > to the newly-booted process. As such, this may indicate:
> >
> > 1. 128.230.37.26 is not the correct IP address for the machine
> > where the newly-booted machine was launched
> > 2. There are network filters between the lamboot agent host and
> > the remote host such that communication on random TCP ports
> > is blocked
> > 3. Network routing from the the local host to the remote isn't
> > properly configured (this is unlikely)
> >
> > For number 1, check to ensure that 128.230.37.26 is the correct IP
> > address for
> > that machine. If it is not, check the host mapping on that machine
> > (e.g., /etc/hosts) to ensure that 128.230.37.26 is both reachable and
> > is
> > the by
> > the host where the lamboot agent is running, and is the correct host.
> >
> > For numbers 2 and 4, try to telnet to 128.230.37.26, port 33229. You
> > should get a
> > "connection refused" error, which will indicate that you successfully
> > connected to some machine at that IP address, and no process was
> > listening on that port. If you get any other kind of error, check
> > with your system/network administrator -- it may indicate network /
> > routing issues between the two hosts.
> > -----------------------------------------------------------------------
> > ------
> > n-1<16688> ssi:boot:base:linear: aborted!
> > -----------------------------------------------------------------------
> > ------
> > lamboot encountered some error (see above) during the boot process,
> > and will now attempt to kill all nodes that it was previously able to
> > boot (if any).
> >
> > Please wait for LAM to finish; if you interrupt this process, you may
> > have LAM daemons still running on remote nodes.
> > -----------------------------------------------------------------------
> > ------
> > n-1<16694> ssi:boot: Opening
> > n-1<16694> ssi:boot: opening module globus
> > n-1<16694> ssi:boot: initializing module globus
> > n-1<16694> ssi:boot:globus: globus-job-run not found, globus boot will
> > not run
> > n-1<16694> ssi:boot: module not available: globus
> > n-1<16694> ssi:boot: opening module rsh
> > n-1<16694> ssi:boot: initializing module rsh
> > n-1<16694> ssi:boot:rsh: module initializing
> > n-1<16694> ssi:boot:rsh:agent: ssh -x
> > n-1<16694> ssi:boot:rsh:username: <same>
> > n-1<16694> ssi:boot:rsh:verbose: 1000
> > n-1<16694> ssi:boot:rsh:algorithm: linear
> > n-1<16694> ssi:boot:rsh:priority: 10
> > n-1<16694> ssi:boot: module available: rsh, priority: 10
> > n-1<16694> ssi:boot: finalizing module globus
> > n-1<16694> ssi:boot:globus: finalizing
> > n-1<16694> ssi:boot: closing module globus
> > n-1<16694> ssi:boot: Selected boot module rsh
> > n-1<16694> ssi:boot:base: looking for boot schema in following
> > directories:
> > n-1<16694> ssi:boot:base: <current directory>
> > n-1<16694> ssi:boot:base: $TROLLIUSHOME/etc
> > n-1<16694> ssi:boot:base: $LAMHOME/etc
> > n-1<16694> ssi:boot:base: /usr/etc
> > n-1<16694> ssi:boot:base: looking for boot schema file:
> > n-1<16694> ssi:boot:base: lamhosts
> > n-1<16694> ssi:boot:base: found boot schema: lamhosts
> > n-1<16694> ssi:boot:rsh: found the following hosts:
> > n-1<16694> ssi:boot:rsh: n0 memory.syr.edu (cpu=1)
> > n-1<16694> ssi:boot:rsh: n1 coral.syr.edu (cpu=1)
> > n-1<16694> ssi:boot:rsh: resolved hosts:
> > n-1<16694> ssi:boot:rsh: n0 memory.syr.edu --> 128.230.37.237
> > (origin)
> > n-1<16694> ssi:boot:rsh: n1 coral.syr.edu --> 128.230.37.26
> > n-1<16694> ssi:boot:rsh: starting RTE procs
> > n-1<16694> ssi:boot:base:linear: starting
> > n-1<16694> ssi:boot:base:linear: booting n0 (memory.syr.edu)
> > n-1<16694> ssi:boot:rsh: starting wipe on (memory.syr.edu)
> > n-1<16694> ssi:boot:rsh: starting on n0 (memory.syr.edu): tkill -d
> > n-1<16694> ssi:boot:rsh: launching locally
> > tkill: setting prefix to (null)
> > tkill: setting suffix to (null)
> > tkill: got killname back: /tmp/lam-avdatey_at_[hidden]/lam-killfile
> > tkill: removing socket file ...
> > tkill: socket file: /tmp/lam-avdatey_at_[hidden]/lam-kernel-socketd
> > tkill: removing IO daemon socket file ...
> > tkill: IO daemon socket file:
> > /tmp/lam-avdatey_at_[hidden]/lam-io-socket
> > tkill: f_kill = "/tmp/lam-avdatey_at_[hidden]/lam-killfile"
> > tkill: killing LAM...
> > tkill: killing PID (SIGHUP) 16691 ...
> > tkill: killed
> > tkill: all finished
> > n-1<16694> ssi:boot:rsh: successfully launched on n0 (memory.syr.edu)
> > n-1<16694> ssi:boot:base:linear: booting n1 (coral.syr.edu)
> > n-1<16694> ssi:boot:rsh: starting wipe on (coral.syr.edu)
> > n-1<16694> ssi:boot:rsh: starting on n1 (coral.syr.edu): tkill -d
> > n-1<16694> ssi:boot:rsh: launching remotely
> > n-1<16694> ssi:boot:rsh: attempting to execute "ssh -x coral.syr.edu -n
> > echo $SHELL"
> > n-1<16694> ssi:boot:rsh: remote shell /bin/bash
> > n-1<16694> ssi:boot:rsh: attempting to execute "ssh -x coral.syr.edu -n
> > tkill -d"
> > tkill: setting prefix to (null)
> > tkill: setting suffix to (null)
> > tkill: got killname back: /tmp/lam-avdatey_at_[hidden]/lam-killfile
> > tkill: removing socket file ...
> > tkill: socket file: /tmp/lam-avdatey_at_[hidden]/lam-kernel-socketd
> > tkill: removing IO daemon socket file ...
> > tkill: IO daemon socket file:
> > /tmp/lam-avdatey_at_[hidden]/lam-io-socket
> > tkill: f_kill = "/tmp/lam-avdatey_at_[hidden]/lam-killfile"
> > tkill: killing LAM...
> > tkill: killing PID (SIGHUP) 8375 ...
> > tkill: killed
> > tkill: all finished
> > n-1<16694> ssi:boot:rsh: successfully launched on n1 (coral.syr.edu)
> > n-1<16694> ssi:boot:base:linear: finished
> > n-1<16694> ssi:boot:rsh: all RTE procs started
> > n-1<16694> ssi:boot:rsh: finalizing
> > n-1<16694> ssi:boot: Closing
> > lamboot did NOT complete successfully
> >
> >
> > Thanks,
> > Aditya
> >
> >
> >
> >
> >
> > _______________________________________________
> > This list is archived at http://www.lam-mpi.org/MailArchives/lam/
> >
>
> --
> {+} Jeff Squyres
> {+} jsquyres_at_[hidden]
> {+} http://www.lam-mpi.org/
>