
LAM/MPI General User's Mailing List Archives


From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2006-02-03 03:11:09


Do you have multiple TCP NICs in xeon3?

It looks like the node you lambooted from (xeon2) successfully
launched the right commands on xeon3 and then waited for the
newly-started processes to open a TCP socket back to it. However,
the connection came from an unexpected IP address (192.168.0.39
instead of 192.168.0.38). So lamboot dropped the connection and
eventually timed out because it never got the connection that it
expected.
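
To make the failure mode concrete, here is a small Python model of the
check described above. It is only an illustration, not LAM's actual
code: the function name match_expected is made up, and the addresses
are the ones from the lamboot -d output quoted below.

```python
# Illustrative model only; this is not how LAM itself is implemented.
def match_expected(expected, source_ip):
    """Return the node label whose address equals source_ip, or None."""
    for node, ip in expected.items():
        if ip == source_ip:
            return node
    return None


# lamboot resolved xeon3 (n1) to this address, so it is the only
# source address it will accept for the n1 callback.
expected = {"n1": "192.168.0.38"}

for source_ip in ("192.168.0.38", "192.168.0.39"):
    node = match_expected(expected, source_ip)
    if node is None:
        print(f"{source_ip}: unexpected connection; dropping")
    else:
        print(f"{source_ip}: this connection is expected ({node})")
```

The callback from 192.168.0.39 matches nothing in the list, so it is
dropped, and the wait for 192.168.0.38 runs into the timeout.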

If you have multiple NICs in xeon3, there are several ways to solve
this problem. If all your networks are distinct, though, the easiest
solution by far is to use only the hostnames/IP addresses that belong
to a single network.
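
One rough way to see which source address a node will use when it
calls back is to ask the kernel directly. The Python sketch below is
an assumption-laden helper, not part of LAM: source_address_for is a
made-up name, the port number is arbitrary, and it relies on the
usual behavior that connect() on a UDP socket picks a route without
sending any packets. The result should normally match the address a
TCP connection to the same destination would come from.

```python
import socket


def source_address_for(dest_ip, port=9):
    """Return the local IPv4 address the OS would use to reach dest_ip.

    connect() on a UDP socket sends no packets; it only makes the kernel
    choose a route and, with it, a local source address.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        s.connect((dest_ip, port))
        return s.getsockname()[0]


if __name__ == "__main__":
    # Loopback keeps the example runnable anywhere. On xeon3 one would
    # pass 192.168.0.28 (xeon2's address in the log) instead.
    print(source_address_for("127.0.0.1"))
```

If this reports 192.168.0.39 on xeon3 while the boot schema resolves
xeon3 to 192.168.0.38, that is the mismatch lamboot is rejecting.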

Let me know if you can't do that, and we can explore other solutions.

On Feb 2, 2006, at 10:21 AM, Javier Martínez de Pisón Ascacíbar wrote:

> Hi, LAM community
>
> I have a lamboot problem with LAM v7.0.4 (ABAQUS 6.5.1 needs this
> version) on 3 machines (xeon2, xeon3, apidell) running SUSE 9.3.
>
> I think LAM is correctly installed on all 3 machines.
>
> Access using "rsh" works in every direction without prompting for a
> password.
>
> Firewalls (iptables) are stopped.
>
> But "lamboot" doesn't start. What can I do? What is happening?
>
> For example, I have tried "telnet 192.168.0.28 4271" from xeon3 and it
> seems to work fine: I got a "connection refused" error.
>
> This is my lamboot report
>
>> lamboot -v hostpar -d
> n0<19818> ssi:boot: Opening
> n0<19818> ssi:boot: opening module globus
> n0<19818> ssi:boot: initializing module globus
> n0<19818> ssi:boot:globus: globus-job-run not found, globus boot will
> not run
> n0<19818> ssi:boot: module not available: globus
> n0<19818> ssi:boot: opening module rsh
> n0<19818> ssi:boot: initializing module rsh
> n0<19818> ssi:boot:rsh: module initializing
> n0<19818> ssi:boot:rsh:agent: rsh
> n0<19818> ssi:boot:rsh:username: <same>
> n0<19818> ssi:boot:rsh:verbose: 1000
> n0<19818> ssi:boot:rsh:algorithm: linear
> n0<19818> ssi:boot:rsh:priority: 10
> n0<19818> ssi:boot: module available: rsh, priority: 10
> n0<19818> ssi:boot: finalizing module globus
> n0<19818> ssi:boot:globus: finalizing
> n0<19818> ssi:boot: closing module globus
> n0<19818> ssi:boot: Selected boot module rsh
>
> LAM 7.0.4/MPI 2 C++/ROMIO - Indiana University
>
> n0<19818> ssi:boot:base: looking for boot schema in following
> directories:
> n0<19818> ssi:boot:base: <current directory>
> n0<19818> ssi:boot:base: $TROLLIUSHOME/etc
> n0<19818> ssi:boot:base: $LAMHOME/etc
> n0<19818> ssi:boot:base: /opt/lam/etc
> n0<19818> ssi:boot:base: looking for boot schema file:
> n0<19818> ssi:boot:base: hostpar
> n0<19818> ssi:boot:base: found boot schema: hostpar
> n0<19818> ssi:boot:rsh: found the following hosts:
> n0<19818> ssi:boot:rsh: n0 xeon2 (cpu=2)
> n0<19818> ssi:boot:rsh: n1 xeon3 (cpu=2)
> n0<19818> ssi:boot:rsh: n2 apidell (cpu=2)
> n0<19818> ssi:boot:rsh: resolved hosts:
> n0<19818> ssi:boot:rsh: n0 xeon2 --> 192.168.0.28 (origin)
> n0<19818> ssi:boot:rsh: n1 xeon3 --> 192.168.0.38
> n0<19818> ssi:boot:rsh: n2 apidell --> 192.168.0.220
> n0<19818> ssi:boot:rsh: starting RTE procs
> n0<19818> ssi:boot:base:linear: starting
> n0<19818> ssi:boot:base:server: opening server TCP socket
> n0<19818> ssi:boot:base:server: opened port 4271
> n0<19818> ssi:boot:base:linear: booting n0 (xeon2)
> n0<19818> ssi:boot:rsh: starting lamd on (xeon2)
> n0<19818> ssi:boot:rsh: starting on n0 (xeon2): hboot -t -c
> lam-conf.lamd -d -v -I -H 192.168.0.28 -P 4271 -n 0 -o 0
> n0<19818> ssi:boot:rsh: launching locally
> hboot: performing tkill
> hboot: tkill -d
> tkill: setting prefix to (null)
> tkill: setting suffix to (null)
> tkill: got killname back: /tmp/lam-abaquspar_at_xeon2/lam-killfile
> tkill: removing socket file ...
> tkill: socket file: /tmp/lam-abaquspar_at_xeon2/lam-kernel-socketd
> tkill: removing IO daemon socket file ...
> tkill: IO daemon socket file: /tmp/lam-abaquspar_at_xeon2/lam-io-socket
> tkill: f_kill = "/tmp/lam-abaquspar_at_xeon2/lam-killfile"
> tkill: nothing to kill: "/tmp/lam-abaquspar_at_xeon2/lam-killfile"
> hboot: booting...
> hboot: fork /opt/lam/bin/lamd
> hboot: attempting to execute
> [1] 19821 lamd -H 192.168.0.28 -P 4271 -n 0 -o 0 -d
> n0<19818> ssi:boot:rsh: successfully launched on n0 (xeon2)
> n0<19818> ssi:boot:base:server: expecting connection from finite list
> n-1<19821> ssi:boot: Opening
> n-1<19821> ssi:boot: opening module globus
> n-1<19821> ssi:boot: initializing module globus
> n-1<19821> ssi:boot:globus: globus-job-run not found, globus boot will
> not run
> n-1<19821> ssi:boot: module not available: globus
> n-1<19821> ssi:boot: opening module rsh
> n-1<19821> ssi:boot: initializing module rsh
> n-1<19821> ssi:boot:rsh: module initializing
> n-1<19821> ssi:boot:rsh:agent: rsh
> n-1<19821> ssi:boot:rsh:username: <same>
> n-1<19821> ssi:boot:rsh:verbose: 1000
> n-1<19821> ssi:boot:rsh:algorithm: linear
> n-1<19821> ssi:boot:rsh:priority: 10
> n-1<19821> ssi:boot: module available: rsh, priority: 10
> n-1<19821> ssi:boot: finalizing module globus
> n-1<19821> ssi:boot:globus: finalizing
> n-1<19821> ssi:boot: closing module globus
> n-1<19821> ssi:boot: Selected boot module rsh
> n0<19818> ssi:boot:base:server: got connection from 192.168.0.28
> n0<19818> ssi:boot:base:server: this connection is expected (n0)
> n0<19818> ssi:boot:base:server: remote lamd is at 192.168.0.28:10919
> n0<19818> ssi:boot:base:linear: booting n1 (xeon3)
> n0<19818> ssi:boot:rsh: starting lamd on (xeon3)
> n0<19818> ssi:boot:rsh: starting on n1 (xeon3): hboot -t -c
> lam-conf.lamd -d -v -s -I "-H 193.146.235.28 -P 4271 -n 1 -o 0"
> n0<19818> ssi:boot:rsh: launching remotely
> n0<19818> ssi:boot:rsh: attempting to execute "rsh xeon3 -n echo
> $SHELL"
> n0<19818> ssi:boot:rsh: remote shell /bin/bash
> n0<19818> ssi:boot:rsh: attempting to execute "rsh xeon3 -n hboot -t -c
> lam-conf.lamd -d -v -s -I "-H 192.168.0.28 -P 4271 -n 1 -o 0""
> tkill: setting prefix to (null)
> tkill: setting suffix to (null)
> tkill: got killname back: /tmp/lam-abaquspar_at_xeon3/lam-killfile
> tkill: removing socket file ...
> tkill: socket file: /tmp/lam-abaquspar_at_xeon3/lam-kernel-socketd
> tkill: removing IO daemon socket file ...
> tkill: IO daemon socket file: /tmp/lam-abaquspar_at_xeon3/lam-io-socket
> tkill: f_kill = "/tmp/lam-abaquspar_at_xeon3/lam-killfile"
> tkill: nothing to kill: "/tmp/lam-abaquspar_at_xeon3/lam-killfile"
> hboot: performing tkill
> hboot: tkill -d
> hboot: booting...
> hboot: fork /opt/lam/bin/lamd
> [1] 22404 lamd -H 192.168.0.28 -P 4271 -n 1 -o 0 -d
> n0<19818> ssi:boot:rsh: successfully launched on n1 (xeon3)
> n0<19818> ssi:boot:base:server: expecting connection from finite list
> n0<19818> ssi:boot:base:server: got connection from 192.168.0.39
> n0<19818> ssi:boot:base:server: unexpected connection; dropping
> n0<19818> ssi:boot:base:server: got connection from 192.168.0.39
> ----------------------------------------------------------------------
> -------
> The lamboot agent timed out while waiting for the newly-booted process
> to call back and indicated that it had successfully booted.
>
> As far as LAM could tell, the remote process started properly, but
> then never called back. Possible reasons that this may happen:
>
> - There are network filters between the lamboot agent host and
> the remote host such that communication on random TCP ports
> is blocked
> - Network routing from the remote host to the local host isn't
> properly configured (this is uncommon)
>
> You can check these things by watching the output from "lamboot -d".
>
> 1. On the command line for hboot, there are two important parameters:
> one is the IP address of where the lamboot agent was invoked, the
> other is the port number that the lamboot agent is expecting the
> newly-booted process to call back on (this will be a random
> integer).
>
> 2. Manually login to the remote machine and try to telnet to the port
> indicated on the hboot command line. For example,
> telnet <ipnumber> <portnumber>
> If all goes well, you should get a "Connection refused" error. If
> you get any other kind of error, it could indicate either of the
> two conditions above. Consult with your system/network
> administrator.
> ----------------------------------------------------------------------
> -------
> n0<19818> ssi:boot:base:server: failed to connect to remote lamd!
> n0<19818> ssi:boot:base:server: closing server socket
> n0<19818> ssi:boot:base:linear: aborted!
> ----------------------------------------------------------------------
> -------
> lamboot encountered some error (see above) during the boot process,
> and will now attempt to kill all nodes that it was previously able to
> boot (if any).
>
> Please wait for LAM to finish; if you interrupt this process, you may
> have LAM daemons still running on remote nodes.
> ----------------------------------------------------------------------
> -------
> n0<19824> ssi:boot: Opening
> n0<19824> ssi:boot: opening module globus
> n0<19824> ssi:boot: initializing module globus
> n0<19824> ssi:boot:globus: globus-job-run not found, globus boot will
> not run
> n0<19824> ssi:boot: module not available: globus
> n0<19824> ssi:boot: opening module rsh
> n0<19824> ssi:boot: initializing module rsh
> n0<19824> ssi:boot:rsh: module initializing
> n0<19824> ssi:boot:rsh:agent: rsh
> n0<19824> ssi:boot:rsh:username: <same>
> n0<19824> ssi:boot:rsh:verbose: 1000
> n0<19824> ssi:boot:rsh:algorithm: linear
> n0<19824> ssi:boot:rsh:priority: 10
> n0<19824> ssi:boot: module available: rsh, priority: 10
> n0<19824> ssi:boot: finalizing module globus
> n0<19824> ssi:boot:globus: finalizing
> n0<19824> ssi:boot: closing module globus
> n0<19824> ssi:boot: Selected boot module rsh
> n0<19824> ssi:boot:base: looking for boot schema in following
> directories:
> n0<19824> ssi:boot:base: <current directory>
> n0<19824> ssi:boot:base: $TROLLIUSHOME/etc
> n0<19824> ssi:boot:base: $LAMHOME/etc
> n0<19824> ssi:boot:base: /opt/lam/etc
> n0<19824> ssi:boot:base: looking for boot schema file:
> n0<19824> ssi:boot:base: hostpar
> n0<19824> ssi:boot:base: found boot schema: hostpar
> n0<19824> ssi:boot:rsh: found the following hosts:
> n0<19824> ssi:boot:rsh: n0 xeon2 (cpu=2)
> n0<19824> ssi:boot:rsh: n1 xeon3 (cpu=2)
> n0<19824> ssi:boot:rsh: n2 apidell (cpu=2)
> n0<19824> ssi:boot:rsh: resolved hosts:
> n0<19824> ssi:boot:rsh: n0 xeon2 --> 192.168.0.28 (origin)
> n0<19824> ssi:boot:rsh: n1 xeon3 --> 192.168.0.38
> n0<19824> ssi:boot:rsh: n2 apidell --> 192.168.0.220
> n0<19824> ssi:boot:rsh: starting RTE procs
> n0<19824> ssi:boot:base:linear: starting
> n0<19824> ssi:boot:base:linear: booting n0 (xeon2)
> n0<19824> ssi:boot:rsh: starting wipe on (xeon2)
> n0<19824> ssi:boot:rsh: starting on n0 (xeon2): tkill -d -v
> n0<19824> ssi:boot:rsh: launching locally
> tkill: setting prefix to (null)
> tkill: setting suffix to (null)
> tkill: got killname back: /tmp/lam-abaquspar_at_xeon2/lam-killfile
> tkill: removing socket file ...
> tkill: socket file: /tmp/lam-abaquspar_at_xeon2/lam-kernel-socketd
> tkill: removing IO daemon socket file ...
> tkill: IO daemon socket file: /tmp/lam-abaquspar_at_xeon2/lam-io-socket
> tkill: f_kill = "/tmp/lam-abaquspar_at_xeon2/lam-killfile"
> tkill: killing LAM...
> tkill: killing PID (SIGHUP) 19821 ...
> tkill: killed
> tkill: all finished
> n0<19824> ssi:boot:rsh: successfully launched on n0 (xeon2)
> n0<19824> ssi:boot:base:linear: booting n1 (xeon3)
> n0<19824> ssi:boot:rsh: starting wipe on (xeon3)
> n0<19824> ssi:boot:rsh: starting on n1 (xeon3): tkill -d -v
> n0<19824> ssi:boot:rsh: launching remotely
> n0<19824> ssi:boot:rsh: attempting to execute "rsh xeon3 -n echo
> $SHELL"
> n0<19824> ssi:boot:rsh: remote shell /bin/bash
> n0<19824> ssi:boot:rsh: attempting to execute "rsh xeon3 -n tkill
> -d -v"
> tkill: setting prefix to (null)
> tkill: setting suffix to (null)
> tkill: got killname back: /tmp/lam-abaquspar_at_xeon3/lam-killfile
> tkill: removing socket file ...
> tkill: socket file: /tmp/lam-abaquspar_at_xeon3/lam-kernel-socketd
> tkill: removing IO daemon socket file ...
> tkill: IO daemon socket file: /tmp/lam-abaquspar_at_xeon3/lam-io-socket
> tkill: f_kill = "/tmp/lam-abaquspar_at_xeon3/lam-killfile"
> tkill: killing LAM...
> tkill: killing PID (SIGHUP) 22404 ...
> tkill: killed
> tkill: all finished
> n0<19824> ssi:boot:rsh: successfully launched on n1 (xeon3)
> n0<19824> ssi:boot:base:linear: booting n2 (apidell)
> n0<19824> ssi:boot:rsh: starting wipe on (apidell)
> n0<19824> ssi:boot:rsh: starting on n2 (apidell): tkill -d -v
> n0<19824> ssi:boot:rsh: launching remotely
> n0<19824> ssi:boot:rsh: attempting to execute "rsh apidell -n echo
> $SHELL"
> n0<19824> ssi:boot:rsh: remote shell /bin/bash
> n0<19824> ssi:boot:rsh: attempting to execute "rsh apidell -n tkill
> -d -v"
> tkill: setting prefix to (null)
> tkill: setting suffix to (null)
> tkill: got killname back: /tmp/lam-abaquspar_at_apidell/lam-killfile
> tkill: removing socket file ...
> tkill: socket file: /tmp/lam-abaquspar_at_apidell/lam-kernel-socketd
> tkill: removing IO daemon socket file ...
> tkill: IO daemon socket file: /tmp/lam-abaquspar_at_apidell/lam-io-socket
> tkill: f_kill = "/tmp/lam-abaquspar_at_apidell/lam-killfile"
> tkill: nothing to kill: "/tmp/lam-abaquspar_at_apidell/lam-killfile"
> n0<19824> ssi:boot:rsh: successfully launched on n2 (apidell)
> n0<19824> ssi:boot:base:linear: finished
> n0<19824> ssi:boot:rsh: all RTE procs started
> n0<19824> ssi:boot:rsh: finalizing
> n0<19824> ssi:boot: Closing
> lamboot did NOT complete successfully
>
> abaquspar_at_pc5036:~> laminfo
> LAM/MPI: 7.0.4
> Prefix: /opt/lam
> Architecture: i686-pc-linux-gnu
> Configured by: root
> Configured on: Wed Feb 1 17:37:21 CET 2006
> Configure host: xeon2
> C bindings: yes
> C++ bindings: yes
> Fortran bindings: yes
> C profiling: yes
> C++ profiling: yes
> Fortran profiling: yes
> ROMIO support: yes
> IMPI support: no
> Debug support: no
> Purify clean: no
> SSI boot: globus (Module v0.5)
> SSI boot: rsh (Module v1.0)
> SSI coll: lam_basic (Module v7.0)
> SSI coll: smp (Module v1.0)
> SSI rpi: crtcp (Module v1.0.1)
> SSI rpi: lamd (Module v7.0)
> SSI rpi: sysv (Module v7.0)
> SSI rpi: tcp (Module v7.0)
> SSI rpi: usysv (Module v7.0)
> abaquspar_at_pc5036:~>
>
> Thanks
>
> _______________________________________________
> This list is archived at http://www.lam-mpi.org/MailArchives/lam/

-- 
{+} Jeff Squyres
{+} The Open MPI Project
{+} http://www.open-mpi.org/